Generative AI is the Interface for Software 2.0

Generative AI automates creativity. That’s one way of thinking about it, but it’s wrong. The issue is “automate”: what people think it means (automating jobs) is very different from what it actually does (automating tasks). Besides, the definition of a job is always changing. In fact, that’s typically the hallmark of a good job.

It’s also wrong in a different way. Automation conjures up ideas of servers blinking away without human involvement. But as every on-call engineer knows, it takes enormous effort to run systems that seem automatic. Engineering teams chip away at that human involvement, but thinking it can be solved completely is a pipe dream. AI is no different in that respect, but it does make programming meaningfully easier.

Andrej Karpathy introduced Software 2.0 roughly five years ago, defining a new way to program computers by instructing them through the weights of a neural network. It turns out that’s really hard to do, and an entire industry called MLOps was built to address it. That was helpful for growing the pool of engineers who could build AI systems, but I argue it was missing a critical feature: an interface.

Simple tasks done on a computer through manual effort, like clicking through screens and filling out forms, are often called usage. When we collect instructions for the computer to run, we call that programming. Generative AI brings a new interface for both.

What’s the difference? Is writing a Stable Diffusion prompt for generating an image more similar to writing a computer program or to filling out the form to post this essay? Both control the computer, and there are instructions behind each action. I don’t think a great definition exists, and the best measure I can come up with is the complexity of the instruction. Clicking a checkbox or a submit button communicates just one thing, so it counts as usage, whereas writing a spreadsheet formula communicates many things, so it falls under programming. Imperfect, but it gets the job done.

I hope the distinction above is useful, because I want to distinguish a usage scenario with generative AI from a programming one.

If a computer needs multiple steps to do something, each step needs to be correct. Say a robot coffee-maker gets some measurements wrong. Individually they are minor mistakes, but they add up to undrinkable amphetamine juice. We can either put a human between each step (usage), or put in the requisite work to ensure each step will be correct without intervention (programming).

The more a human-computer interaction looks like programming, the more correctness matters. Searching for an article or generating blog ideas needs to be useful but not perfectly correct; that’s not true if you’re handling a support question that pulls or disburses funds from a customer’s account.

You could say programming with generative models means perfecting a prompt to generate a beautiful image or answer a question appropriately. But using a single prompt for the entire task is an anti-pattern. There are multiple steps that all need to work properly for a correct outcome. It’s very difficult to see what’s going on and do something about it if they’re all baked into a single prompt.

Programming frameworks like Langchain and Primer try to address this issue, and to some extent ChatGPT does too. Langchain and Primer build programs where some steps are done with generative models and others are done with code. This is an important evolution because, as mentioned above, we need independent steps for inspection and correction. ChatGPT fits this model since the chat experience essentially splits a task into multiple steps. The difference is that all of its steps are done through a generative model.
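To make the split concrete, here is a minimal sketch of such a program. The `call_model` function and its canned answers are hypothetical stand-ins for a real model API, not any particular framework’s interface; the point is the shape of the pipeline, where model steps and code steps alternate and each can be inspected on its own.

```python
def call_model(prompt: str) -> str:
    # Hypothetical model call, stubbed with canned answers for illustration.
    canned = {
        "classify: refund request": "billing",
        "draft reply for billing": "We have issued your refund.",
    }
    return canned.get(prompt, "unknown")

def handle_ticket(ticket: str) -> str:
    # Step 1 (model): classify the ticket.
    category = call_model(f"classify: {ticket}")
    # Step 2 (code): deterministic routing logic we can inspect and test.
    if category not in {"billing", "shipping"}:
        return "escalate to human"
    # Step 3 (model): draft a reply for the chosen category.
    return call_model(f"draft reply for {category}")

print(handle_ticket("refund request"))
```

Because the classification step is separate from the drafting step, a wrong answer can be traced to one of them, which is exactly what a single monolithic prompt makes impossible.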

Writing instructions is half of programming. The other half is debugging, and breaking down programs is an important precursor to debugging because it improves our ability to inspect the program. Debugging AI models comes down to inspecting data, since data samples act as probes into how the model thinks. Looking at similar, counterfactual, or adversarial examples in conjunction with how predictions change can be a powerful method for understanding.
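As a toy illustration of probing with counterfactual examples (the classifier here is a made-up stand-in, not a real model), flip one part of the input and watch how the prediction changes:

```python
def toy_model(text: str) -> str:
    # Hypothetical sentiment classifier used only to illustrate probing.
    return "positive" if "great" in text else "negative"

original = "the service was great"
counterfactual = "the service was slow"  # one word changed

# Comparing the two predictions tells us what the model is keying on.
print(toy_model(original), toy_model(counterfactual))
```

With a real model the same loop applies: perturb a data sample, compare predictions, and use the difference as a probe into what the model has learned.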

This is the inner loop of programming. The outer loop requires scaling the ability to write and debug programs quickly and without regression. It requires continuously gathering what we know about correctness into code changes and tests. In time our users become our testers, and we’ll need to turn their work into more tests. With normal software that ends up as a simple code change or unit/integration test. With AI models, that’s typically turned into training data, evaluation data, edge-case tests, and distributional tests.

Turns out programming with AI is still a lot of work! We’re not at the point where every human is now a programmer, but it does open the door to more of them. The reason is subtle but clear: traditional programming requires learning the language of a computer, but AI programming only requires learning the language of data.

But the language of data is simply understanding how the world works. Programming AIs for contracts simply means defining objectives (extract the non-compete clause, parse dates and addresses, here’s what a signature block looks like) and iterating on data examples until it works. Programming AIs for self-driving cars, medical imaging, and a whole host of other things works the same. And there are a heck of a lot more people who understand the world through data than people who can write in traditional programming languages.

The real accomplishment of generative AI is opening computer programming to a lot more people, but there’s more work to do. What’s left to make it a reality?

(Something that I’m actively working on. :))

Gating and Depth in Neural Networks

Depth is a critical part of modern neural networks. It enables efficient representations through compositions of hierarchical rules. By now we all know this, so I’ll assume I don’t need to convince anyone, but in case you need a refresher: many data distributions that appear in the wild cannot be modeled efficiently with a single function, or a few, without an exponential number of neurons. The distributions are simply too complex to model in such a direct way. Despite this, they do have structure; it just happens to be hierarchical. To think of it another way, imagine if the data-generating process were not hierarchical. Then generating a complex distribution would take exponential resources as well. No doubt there are processes like this. That said, one thing we know about the world is that it is made of simpler parts composed together, and put together they can produce extremely complicated behavior.

Unfortunately our network training algorithm, error backpropagation, doesn’t like it when things recurse too much. Consider a chain of linear neurons. If the (norm of the) weights of those neurons is not equal to one, the error signal will either successively shrink or grow. If it shrinks too fast we say it vanishes; if it grows too fast we say it explodes. The picture is roughly the same for neurons with nonlinear activation functions, except that we think about the norms of the Jacobians instead, but it’s basically the same deal. This scaling is a problem because by the time errors propagate backwards from the loss, the signal is way out of whack: the updates to the weights are useless and network training diverges.
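A toy calculation makes the scaling concrete. The function and numbers below are illustrative assumptions, not from any particular network: each linear layer in the chain multiplies the backpropagated signal by its weight, so the magnitude scales like the weight raised to the depth.

```python
def backprop_signal_norm(weight: float, depth: int, error: float = 1.0) -> float:
    # Backpropagate through `depth` identical linear neurons: each one
    # scales the error signal by its weight.
    signal = error
    for _ in range(depth):
        signal *= weight
    return abs(signal)

# |w| < 1: the signal vanishes; |w| > 1: it explodes.
shrink = backprop_signal_norm(0.9, 50)
grow = backprop_signal_norm(1.1, 50)
print(f"w=0.9: {shrink:.2e}, w=1.1: {grow:.2e}")
```

Even with weights quite close to one, fifty layers is enough to crush the signal by a factor of roughly two hundred or blow it up by a factor of roughly a hundred.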

Great, we’re all familiar with vanishing and exploding gradients. What do we do about it? The basic strategy is to make sure all that scaling doesn’t mess with the error signal while still enabling depth. We need to protect some of that signal.

In feed-forward networks (FFNs) we can accomplish this with skip connections. The simplest case is residual learning, where the output from a lower layer to a higher layer is added to that higher layer’s output. Then successive layers only have to learn an increment, since they are already “starting” from where the previous layer was. This residual is argued to be easier to learn than building a new transformation from scratch.
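In scalar form, the idea looks like this (the functions below are stand-ins for real layers, chosen only to show the shape of the computation):

```python
def residual_block(x, f):
    # Output = input + learned residual: the layer only models the increment.
    return x + f(x)

def plain_block(x, f):
    # Output must be rebuilt from scratch by the transformation itself.
    return f(x)

# If the best transformation is close to the identity, the residual
# function only needs to learn a small correction.
out = residual_block(2.0, lambda x: 0.1 * x)
print(out)  # → 2.2
```

The identity path carries the input through unchanged, so the layer’s job shrinks to modeling the difference between its input and the desired output.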

Another reason why this works is that residual connections actually enable a shorter path from a given layer to the output. All you need to do is travel through the residual connections of successive layers until you get to the end, skipping the layers in between. As it happens, deep residual networks actually have an effective depth that is significantly smaller than the specified depth because this is exactly what happens in training. In other words, we have avoided a lot of unfavorable transformations. This is implicitly happening in residual nets, but can be explicitly made so by having a connection between every layer and the output which is called a dense network.

By the way, it is possible that starting from the previous layer’s output actually makes it more difficult to learn the best transformation to come next. Or at least, it could be an inefficient use of many layers. So we introduce the concept of a gate: let’s learn a coefficient that attenuates how much the identity connection is used versus the normal stacked layers. Residual networks then become special cases of networks with weighted skip connections, called highway networks.
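In scalar form the gate is just a learned mixing coefficient. This is a sketch of highway-style mixing with made-up numbers, not a faithful reproduction of the highway network architecture:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def highway(x: float, h, gate_logit: float) -> float:
    # g near 1 uses the transformation h; g near 0 passes x straight through.
    g = sigmoid(gate_logit)
    return g * h(x) + (1.0 - g) * x

double = lambda x: 2.0 * x
print(highway(3.0, double, 10.0))   # gate open: mostly the transform (~6.0)
print(highway(3.0, double, -10.0))  # gate shut: mostly the identity (~3.0)
```

With the gate fixed open this reduces to a plain stacked layer, and an ungated 50/50-style sum recovers the residual connection as a special case.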

This is all very interesting because in a sense we have implemented a memory for FFNs, and with gating mechanisms we can learn when to access that memory and when to ignore it. You’ve created some representation at a lower layer and that signal can skip many layers and find itself somewhere else. The (learned) gate on that somewhere else layer tells you whether to access that info or not.

Let us think about a different problem now: sequence modeling. FFNs are great for this of course, and sequences are a nice way to justify depth. If you are using a convolutional FFN (as everyone does nowadays), you actually require a certain level of depth to take into account long-range correlations in your input. This is exceptionally obvious in language modeling where long range correlations are potentially really long. Then skip connections and also things like attention are acting as memory lookup mechanisms for more fine-grained information to service the higher level representations. It’s all very cool.

There is another fun way to look at this too, and that is with recurrent networks (RNNs). In that case, we have a single hidden layer that is applied to every element in a sequence, and then after the final element you backpropagate your error “through time”. So as you can imagine, our gradients can and do vanish by the time they get back to the beginning of the sequence. However, since we use the same weights at each sequence step, adding a skip connection doesn’t make sense here.

What the RNN people have done instead is to put in other little memory systems using a gating mechanism. Now you can protect some part of your signal by writing it to a special state somewhere and learn when to read it back. So what does effective depth mean for an RNN? Not sure it is really comparable beyond conceptually thinking of depth in the sequence direction, but at any rate it is an interesting analogy.
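Here is a minimal scalar sketch in the spirit of LSTM/GRU-style gating, not a faithful implementation of either: a write gate decides how much of each input to store in a protected cell state, and a forget gate decides how much of the old state to keep, so a value stored early in the sequence can survive many steps instead of being rescaled at every one.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def run_sequence(inputs, write_logits, forget_logits):
    cell = 0.0
    for x, wl, fl in zip(inputs, write_logits, forget_logits):
        write = sigmoid(wl)    # how much of the new input to store
        forget = sigmoid(fl)   # how much of the old state to keep
        cell = forget * cell + write * x
    return cell

# Store 5.0 at the first step, then keep the write gate shut: later
# inputs of 9.0 barely touch the protected state.
inputs = [5.0, 9.0, 9.0, 9.0]
write = [10.0, -10.0, -10.0, -10.0]  # open once, then shut
forget = [10.0] * 4                  # always keep the state
print(run_sequence(inputs, write, forget))  # stays close to 5.0
```

The gate logits here are hand-picked for illustration; in a real gated RNN they are learned functions of the input and hidden state.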

It is actually pretty funny to think about how we haven’t solved the vanishing or exploding gradients problem at all; we just sidestepped it. So now we can ask whether this is natural or weird. I happen to feel it is still weird that we can pull hierarchical representations out of thin air so well. It is interesting to note the effectiveness of intermediate losses, particularly in NLP applications where one has this kind of information thanks to the various parsers that can generate features for you to predict (e.g., part-of-speech tags, named entities).

It is pretty obvious why this works, and it is related to skip connections: you are anchoring your representations to something besides the final layer’s output. In the residual FFN case, the anchor is the lower layers’ outputs; an intermediate loss is no different in that it also supplies a signal you know a priori is a good representation to some degree (just like your inputs!).
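Concretely, an intermediate loss is just an extra weighted term injected partway down the stack. The weighting below is an arbitrary illustrative choice, not a recommended value:

```python
def total_loss(final_loss: float, aux_loss: float, aux_weight: float = 0.3) -> float:
    # The auxiliary term supplies gradient signal directly at an
    # intermediate layer, so it does not have to survive the full
    # backward path from the output.
    return final_loss + aux_weight * aux_loss

print(total_loss(1.0, 2.0))
```

The final-layer loss still drives the task, while the auxiliary term keeps the intermediate representations anchored to something known to be meaningful.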

It does seem for the moment that vanishing and exploding gradients are a fundamental, though not insurmountable, problem.