Generative AI is the Interface for Software 2.0

Generative AI automates creativity. That’s one way of thinking about it, but it’s wrong. The issue is the word “automate”: what people think it means (automating jobs) is very different from what it actually does (automating tasks). Besides, the definition of a job is always changing. In fact, that’s typically the hallmark of a good job.

It’s also wrong in a different way. Automation conjures up ideas of servers blinking away without human involvement. But as every on-call engineer knows, it takes enormous effort to run systems that seem automatic. Engineering teams chip away at that human involvement, but thinking it can be solved completely is a pipe dream. AI is no different in that respect, but it does make programming meaningfully easier.

Andrej Karpathy introduced Software 2.0 roughly five years ago, defining a new way to program computers by instructing them through the weights of a neural network. It turns out that’s really hard to do, and an entire industry called MLOps was built to address it. That was helpful for growing the pool of engineers who could build AI systems, but I argue it was missing a critical feature: an interface.

Doing simple tasks on a computer through manual effort, like clicking through screens and filling out forms, is often called usage. When we collect instructions for the computer to run, we call that programming. Generative AI brings a new interface for both.

What’s the difference? Is writing a Stable Diffusion prompt for generating an image more similar to writing a computer program or filling out the form to post this essay? Both control the computer and there are some instructions behind each action. I don’t think a great definition exists, and the best measure I can come up with is complexity of instruction. Clicking a checkbox or submit button communicates just one thing, so it counts as usage, whereas writing a spreadsheet formula communicates many, so it falls under programming. Imperfect, but it gets the job done.

The above is useful, I hope, because I want to distinguish a usage scenario with generative AI from a programming one.

If a computer needs multiple steps to do something, each step needs to be correct. Say a robot coffee-maker gets some measurements wrong. Individually the mistakes are minor, but together they add up to undrinkable amphetamine juice. We can either put a human between each step (usage), or put in the requisite work to ensure it will be correct without intervention (programming).
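To make the compounding concrete, here is a small sketch. It assumes, purely for illustration, that every step succeeds independently and with the same probability; the 95% figure is made up, not a measurement of any real system.

```python
def chain_success_probability(per_step_accuracy: float, num_steps: int) -> float:
    """Chance that every step in a chain is correct.

    Illustrative assumption: steps are independent and equally accurate.
    """
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be between 0 and 1")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return per_step_accuracy ** num_steps


# Hypothetical per-step accuracy of 95%, for chains of different lengths.
for steps in (1, 5, 10, 20):
    p = chain_success_probability(0.95, steps)
    print(f"{steps:>2} steps at 95% each: {p:.1%} chance the whole chain is right")
```

Under those assumptions, ten steps that are each right 95% of the time produce a fully correct result only about 60% of the time, which is why unattended multi-step work demands much more per-step correctness than supervised work.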

The more a human-computer interaction looks like programming, the more important correctness becomes. Searching for an article or generating blog ideas needs to be useful but not super correct. That’s not true if you’re handling a customer support question that pulls funds from an account or disburses them.

You could say programming with generative models means perfecting a prompt to generate a beautiful image or answer a question appropriately. But using a single prompt for the entire task is an anti-pattern. There are multiple steps that all need to work properly for a correct outcome. It’s very difficult to see what’s going on and do something about it if they’re all baked into a single prompt.

Programming frameworks like LangChain and Primer try to address this issue, and to some extent ChatGPT does too. LangChain and Primer build programs where some steps are done with generative models and others are done with code. This is an important evolution because, as mentioned above, we need independent steps for inspection and correction. ChatGPT fits this model since the chat experience essentially splits a task into multiple steps. The difference is that all of its steps are done through a generative model.
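Here is a rough sketch of what splitting a task into inspectable steps can look like. It does not use the LangChain or Primer APIs; `fake_generate`, `Trace`, `handle_ticket`, and the intent labels are all hypothetical names invented for this example, and the "model" just returns canned text so the code runs without a network call.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


def fake_generate(prompt: str) -> str:
    """Hypothetical stand-in for a generative model call (canned output)."""
    if prompt.startswith("Classify"):
        return "refund_request"
    return "We're sorry about the trouble. Your refund is on its way."


@dataclass
class Trace:
    """Keeps each step's name and output so a person can inspect them."""
    steps: List[Tuple[str, str]] = field(default_factory=list)

    def record(self, name: str, output: str) -> str:
        self.steps.append((name, output))
        return output


ALLOWED_INTENTS = {"refund_request", "billing_question", "other"}


def handle_ticket(ticket: str, generate: Callable[[str], str] = fake_generate) -> Trace:
    trace = Trace()
    # Step 1 (model): classify the ticket's intent.
    intent = trace.record("classify", generate(f"Classify this ticket: {ticket}").strip())
    # Step 2 (plain code): reject anything outside the known label set.
    if intent not in ALLOWED_INTENTS:
        intent = "other"
    trace.record("validate", intent)
    # Step 3 (model): draft a reply using the validated intent.
    trace.record("draft", generate(f"Write a reply to a {intent} ticket: {ticket}"))
    return trace


trace = handle_ticket("I was charged twice, please refund me.")
for name, output in trace.steps:
    print(f"{name}: {output}")
```

Because every step leaves a record, a bad final answer can be traced back to the step that produced it, and the validation step in the middle is ordinary code that can be unit tested.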

Writing instructions is half of programming. The other half is debugging, and breaking down programs is an important precursor to debugging because it improves our ability to inspect the program. Debugging AI models comes down to inspecting data, since data samples act as probes into how the model thinks. Looking at similar, counterfactual, or adversarial examples in conjunction with how predictions change can be a powerful method for understanding.
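As a minimal sketch of using data samples as probes, the code below compares a model's prediction on a base example against its predictions on hand-written variants. `toy_model` is a deliberately naive, hypothetical stand-in for a trained classifier, and `probe` is a name made up for this example.

```python
from typing import Callable, Dict, List


def toy_model(text: str) -> str:
    """Hypothetical stand-in for a trained classifier.

    Deliberately naive: it only looks for the word 'compete'.
    """
    return "non_compete" if "compete" in text.lower() else "other"


def probe(model: Callable[[str], str], base: str, variants: List[str]) -> List[Dict[str, object]]:
    """Report, for each variant, the prediction and whether it differs from the base."""
    base_prediction = model(base)
    report = []
    for variant in variants:
        prediction = model(variant)
        report.append({
            "text": variant,
            "prediction": prediction,
            "changed": prediction != base_prediction,
        })
    return report


base = "Employee shall not compete with the Company for two years."
variants = [
    # Similar meaning, different wording: the prediction should NOT change.
    "Employee shall not work for a rival of the Company for two years.",
    # Counterfactual, opposite meaning: the prediction SHOULD change.
    "Employee may compete with the Company at any time.",
]

for row in probe(toy_model, base, variants):
    print(row["changed"], row["prediction"], "|", row["text"])
```

Here the prediction flips on the paraphrase and stays put on the counterfactual, the opposite of what we want, which tells us this toy model is keying on a single word rather than on meaning.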

This is the inner loop of programming. The outer loop requires scaling the ability to write and debug programs quickly and without regression. It requires continuously gathering what we know about correctness into code changes and tests. In time our users become our testers, and we’ll need to turn their work into more tests. With normal software that ends up as a simple code change or unit/integration test. With AI models, that’s typically turned into training data, evaluation data, edge-case tests, and distributional tests.
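One way to picture the outer loop, as a sketch only: each user report becomes a permanent evaluation case, and every new model version is run against the full set to catch regressions. `EvalCase`, `run_eval`, and the two toy "model versions" are hypothetical names and stand-ins, not part of any real framework.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class EvalCase:
    input_text: str
    expected: str
    source: str  # where the case came from, e.g. "seed" or "user_report"


def run_eval(model: Callable[[str], str], cases: List[EvalCase]) -> Tuple[float, List[EvalCase]]:
    """Return (accuracy, failing cases) for a model over the evaluation set."""
    failures = [case for case in cases if model(case.input_text) != case.expected]
    accuracy = 1.0 - len(failures) / len(cases) if cases else 1.0
    return accuracy, failures


def model_v1(text: str) -> str:
    """Toy first version: says 'yes' whenever it sees 'agree'."""
    return "yes" if "agree" in text.lower() else "no"


def model_v2(text: str) -> str:
    """Toy second version: also handles simple negation."""
    lowered = text.lower()
    if "not agree" in lowered or "disagree" in lowered:
        return "no"
    return "yes" if "agree" in lowered else "no"


cases = [
    EvalCase("I agree", "yes", "seed"),
    EvalCase("No thanks", "no", "seed"),
]
# A user reports a wrong answer; it becomes a permanent evaluation case.
cases.append(EvalCase("I do not agree", "no", "user_report"))

for name, model in (("v1", model_v1), ("v2", model_v2)):
    accuracy, failures = run_eval(model, cases)
    print(name, f"accuracy={accuracy:.2f}", [f.input_text for f in failures])
```

With a real model the fix would usually come from new training data rather than an extra `if`, but the bookkeeping is the same: the reported case stays in the set so later versions cannot quietly break it again.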

Turns out programming with AI is still a lot of work! We’re not at the point where every human is now a programmer, but it does open the door to more of them. The reason is subtle but clear: traditional programming requires learning the language of a computer, but AI programming only requires learning the language of data.

But the language of data is simply understanding how the world works. Programming AIs for contracts simply means defining objectives (extract the non-compete clause, parse dates and addresses, here’s what a signature block looks like) and iterating on data examples until it works. Programming AIs for self-driving cars, medical imaging, and a whole host of other things works the same. And there are a heck of a lot more people who understand the world through data than people who can write in traditional programming languages.

The real accomplishment of generative AI is opening computer programming to a lot more people, but there’s more work to do. What’s left to make it a reality?

(Something that I’m actively working on. :))