Generative AI is the Interface for Software 2.0

Generative AI automates creativity. That’s one way of thinking about it, but it’s wrong. The issue is “automate”—what people think it means (automating jobs) is very different from what it actually does (automate tasks). Besides, the definition of a job is always changing. In fact, that’s typically the hallmark of a good job.

It’s also wrong in a different way. Automation conjures up ideas of servers blinking away without human involvement. But as every on-call engineer knows, it takes enormous effort to run systems that seem automatic. Engineering teams chip away at that human involvement, but thinking it can be solved completely is a pipe dream. AI is no different in that respect, but it does make programming meaningfully easier.

Andrej Karpathy introduced Software 2.0 roughly 5 years ago, defining a new way to program computers by instructing them through the weights of a neural network. It turns out that’s really hard to do, and an entire industry called MLOps was built to address it. That was helpful for growing the pool of engineers who could build AI systems, but I argue it was missing a critical feature: an interface.

Simple tasks done on a computer through manual effort, like clicking through screens and filling out forms, are often called usage. When we collect instructions for the computer to run, we call that programming. Generative AI brings a new interface for both.

What’s the difference? Is writing a Stable Diffusion prompt for generating an image more similar to writing a computer program or filling out the form to post this essay? Both control the computer and there are some instructions behind each action. I don’t think a great definition exists, and the best measure I can come up with is complexity of instruction. Clicking a checkbox or submit button communicates just one thing, so it counts as usage, whereas writing a spreadsheet formula communicates many, so it falls under programming. Imperfect, but it gets the job done.

I hope the above is useful, because I want to distinguish a usage scenario with generative AI from a programming one.

If a computer needs multiple steps to do something, each step needs to be correct. Say a robot coffee-maker gets some measurements wrong. Individually they are minor mistakes, but they add up to undrinkable amphetamine juice. We can either put a human between each step (usage), or put in the requisite work to ensure it will be correct without intervention (programming).

The more a human-computer interaction looks like programming, the more important correctness becomes. Searching for an article or generating blog ideas needs to be useful but not strictly correct. That’s not true if you’re handling a customer support question that pulls or disburses funds from an account.

You could say programming with generative models means perfecting a prompt to generate a beautiful image or answer a question appropriately. But using a single prompt for the entire task is an anti-pattern. There are multiple steps that all need to work properly for a correct outcome. It’s very difficult to see what’s going on and do something about it if they’re all baked into a single prompt.

Programming frameworks like LangChain and Primer try to address this issue, and to some extent ChatGPT does too. LangChain and Primer build programs where some steps are done with generative models and others are done with code. This is an important evolution because, as mentioned above, we need independent steps for inspection and correction. ChatGPT fits this model since the chat experience essentially splits a task into multiple steps. The difference is that all of its steps are done through a generative model.
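To make the idea of independent steps concrete, here is a minimal sketch in plain Python. It is not how LangChain or Primer work internally. The names fake_model, classify, route, and run are all hypothetical, and the model call is a stub that returns canned text.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[str], str]]

def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a generative model call; returns canned text."""
    return "refund request"

def classify(ticket: str) -> str:
    # Model step: ask the (stubbed) model to label the ticket.
    return fake_model(f"Classify this support ticket: {ticket}")

def route(label: str) -> str:
    # Code step: deterministic routing that is easy to test on its own.
    return "billing-queue" if "refund" in label else "general-queue"

def run(steps: List[Step], value: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Run each step in order and keep every intermediate output for inspection."""
    trace = []
    for name, fn in steps:
        value = fn(value)
        trace.append((name, value))
    return value, trace

result, trace = run([("classify", classify), ("route", route)], "I was charged twice")
for name, output in trace:
    print(f"{name}: {output}")
```

Because every step leaves an entry in the trace, a wrong outcome can be traced back to the step that produced it, which is hard to do when everything lives in one prompt.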

Writing instructions is half of programming. The other half is debugging, and breaking down programs is an important precursor to debugging because it improves our ability to inspect the program. Debugging AI models comes down to inspecting data, since data samples act as probes into how the model thinks. Looking at similar, counterfactual, or adversarial examples in conjunction with how predictions change can be a powerful method for understanding.

This is the inner loop of programming. The outer loop requires scaling the ability to write and debug programs quickly and without regression. It requires continuously gathering what we know about correctness into code changes and tests. In time our users become our testers, and we’ll need to turn their work into more tests. With normal software that ends up as a simple code change or unit/integration test. With AI models, that’s typically turned into training data, evaluation data, edge-case tests, and distributional tests.
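As a rough illustration of what edge-case tests and distributional tests might look like, here is a small sketch. The predict function is a hypothetical stand-in for a real model, and the example data is made up for the demonstration.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

def predict(text: str) -> str:
    """Hypothetical stand-in for a model: flags messages that mention refunds."""
    return "refund" if "refund" in text.lower() else "other"

# Edge cases gathered from user reports: (input, expected label).
edge_cases: List[Tuple[str, str]] = [
    ("Please REFUND my order", "refund"),
    ("Where is my package?", "other"),
]

def failing_edge_cases(
    model: Callable[[str], str], cases: List[Tuple[str, str]]
) -> List[Tuple[str, str, str]]:
    """Return (input, expected, actual) for every edge case the model gets wrong."""
    return [(text, want, model(text)) for text, want in cases if model(text) != want]

def label_shares(model: Callable[[str], str], texts: List[str]) -> Dict[str, float]:
    """Distributional check: the fraction of inputs assigned to each label."""
    counts = Counter(model(t) for t in texts)
    return {label: n / len(texts) for label, n in counts.items()}

sample = ["refund please", "hello", "thanks", "need a refund"]
failures = failing_edge_cases(predict, edge_cases)
shares = label_shares(predict, sample)
print(failures, shares)
```

An edge-case test fails on a specific input, while a distributional test flags a shift in the overall mix of predictions, for instance if the share of "refund" labels suddenly doubles after a model change.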

Turns out programming with AI is still a lot of work! We’re not at the point where every human is now a programmer, but it does open the door to more of them. The reason is subtle but clear: traditional programming requires learning the language of a computer, but AI programming only requires learning the language of data.

But the language of data is simply understanding how the world works. Programming AIs for contracts simply means defining objectives (extract the non-compete clause, parse dates and addresses, here’s what a signature block looks like) and iterating on data examples until it works. Programming AIs for self-driving cars, medical imaging, and a whole host of other things works the same. And there are a heck of a lot more people who understand the world through data than people who can write in traditional programming languages.

The real accomplishment of generative AI is opening computer programming to a lot more people, but there’s more work to do. What’s left to make it a reality?

(Something that I’m actively working on. :))

The most powerful application of language models is writing NLP apps

There’s a strong buzz around large language models (LLMs) like GPT-3 for building new NLP products. But there’s a big question of what kind of products make sense.

LLMs aren’t good enough to use straight out of the box, and may never be. But they can be powerful for users willing to iterate on outputs. In other words, they’re pretty great for writers.

There’s one kind of writing that really matters above the rest at this point in time: writing software. Like software, great writing offers great leverage when plugged into the right channels, but average writing ability is quite common while average software skills are not.

If software skills are hard to come by, natural language processing (NLP) skills are exceptionally rare. Contrast that with the enormous unrealized potential of NLP in our world. Consider how every aspect of our pre-metaverse Internet involves text, or how the lifeblood of every human organization on Earth flows in text form, whether that’s emails, Slack messages, or documents. And where there is text, there is knowledge.

If the usefulness of knowledge feels abstract, think about that one email you can’t seem to find, or how often you can Google something and find exactly what you need. If the computer is a bicycle for the mind, search is sticking a twin-turbo engine on it. Search is important, and good search requires good knowledge.

The reality of today is that knowledge locked away in text is largely untapped because NLP is some of the most challenging software to write. We’ll dig into why that is and how LLMs could change that.

Why NLP Software is Hard

The value of text data is gated by our ability to summarize it. Whether that is extracting information, like facts, events, or people, surfacing the most insightful information in a document, or answering questions using a collection of documents, it all boils down to summarization of some kind.

But NLP software is hard to write precisely because language is so rich and expressive. Writing down rules in code is very brittle, and while machine learning is more adaptive it is also very expensive and time-consuming. LLMs offer us a potentially easier way to interface with text for building NLP apps.

Let’s consider this example sentence from a medical note written by a doctor:

ROS: no nausea, rash, arthralagias, fever / chills, urinary symptoms

We want to extract the presence or explicit absence of each of the patient’s symptoms.

Detecting lists of symptoms can be tricky—perhaps finding comma-separated entities that match to a pre-existing list of symptoms can work. That’s very brittle (what about other symbols for separating lists, like semicolons?), and while a named-entity recognition model could do a better job it will be work to integrate your list, build label sets or rules for it, train it, and analyze its performance. Extracting entities isn’t that easy, and it takes significant work.
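Here is a minimal sketch of that comma-splitting approach, to show how quickly it breaks. The symptom list and the function name are assumptions made up for this example.

```python
from typing import List

# Hypothetical pre-existing symptom list (an assumption for this example).
KNOWN_SYMPTOMS = {"nausea", "rash", "fever", "chills", "urinary symptoms"}

def extract_symptoms(note: str) -> List[str]:
    """Split the note on commas and keep the pieces that exactly match a known symptom."""
    body = note.split(":", 1)[-1]  # drop the "ROS:" prefix
    candidates = [part.strip().lower() for part in body.split(",")]
    return [c for c in candidates if c in KNOWN_SYMPTOMS]

note = "ROS: no nausea, rash, arthralagias, fever / chills, urinary symptoms"
print(extract_symptoms(note))  # ['rash', 'urinary symptoms']
```

Only two of the symptoms come back. "no nausea" fails the exact match because of the negation, "fever / chills" fails because of the slash, and the spelling used in the note is not in the list.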

Then comes the pesky “no”. Is it referring to “nausea” or the whole list? Ideally modifiers like “no” would exist in front of each entity to resolve ambiguity, but that’s not the case so a simple rule of “does ‘no’ precede the entity” won’t work. Perhaps the linear data structure of the sentence is the issue, and using a more advanced structure based on grammar can help.
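The simple rule can be sketched in a few lines. The function name is hypothetical, and the sketch assumes the doctor meant "no" to apply to the whole list.

```python
import re

def negated_by_preceding_no(sentence: str, entity: str) -> bool:
    """Naive rule: an entity is negated only if the word 'no' comes directly before it."""
    pattern = rf"\bno\s+{re.escape(entity)}\b"
    return re.search(pattern, sentence, flags=re.IGNORECASE) is not None

sentence = "ROS: no nausea, rash, arthralagias, fever / chills, urinary symptoms"
for entity in ["nausea", "rash", "fever"]:
    print(entity, negated_by_preceding_no(sentence, entity))
# nausea True
# rash False
# fever False
```

Under that reading, the rule is right about nausea and wrong about everything after it.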

Maybe we can traverse this dependency tree to write a rule:

A dependency parse tree showing the different parts of speech and dependency labels for the above sentence.

The first step is to understand all of those labels. And that’s not even the hard part–knowing what rules can be effective comes through challenging and hard-won experience. For instance, observe the numerous differences in the parse tree just by adding “or” before “urinary symptoms” at the end:

A different dependency parse tree generated by adding a single word “or” to the previous example.

At this point we’re pretty far from what most engineers or data scientists understand about NLP.

Maybe we can just try ML. What does it take? Hire some medical annotators to write down a structured list of symptoms for each note. Good medical annotators are very expensive and hard to find, but it’s doable. Then train an off-the-shelf model to do the same. Not easy, it requires extensive infrastructure, but the knowledge is out there.

But then your only control mechanism for when it doesn’t work well enough is to collect more data (tuning the model is generally unproductive). It’s hard to know what data to label, it’s expensive to label it, testing is labor intensive and boring, and it’s very slow to iterate.

Seems like we’re stuck between a rock (classical NLP) and a hard place (ML).

LLMs Offer New Capabilities

If the example above were a little more structured, we could imagine building simpler rules to extract information.

For example, if the note was in this form:

Review of systems:
* No nausea
* No rash
* No arthralagias (joint pains)
* No fever or chills
* No urinary symptoms

It would be a lot easier to work with. Just split on the newlines, remove asterisks, match entities against a list, and detect negations.
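Those four steps can be sketched roughly as follows. The symptom list is an assumption for the example, it reuses the spelling that appears in the note, and the function name is hypothetical.

```python
from typing import Dict

# Hypothetical symptom list; "arthralagias" is spelled as it appears in the note.
KNOWN_SYMPTOMS = ["nausea", "rash", "arthralagias", "fever", "chills", "urinary symptoms"]

def parse_review(note: str) -> Dict[str, bool]:
    """Map each known symptom found in the note to True (present) or False (negated)."""
    findings: Dict[str, bool] = {}
    for line in note.splitlines():           # split on the newlines
        item = line.strip().lstrip("*").strip()  # remove asterisks
        if not item or item.endswith(":"):
            continue                          # skip blank lines and the heading
        negated = item.lower().startswith("no ")  # detect negations
        for symptom in KNOWN_SYMPTOMS:        # match entities against the list
            if symptom in item.lower():
                findings[symptom] = not negated
    return findings

note = """Review of systems:
* No nausea
* No rash
* No arthralagias (joint pains)
* No fever or chills
* No urinary symptoms"""

print(parse_review(note))
```

On this note every symptom maps to False, meaning each one was explicitly ruled out. The substring match is still crude, but the structure of the text does most of the work.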

Well, the above was actually generated using OpenAI’s text-davinci-edit-001 model with the following prompt:

Expand abbreviations
No slashes
Clarify negations
Turn into bullet list

(You can try this yourself here.)

There’s a fair amount of variability with that prompt, but trying a simpler one like ROS -> Review of Systems leads to very consistent results.

Let’s run with that idea. Imagine creating a pipeline of transformations using an LLM with natural language prompts (like we just did) as a way to bootstrap an NLP app. It may contain errors, but that’s not new. Whether it’s LLMs, ML systems, or classical NLP, these transformations will require systematic and continuous testing, evaluation, error analysis, and monitoring for the entire life of the app.
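A pipeline like that might be sketched as below. The edit function is passed in as a parameter so a real LLM call could be swapped in. The toy_edit function here is a hypothetical stand-in that handles two instructions by hand and does not call any model.

```python
from typing import Callable, List

# An edit function takes (text, instruction) and returns the rewritten text.
EditFn = Callable[[str, str], str]

def run_pipeline(text: str, instructions: List[str], edit: EditFn) -> List[str]:
    """Apply each instruction in turn and return every intermediate version for review."""
    versions = [text]
    for instruction in instructions:
        versions.append(edit(versions[-1], instruction))
    return versions

def toy_edit(text: str, instruction: str) -> str:
    """Hypothetical stand-in for an LLM edit call; not a real API."""
    if instruction == "ROS -> Review of Systems":
        return text.replace("ROS", "Review of systems")
    if instruction == "No slashes":
        return text.replace(" / ", " or ")
    return text  # unknown instruction: leave the text unchanged

versions = run_pipeline(
    "ROS: no nausea, fever / chills",
    ["ROS -> Review of Systems", "No slashes"],
    toy_edit,
)
for version in versions:
    print(version)
```

Keeping every intermediate version is what makes the testing, evaluation, and error analysis mentioned above possible on a per-transformation basis.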

(Side bar: data infrastructure for NLP is severely lacking too, and that’s not a problem LLMs can solve. But I’ll address that another time.)

Notice how in the LLM case you don’t need deep software or scientific skills to build that NLP pipeline. Working with LLMs is akin to learning to use Google search effectively.

A New Day for NLP

As an NLP practitioner, I often look at other data-intensive fields for inspiration. To take some examples, fraud detection and product analytics are mature enough that they have dedicated technical practitioners—fraud analysts, data analysts, and analytics engineers are new roles introduced to serve the growing demand. These roles straddle the boundaries of product management, data science, software engineering, and domain experience (such as consumer financial transactions) to become productive for those problems. They replace those larger interdisciplinary teams with a single person.

Can we accomplish the same in NLP?

Today the sheer complexity required to build NLP apps means a few expensive successes and a great many failures. They often require large teams of labelers, product managers, domain experts, NLP and ML scientists, data and ML engineers, and even the odd infrastructure or fullstack engineer.

In other words, NLP projects are some of the most expensive in all of software. Pulling the team together alone is a $2M+/year expenditure.

If developing NLP apps in a new way with LLMs can be made to work, we have the potential to leapfrog much of that cost and complexity. It won’t be perfect in the beginning, but neither is the old way.

And if it does work, we may find a new technical role emerge: the NLP analyst, where you won’t need a PhD in linguistics or expert ML engineering skills to do it. And the macroeconomic consequences of that will be extraordinary.

(If you find this future compelling, let’s chat.)

Labeling Data is a Mistake

The biggest problems in ML start with data. The best way to get good data is to build an app where users generate good data. Ideally this app is enough of a value add to be financially sustainable too.

Labeling is basically a hack to approximate that process. Instead of attracting users, you hire them for cheap. Instead of building a great product with your users in mind, you buy a labeling tool and shoehorn a problem into one of the templates. There’s no real customizability, and worse, there’s no empathy.

Aside from making labeling a terrible job, it also makes many ML problems intractable. Doctors, lawyers, and chemical engineers don’t sit down and label data one example at a time. Yet many who work on those problems understand much better ways of solving them. The best teams will build their product, processes, and organizations more seriously with the technology needed to do the job. (I believe this is a major way AI startups will differentiate themselves, but it will take time to see.)

One obvious barrier to doing better is cost. Building a full-fledged product is expensive and takes time. Strong product builders require a variety of skills and experience to do it well. ML teams often don’t get the resources they need, even if they have good product sense for the problem they’re solving.

One way through this problem is tools. Often ML engineers have a good enough view into the problem to do much better than a labeling tool. But the ML skillset rarely includes UI programming or backend engineering. It’s not trivial to build an app, but these days entire startups are built singlehandedly. And the space of what ML apps would do isn’t exceptionally large–to do better than an off-the-shelf labeling tool doesn’t take much.

I’ll have to stop here for now, as long as I want to keep my current work in stealth. 🙂

(But if this interests you, I’ll happily share my less-baked ideas.)