The most powerful application of language models is writing NLP apps

There’s a strong buzz around large language models (LLMs) like GPT-3 for building new NLP products. But there’s a big question of what kind of products make sense.

LLMs aren’t good enough to use straight out of the box, and may never be. But they can be powerful for users willing to iterate on outputs. In other words, they’re pretty great for writers.

There’s one kind of writing that really matters above the rest at this point in time: writing software. Like software, great writing offers great leverage when plugged into the right channels, but average writing ability is quite common while average software skills are not.

If software skills are hard to come by, natural language processing (NLP) skills are exceptionally rare. Contrast that with the enormous unrealized potential of NLP in our world. Consider how every aspect of our pre-metaverse Internet involves text, or how the lifeblood of every human organization on Earth flows in text form, whether that’s emails, Slack messages, or documents. And where there is text, there is knowledge.

If the usefulness of knowledge feels abstract, think about that one email you can’t seem to find, or how often you can Google something and find exactly what you need. If the computer is a bicycle for the mind, search is sticking a twin-turbo engine on it. Search is important, and good search requires good knowledge.

The reality of today is that knowledge locked away in text is largely untapped because NLP is some of the most challenging software to write. We’ll dig into why that is and how LLMs could change that.

Why NLP Software is Hard

The value of text data is gated by our ability to summarize it. Whether that is extracting information, like facts, events, or people, surfacing the most insightful information in a document, or answering questions using a collection of documents, it all boils down to summarization of some kind.

But NLP software is hard to write precisely because language is so rich and expressive. Writing down rules in code is very brittle, and while machine learning is more adaptive it is also very expensive and time-consuming. LLMs offer us a potentially easier way to interface with text for building NLP apps.

Let’s consider this example sentence from a medical note written by a doctor:

ROS: no nausea, rash, arthralagias, fever / chills, urinary symptoms

We want to extract the existence or explicit inexistence of the patient’s symptoms.

Detecting lists of symptoms can be tricky. Perhaps finding comma-separated entities that match a pre-existing list of symptoms can work. But that’s very brittle (what about other separators, like semicolons?), and while a named-entity recognition model could do a better job, it takes work to integrate your list, build label sets or rules for it, train it, and analyze its performance. Extracting entities isn’t easy, whichever route you pick.
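To make the brittleness concrete, here is a minimal sketch of the comma-splitting approach. The symptom vocabulary and the function name are made up for illustration (a real list would be far larger), and the spelling of each symptom follows the note exactly.

```python
# Hypothetical vocabulary for illustration; spelled exactly as in the note.
KNOWN_SYMPTOMS = {"nausea", "rash", "arthralagias", "fever", "chills", "urinary symptoms"}


def match_comma_separated(sentence: str) -> list[str]:
    """Split on commas and keep only chunks that exactly match a known symptom."""
    _header, sep, body = sentence.partition(":")
    text = body if sep else sentence  # drop a leading "ROS:" style header if present
    chunks = [chunk.strip().lower() for chunk in text.split(",")]
    return [chunk for chunk in chunks if chunk in KNOWN_SYMPTOMS]


note = "ROS: no nausea, rash, arthralagias, fever / chills, urinary symptoms"
found = match_comma_separated(note)
print(found)  # ['rash', 'arthralagias', 'urinary symptoms']
```

Three of the six symptoms are lost: "nausea" because its chunk is "no nausea", and "fever" and "chills" because they share one slash-separated chunk. Each fix is another special case.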

Then comes the pesky “no”. Is it referring to “nausea” or the whole list? Ideally modifiers like “no” would exist in front of each entity to resolve ambiguity, but that’s not the case so a simple rule of “does ‘no’ precede the entity” won’t work. Perhaps the linear data structure of the sentence is the issue, and using a more advanced structure based on grammar can help.
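Here is a rough sketch of that "does 'no' precede the entity" rule, just to show where it falls over. The symptom list and function name are assumptions for illustration, not part of any library.

```python
import re

# Hypothetical symptom list for illustration; spelled exactly as in the note.
SYMPTOMS = ["nausea", "rash", "arthralagias", "fever", "chills", "urinary symptoms"]


def is_negated_naive(text: str, symptom: str) -> bool:
    """Return True only when the word "no" directly precedes the symptom."""
    pattern = r"\bno\s+" + re.escape(symptom) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


note = "ROS: no nausea, rash, arthralagias, fever / chills, urinary symptoms"
negated = {symptom: is_negated_naive(note, symptom) for symptom in SYMPTOMS}
print(negated)
```

Only "nausea" comes back as negated. If the "no" is meant to cover the whole list, which seems the likelier reading here, the rule gets five of the six symptoms wrong.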

Maybe we can traverse this dependency tree to write a rule:

A dependency parse tree showing the different parts of speech and dependency labels for the above sentence.

The first step is to understand all of those labels. And that’s not even the hard part–knowing what rules can be effective comes through challenging and hard-won experience. For instance, observe the numerous differences in the parse tree just by adding “or” before “urinary symptoms” at the end:

A different dependency parse tree generated by adding a single word “or” to the previous example.

At this point we’re pretty far from what most engineers or data scientists understand about NLP.

Maybe we can just try ML. What does it take? Hire some medical annotators to write down a structured list of symptoms for each note. Good medical annotators are very expensive and hard to find, but it’s doable. Then train an off-the-shelf model to do the same. Not easy, it requires extensive infrastructure, but the knowledge is out there.

But then your only control mechanism for when it doesn’t work well enough is to collect more data (tuning the model is generally unproductive). It’s hard to know what data to label, it’s expensive to label it, testing is labor intensive and boring, and it’s very slow to iterate.

Seems like we’re stuck between a rock (classical NLP) and a hard place (ML).

LLMs Offer New Capabilities

If the example above were a little more structured, we could imagine building simpler rules to extract information.

For example, if the note was in this form:

Review of systems:
* No nausea
* No rash
* No arthralagias (joint pains)
* No fever or chills
* No urinary symptoms

It would be a lot easier to work with. Just split on the newlines, remove asterisks, match entities against a list, and detect negations.
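As a minimal sketch of those four steps (split, strip asterisks, match, detect negation), assuming a small hypothetical symptom list and treating a leading "no" as the only negation cue:

```python
import re

# Hypothetical vocabulary for illustration; spelled exactly as in the note.
KNOWN_SYMPTOMS = ["nausea", "rash", "arthralagias", "fever", "chills", "urinary symptoms"]


def parse_review_of_systems(text: str) -> dict[str, bool]:
    """Map each matched symptom to True (present) or False (explicitly absent)."""
    findings: dict[str, bool] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("*"):
            continue  # skip the header and any blank lines
        item = line.lstrip("*").strip().lower()
        negated = re.match(r"no\b", item) is not None  # bullet starts with "no"
        for symptom in KNOWN_SYMPTOMS:
            if re.search(r"\b" + re.escape(symptom) + r"\b", item):
                findings[symptom] = not negated
    return findings


note = """Review of systems:
* No nausea
* No rash
* No arthralagias (joint pains)
* No fever or chills
* No urinary symptoms"""

findings = parse_review_of_systems(note)
print(findings)
```

Every symptom maps to False, meaning explicitly absent, including both "fever" and "chills" from the shared bullet. The rules stay this simple only because the input is already structured.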

Well, the above was actually generated using OpenAI’s text-davinci-edit-001 model with the following prompt:

Expand abbreviations
No slashes
Clarify negations
Turn into bullet list

(You can try this yourself here.)

There’s a fair amount of variability with that prompt, but trying a simpler one like “ROS -> Review of Systems” leads to very consistent results.

Let’s run with that idea. Imagine creating a pipeline of transformations using an LLM with natural language prompts (like we just did) as a way to bootstrap an NLP app. It may contain errors, but that’s not new. Whether it’s LLMs, ML systems, or classical NLP, these transformations will require systematic and continuous testing, evaluation, error analysis, and monitoring for the entire life of the app.
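One way to picture such a pipeline is the sketch below. It makes no claim about any particular LLM API: the model call is an injected function (`edit_fn`, a name invented here), and `fake_edit` is a stand-in that handles a single instruction so the example runs offline. Keeping every intermediate output is a design choice that makes the later error analysis easier.

```python
from typing import Callable

# An edit function takes (text, instruction) and returns the edited text.
# In practice it would wrap a call to an LLM; here it is left abstract.
EditFn = Callable[[str, str], str]


def run_pipeline(text: str, instructions: list[str], edit_fn: EditFn) -> list[tuple[str, str]]:
    """Apply each instruction in order, recording every intermediate output."""
    trace: list[tuple[str, str]] = []
    current = text
    for instruction in instructions:
        current = edit_fn(current, instruction)
        trace.append((instruction, current))
    return trace


def fake_edit(text: str, instruction: str) -> str:
    """Offline stand-in for an LLM call; it only understands one instruction."""
    if instruction == "ROS -> Review of Systems":
        return text.replace("ROS", "Review of Systems")
    return text  # unknown instructions leave the text unchanged


trace = run_pipeline(
    "ROS: no nausea, rash",
    ["ROS -> Review of Systems", "Turn into bullet list"],
    fake_edit,
)
final_text = trace[-1][1]
print(final_text)  # Review of Systems: no nausea, rash
```

Swapping `fake_edit` for a real model call leaves the pipeline unchanged, and the recorded trace gives you something concrete to test and monitor at each step.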

(Side bar: data infrastructure for NLP is severely lacking too, and that’s not a problem LLMs can solve. But I’ll address that another time.)

Notice how in the LLM case you don’t need deep software or scientific skills to build that NLP pipeline. Working with LLMs is akin to learning to use Google search effectively.

A New Day for NLP

As an NLP practitioner, I often look at other data-intensive fields for inspiration. Fraud detection and product analytics, for example, are mature enough to have dedicated technical practitioners: fraud analysts, data analysts, and analytics engineers are new roles introduced to serve the growing demand. These roles straddle the boundaries of product management, data science, software engineering, and domain experience (such as consumer financial transactions), which is what makes them productive on those problems. They replace a larger interdisciplinary team with a single person.

Can we accomplish the same in NLP?

Today the sheer complexity required to build NLP apps means a few expensive successes and a great many failures. They often require large teams of labelers, product managers, domain experts, NLP and ML scientists, data and ML engineers, and even the odd infrastructure or fullstack engineer.

In other words, NLP projects are some of the most expensive in all of software. Pulling the team together is a $2M+/year expenditure on its own.

If developing NLP apps in a new way with LLMs can be made to work, we have the potential to leapfrog much of that cost and complexity. It won’t be perfect in the beginning, but neither is the old way.

And if it does work, we may find a new technical role emerge: the NLP analyst, where you won’t need a PhD in linguistics or expert ML engineering skills to do it. And the macroeconomic consequences of that will be extraordinary.

(If you find this future compelling, let’s chat.)

Labeling Data is a Mistake

The biggest problems in ML start with data. The best way to get good data is to build an app where users generate good data. Ideally this app is enough of a value add to be financially sustainable too.

Labeling is basically a hack to approximate that process. Instead of attracting users, you hire them for cheap. Instead of building a great product with your users in mind, you buy a labeling tool and shoehorn a problem into one of the templates. There’s no real customizability, and worse, there’s no empathy.

Aside from making labeling a terrible job, it also makes many ML problems intractable. Doctors, lawyers, and chemical engineers don’t sit down and label data one at a time. Yet many who work on those problems understand much better ways of solving them. The best teams will build their product, processes, and organizations more seriously with the technology needed to do the job. (I believe this is a major way AI startups will differentiate themselves, but it will take time to see.)

One obvious barrier to doing better is cost. Building a full-fledged product is expensive and takes time. Strong product builders require a variety of skills and experience to do it well. ML teams often don’t get the resources they need, even if they have good product sense for the problem they’re solving.

One way through this problem is tools. Often ML engineers have a good enough view into the problem to do much better than a labeling tool. But the ML skillset rarely includes UI programming or backend engineering. It’s not trivial to build an app, but these days entire startups are built singlehandedly. And the space of what ML apps would do isn’t exceptionally large–to do better than an off-the-shelf labeling tool doesn’t take much.

I’ll have to stop here for now, as long as I want to keep my current work in stealth. 🙂

(But if this interests you, I’ll happily share my less-baked ideas.)