The most powerful application of language models is writing NLP apps

There’s a strong buzz around large language models (LLMs) like GPT-3 for building new NLP products. But there’s a big question of what kind of products make sense.

LLMs aren’t good enough to use straight out of the box, and may never be. But they can be powerful for users willing to iterate on outputs. In other words, they’re pretty great for writers.

There’s one kind of writing that really matters above the rest at this point in time: writing software. Like software, great writing offers great leverage when plugged into the right channels, but average writing ability is quite common while average software skills are not.

If software skills are hard to come by, natural language processing (NLP) skills are exceptionally rare. Contrast that with the enormous unrealized potential of NLP in our world. Consider how every aspect of our pre-metaverse Internet involves text, or how the lifeblood of every human organization on Earth flows in text form, whether that’s emails, Slack messages, or documents. And where there is text, there is knowledge.

If the usefulness of knowledge feels abstract, think about that one email you can’t seem to find, or how often you can Google something and find exactly what you need. If the computer is a bicycle for the mind, search is sticking a twin-turbo engine on it. Search is important, and good search requires good knowledge.

The reality of today is that knowledge locked away in text is largely untapped because NLP is some of the most challenging software to write. We’ll dig into why that is and how LLMs could change that.

Why NLP Software is Hard

The value of text data is gated by our ability to summarize it. Whether that is extracting information, like facts, events, or people, surfacing the most insightful information in a document, or answering questions using a collection of documents, it all boils down to summarization of some kind.

But NLP software is hard to write precisely because language is so rich and expressive. Writing down rules in code is very brittle, and while machine learning is more adaptive it is also very expensive and time-consuming. LLMs offer us a potentially easier way to interface with text for building NLP apps.

Let’s consider this example sentence from a medical note written by a doctor:

ROS: no nausea, rash, arthralagias, fever / chills, urinary symptoms

We want to extract the presence, or explicitly noted absence, of the patient’s symptoms.

Detecting lists of symptoms can be tricky. Perhaps finding comma-separated entities that match against a pre-existing list of symptoms would work, but that’s very brittle (what about other symbols for separating lists, like semicolons?). A named-entity recognition model could do a better job, but it takes real work to integrate your list, build label sets or rules for it, train it, and analyze its performance. Extracting entities isn’t that easy, and it takes significant effort.
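To make the brittleness concrete, here’s a minimal sketch of the comma-splitting approach (the symptom list and function names are hypothetical, for illustration only):

```python
# Brittle rule-based extraction: split on commas and match against a
# hand-maintained list. Breaks the moment the separator changes.
KNOWN_SYMPTOMS = {"nausea", "rash", "arthralagias", "fever / chills", "urinary symptoms"}

def extract_symptoms(note: str) -> list[str]:
    body = note.split(":", 1)[-1]                        # drop the "ROS:" header
    candidates = [c.strip() for c in body.split(",")]    # commas only!
    # Strip a leading "no" so "no nausea" still matches the list.
    cleaned = [c[3:].strip() if c.startswith("no ") else c for c in candidates]
    return [c for c in cleaned if c in KNOWN_SYMPTOMS]

print(extract_symptoms("ROS: no nausea, rash, arthralagias, fever / chills, urinary symptoms"))
# Works here, but a semicolon-separated note comes back empty.
print(extract_symptoms("ROS: no nausea; rash"))
```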

Then comes the pesky “no”. Is it referring to “nausea” or the whole list? Ideally a modifier like “no” would appear in front of each entity to resolve the ambiguity, but that’s not the case here, so a simple rule like “does ‘no’ precede the entity” won’t work. Perhaps the linear structure of the sentence is the issue, and a richer structure based on grammar can help.
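You can see the failure directly (a sketch; names are illustrative): checking whether “no” immediately precedes each entity flags only “nausea” as negated, even though the doctor almost certainly meant the whole list.

```python
NOTE = "ROS: no nausea, rash, arthralagias, fever / chills, urinary symptoms"
SYMPTOMS = ["nausea", "rash", "arthralagias", "fever / chills", "urinary symptoms"]

def is_negated(note: str, symptom: str) -> bool:
    # Naive rule: a symptom is negated iff "no " directly precedes it.
    return ("no " + symptom) in note

for s in SYMPTOMS:
    print(s, "absent" if is_negated(NOTE, s) else "present")
# Only "nausea" is marked absent; the other four are wrongly "present".
```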

Maybe we can traverse this dependency tree to write a rule:

A dependency parse tree showing the different parts of speech and dependency labels for the above sentence.

The first step is to understand all of those labels. And that’s not even the hard part: knowing which rules will be effective comes only through challenging, hard-won experience. For instance, observe how much the parse tree changes just by adding “or” before “urinary symptoms” at the end:

A different dependency parse tree generated by adding the single word “or” to the previous example.

At this point we’re pretty far from what most engineers or data scientists understand about NLP.

Maybe we can just try ML. What does it take? Hire some medical annotators to write down a structured list of symptoms for each note. Good medical annotators are very expensive and hard to find, but it’s doable. Then train an off-the-shelf model to do the same. Not easy, it requires extensive infrastructure, but the knowledge is out there.

But then your only control mechanism for when it doesn’t work well enough is to collect more data (tuning the model is generally unproductive). It’s hard to know what data to label, it’s expensive to label it, testing is labor intensive and boring, and it’s very slow to iterate.

Seems like we’re stuck between a rock (classical NLP) and a hard place (ML).

LLMs Offer New Capabilities

If the example above were a little more structured, we could imagine building simpler rules to extract information.

For example, if the note was in this form:

Review of systems:
* No nausea
* No rash
* No arthralagias (joint pains)
* No fever or chills
* No urinary symptoms

It would be a lot easier to work with. Just split on the newlines, remove asterisks, match entities against a list, and detect negations.
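With that structure, the whole extractor fits in a few lines (a sketch; names are illustrative):

```python
STRUCTURED_NOTE = """Review of systems:
* No nausea
* No rash
* No arthralagias (joint pains)
* No fever or chills
* No urinary symptoms"""

def parse_note(note: str) -> dict[str, bool]:
    findings = {}
    for line in note.splitlines():
        line = line.strip()
        if not line.startswith("*"):
            continue                        # skip the section header
        item = line.lstrip("* ").strip()
        negated = item.lower().startswith("no ")
        symptom = item[3:] if negated else item
        findings[symptom] = not negated     # True = symptom present
    return findings

print(parse_note(STRUCTURED_NOTE))
```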

Well, that structured note was actually generated using OpenAI’s text-davinci-edit-001 model with the following prompt:

Expand abbreviations
No slashes
Clarify negations
Turn into bullet list

(You can try this yourself with OpenAI’s API.)

There’s a fair amount of variability with that prompt, but a simpler edit like “ROS -> Review of Systems” gives very consistent results.

Let’s run with that idea. Imagine creating a pipeline of transformations using an LLM with natural language prompts (like we just did) as a way to bootstrap an NLP app. It may contain errors, but that’s not new. Whether it’s LLMs, ML systems, or classical NLP, these transformations will require systematic and continuous testing, evaluation, error analysis, and monitoring for the entire life of the app.
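As a sketch of that pipeline idea: each stage is a natural-language instruction applied by the model, with ordinary code between stages. Everything here is hypothetical scaffolding, not a real client; `llm_edit` stands in for an actual API call (e.g., to an edits-style endpoint), and `fake_llm_edit` is a fake just good enough to show the plumbing.

```python
from typing import Callable

def make_pipeline(llm_edit: Callable[[str, str], str], instructions: list[str]):
    """Chain natural-language edit instructions into one transformation."""
    def run(text: str) -> str:
        for instruction in instructions:
            text = llm_edit(text, instruction)
        return text
    return run

# A fake "model" for illustration; a real one would call an LLM API.
def fake_llm_edit(text: str, instruction: str) -> str:
    if instruction == "Expand abbreviations":
        return text.replace("ROS", "Review of systems")
    if instruction == "Turn into bullet list":
        head, _, rest = text.partition(":")
        items = [i.strip() for i in rest.split(",")]
        return head + ":\n" + "\n".join("* " + i for i in items)
    return text  # unknown instruction: pass through

pipeline = make_pipeline(fake_llm_edit, ["Expand abbreviations", "Turn into bullet list"])
print(pipeline("ROS: no nausea, rash"))
```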

(Side bar: data infrastructure for NLP is severely lacking too, and that’s not a problem LLMs can solve. But I’ll address that another time.)

Notice how in the LLM case you don’t need deep software or scientific skills to build that NLP pipeline. Working with LLMs is akin to learning to use Google search effectively.

A New Day for NLP

As an NLP practitioner, I often look at other data-intensive fields for inspiration. Fraud detection and product analytics, for example, are mature enough to have dedicated technical practitioners: fraud analysts, data analysts, and analytics engineers are new roles introduced to serve the growing demand. These roles straddle product management, data science, software engineering, and domain expertise (such as consumer financial transactions), replacing larger interdisciplinary teams with a single person.

Can we accomplish the same in NLP?

Today the sheer complexity required to build NLP apps means a few expensive successes and a great many failures. They often require large teams of labelers, product managers, domain experts, NLP and ML scientists, data and ML engineers, and even the odd infrastructure or fullstack engineer.

In other words, NLP projects are some of the most expensive in all of software. Just pulling the team together alone is a $2m+/year expenditure.

If developing NLP apps in a new way with LLMs can be made to work, we have the potential to leapfrog much of that cost and complexity. It won’t be perfect in the beginning, but neither is the old way.

And if it does work, we may find a new technical role emerge: the NLP analyst, for which you won’t need a PhD in linguistics or expert ML engineering skills. And the macroeconomic consequences of that will be extraordinary.

(If you find this future compelling, let’s chat.)

Labeling Data is a Mistake

The biggest problems in ML start with data. The best way to get good data is to build an app where users generate good data. Ideally this app is enough of a value add to be financially sustainable too.

Labeling is basically a hack to approximate that process. Instead of attracting users, you hire them for cheap. Instead of building a great product with your users in mind, you buy a labeling tool and shoehorn a problem into one of the templates. There’s no real customizability, and worse, there’s no empathy.

Aside from making labeling a terrible job, it also makes many ML problems intractable. Doctors, lawyers, and chemical engineers don’t sit down and label data points one at a time, yet many who work on those problems understand much better ways of solving them. The best teams will build their products, processes, and organizations around the technology needed to do the job. (I believe this is a major way AI startups will differentiate themselves, but it will take time to see.)

One obvious barrier to doing better is cost. Building a full-fledged product is expensive and takes time. Strong product builders require a variety of skills and experience to do it well. ML teams often don’t get the resources they need, even if they have good product sense for the problem they’re solving.

One way through this problem is tools. Often ML engineers have a good enough view into the problem to do much better than a labeling tool. But the ML skillset rarely includes UI programming or backend engineering. It’s not trivial to build an app, but these days entire startups are built singlehandedly. And the space of what ML apps would do isn’t exceptionally large; doing better than an off-the-shelf labeling tool doesn’t take much.

I’ll have to stop here for now, as long as I want to keep my current work in stealth. 🙂

(But if this interests you, I’ll happily share my less-baked ideas.)

Gating and Depth in Neural Networks

Depth is a critical part of modern neural networks: it enables efficient representations through hierarchies of rules. By now we all know this, so I’ll assume I don’t need to convince anyone, but in case you need a refresher: many data distributions that appear in the wild cannot be modeled efficiently with a single function, or a few functions, without exponential numbers of neurons. The distributions are simply too complex to model in such a direct way. Despite this, they do have structure; it just happens to be hierarchical. To think of it another way, imagine if the data-generating process were not hierarchical. Then generating a complex distribution would take exponential resources as well, and no doubt some processes are like this. But one thing we know about the world is that it is made of simple parts composed together, and put together they can produce extremely complicated behavior.

Unfortunately our network training algorithm, error backpropagation, doesn’t like it when things recurse too much. Consider a chain of linear neurons. If the (norms of the) weights of those neurons are not equal to one, the error signal will successively shrink or grow as it passes through them. If it shrinks too fast we say it vanishes; if it grows too fast we say it explodes. The picture is roughly the same for neurons with nonlinear activation functions, except that we think about the norms of the Jacobians instead, but it’s basically the same deal. This scaling is not cool because by the time errors have propagated backwards from the loss, the signal is way out of whack, the weight updates are useless, and training diverges.
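A two-line numerical sketch of that linear-chain picture: the backpropagated signal is scaled by the weight at every layer, so anything other than 1.0 compounds exponentially with depth.

```python
# Gradient magnitude after backpropagating through a 50-layer chain of
# linear neurons, each with scalar weight w: |grad| ~ w ** depth.
depth = 50
for w in (0.9, 1.0, 1.1):
    print(f"w={w}: signal scaled by {w ** depth:.3g}")
# w=0.9 shrinks the signal by ~200x (vanishing); w=1.1 grows it by ~100x (exploding).
```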

Great, we’re all familiar with vanishing and exploding gradients. What do we do about it? The basic strategy is to make sure all that scaling doesn’t mess with the error signal while still enabling depth. We need to protect some of that signal.

In feed-forward networks (FFNs) we can accomplish this with skip connections. The simplest case is residual learning, where the output of a lower layer is added to the output of a higher layer. Successive layers then only have to learn an increment, since they already “start” from where the previous layer left off. This residual is argued to be easier to learn than building a new transformation from scratch.
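A minimal numpy sketch of a residual block, assuming a simple fully connected transformation F (shapes and initialization are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8
W = rng.normal(scale=0.1, size=(dim, dim))  # small weights: F starts near zero

def residual_block(x: np.ndarray) -> np.ndarray:
    # The layer only learns the increment F(x); the identity path carries
    # the signal (and, in training, the gradient) through untouched.
    fx = np.tanh(x @ W)
    return x + fx

x = rng.normal(size=dim)
y = residual_block(x)
# With small weights the block stays close to the identity.
print(np.abs(y - x).max())
```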

Another reason this works is that residual connections create a shorter path from any given layer to the output: just travel along the residual connections of successive layers, skipping the layers in between. As it happens, deep residual networks have an effective depth significantly smaller than their specified depth because this is exactly what happens in training. In other words, we have avoided a lot of unfavorable transformations. This happens implicitly in residual nets, but can be made explicit by connecting every layer to every later layer, as in densely connected networks.

By the way, it is possible that starting from the previous layer’s output actually makes it harder to learn the best transformation to come next, or is at least an inefficient use of many layers. So we introduce the concept of a gate: let’s learn a coefficient that attenuates how much the identity connection is used versus the normal stacked layers. Residual networks then become special cases of networks with weighted skip connections, called highway networks.
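A highway-style layer in numpy (a sketch; shapes and initialization are arbitrary): a learned gate blends the transformed path with the identity path, so a gate near 1 gives a plain layer and a gate near 0 a pure skip.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_h, W_t, b_t):
    h = np.tanh(x @ W_h)           # the normal stacked transformation H(x)
    g = sigmoid(x @ W_t + b_t)     # transform gate T(x), learned in training
    return g * h + (1.0 - g) * x   # blend: g -> 0 passes x straight through

rng = np.random.default_rng(1)
dim = 4
x = rng.normal(size=dim)
W_h, W_t = rng.normal(size=(dim, dim)), rng.normal(size=(dim, dim))
# A strongly negative gate bias drives T(x) toward 0: a near-identity layer.
y = highway_layer(x, W_h, W_t, b_t=-20.0)
print(np.abs(y - x).max())  # tiny: the layer just carries x through
```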

This is all very interesting because in a sense we have implemented a memory for FFNs, and with gating mechanisms we can learn when to access that memory and when to ignore it. You’ve created some representation at a lower layer and that signal can skip many layers and find itself somewhere else. The (learned) gate on that somewhere else layer tells you whether to access that info or not.

Let us think about a different problem now: sequence modeling. FFNs work well here too, of course, and sequences are a nice way to justify depth. If you are using a convolutional FFN (as everyone does nowadays), you need a certain depth just to take long-range correlations in your input into account. This is especially obvious in language modeling, where long-range correlations can be really long. Skip connections, and also mechanisms like attention, then act as memory lookups that supply fine-grained information to the higher-level representations. It’s all very cool.
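The depth requirement is easy to quantify for stacked convolutions: with kernel size k, stride 1, and no dilation, each layer grows the receptive field by k - 1, so seeing n tokens of context takes roughly (n - 1) / (k - 1) layers. A quick sketch:

```python
def receptive_field(num_layers: int, kernel_size: int = 3) -> int:
    # Stride-1, undilated convolutions: each layer adds (k - 1) positions.
    return 1 + num_layers * (kernel_size - 1)

# A single kernel-3 layer sees 3 tokens; a 100-token dependency
# needs about 50 stacked layers.
print(receptive_field(1))   # 3
print(receptive_field(50))  # 101
```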

There is another fun way to look at this, too: recurrent networks (RNNs). There, a single hidden layer is applied to every element of a sequence, and after the final element you backpropagate your error “through time”. As you can imagine, gradients can and do vanish by the time they get back to the beginning of the sequence. But since the same weights are used at every step, adding a skip connection doesn’t make sense here.

What the RNN people have done instead is build in little memory systems using gating mechanisms. You can protect part of your signal by writing it to a special state somewhere and learning when to read it back. So what does effective depth mean for an RNN? I’m not sure it’s really comparable, beyond conceptually thinking of depth in the sequence direction, but it’s an interesting analogy at any rate.
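A sketch of such a gated memory in the spirit of an LSTM or GRU cell (dimensions, initialization, and names are arbitrary): a write gate decides how much of a new candidate to store, so when the gate stays closed, the stored state, and the gradient path through it, survives across many steps untouched.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_step(state, x, W_c, W_g, b_g):
    inp = np.concatenate([state, x])
    candidate = np.tanh(inp @ W_c)        # proposed new memory content
    gate = sigmoid(inp @ W_g + b_g)       # write gate, learned in training
    # gate ~ 0 keeps the old state (and its gradient path) intact.
    return gate * candidate + (1.0 - gate) * state

rng = np.random.default_rng(2)
dim = 4
W_c = rng.normal(scale=0.5, size=(2 * dim, dim))
W_g = rng.normal(scale=0.5, size=(2 * dim, dim))

state = np.ones(dim)  # pretend this holds something worth remembering
for t in range(100):
    x = rng.normal(size=dim)
    # A very negative gate bias means "almost never overwrite".
    state = gated_step(state, x, W_c, W_g, b_g=-15.0)
print(state)  # still close to the initial ones vector after 100 steps
```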

It is actually pretty funny to think about how we haven’t solved the vanishing and exploding gradients problem at all; we just sidestepped it. So we can ask whether this is natural or weird. I happen to feel it is still weird that we can pull hierarchical representations out of thin air so well. It is also interesting to note the effectiveness of intermediate losses, particularly in NLP, where various parsers can generate features for you to predict (e.g., part-of-speech tags, named entities).

It is pretty obvious why this works, and it’s related to skip connections: you are anchoring your representations to something besides the final layer’s output. In the residual FFN case, that something is the lower layers’ outputs, which are likewise a signal you know a priori to be a good representation to some degree (just like your inputs!).

It does seem for the moment that vanishing and exploding gradients are a fundamental, though not insurmountable, problem.