Science fiction depicts AIs as computers that can think and take action like humans. This primer is about engineering that into real systems called agents.
AI has predominantly been about pattern recognition. Is this photo of a cat or a dog? Is that product review positive or negative? But these AIs don’t take action; that’s left to normal software.
Pattern generation has captured our imaginations with photorealistic images and human-like conversations. And what emerges from recognition and generation together is the ability to reason.
An AI that can recognize, reason, and generate is an AI that can make decisions. It is an AI that can take action.
But how does that work? What systems have been built, and what are their capabilities? What do they require?
And why now?
We’ll explore the sudden trend, motivate the anatomy of an agent from emergent AI capabilities, explain the hurdles for production, and discuss fruitful use-cases for agent-powered products.
Anatomy of an Agent
Agents are not new. They are seeing a resurgence because they’ve become orders of magnitude easier to build due to large language models (LLMs).
Before LLMs, agents relied on reinforcement learning (RL) models built on neural networks, like AlphaStar. These agents train on raw environment data and output actions directly. They are notoriously difficult to train and, like all neural networks, generally uninterpretable.
LLMs are neural networks too, but their chat-style prompting interface is a significant productivity boost. Instead of labeling data and training a model, one simply writes a prompt to create a new AI model.
The ability to steer models in natural language in a conversational way makes some basic but important functions possible:
- Reasoning: chaining together bits of logic
- Self-reflection: reasoning about their own “thought process”
- Tool-use: learning to use an external tool with a well-defined API, like a calculator or calendar app
(See the appendix for concrete examples.)
Combining them enables interesting capabilities (sketched in code after this list):
- Planning: turn a high-level goal into a series of tasks with reasoning (“This goal requires tasks A, B, and C”) and self-reflection (“Am I missing any steps?”)
- Memory: combining reasoning (“What information does task A require?”) and tool-use (“Which tool can I use to access this information?”) means we can retrieve information from dedicated memory (databases)
- Action: interact with the environment using reasoning (“How can I accomplish task A?”) and tool-use (“Which tool can I use to solve this task?”)
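To make the combination concrete, below is a minimal sketch (in Python-style pseudo-code, like the appendix) of a planning routine built from a reasoning prompt plus a self-reflection prompt. The llm() helper, the prompts, and the revision loop are hypothetical stand-ins, not a prescribed implementation.

def llm(prompt: str) -> str:
    """Hypothetical wrapper around any chat-style LLM API."""
    raise NotImplementedError

def plan_tasks(goal: str, max_revisions: int = 3) -> str:
    # Reasoning: ask the model to break the goal into tasks.
    tasks = llm(f"Break this goal into a numbered list of tasks:\n{goal}")
    for _ in range(max_revisions):
        # Self-reflection: ask the model to critique its own plan.
        critique = llm(
            f"Goal: {goal}\nPlan:\n{tasks}\n"
            "Are any steps missing or unnecessary? Reply OK if the plan is complete."
        )
        if critique.strip().startswith("OK"):
            break
        # Revise the plan using the critique.
        tasks = llm(
            f"Revise the plan.\nGoal: {goal}\nPlan:\n{tasks}\nCritique: {critique}"
        )
    return tasks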
Lilian Weng’s diagram below visualizes the architecture nicely.
[Diagram: Lilian Weng’s overview of an LLM-powered autonomous agent, with planning, memory, and tool-use components around an agent core.]
(See the appendix for a deeper explanation tying components together in pseudo-code.)
Production Agent Systems
Unlike research or demonstrations, production AI systems rely critically on data loops. AI systems are adaptive: for AI models to do their job, they (along with us, the developers) must continually discover what that job is.
Architecting a data loop requires:
- Qualitative Understanding: What is the right thing for the AI to do from a user’s perspective?
- Quantitative Understanding: How well did the AI do the right thing?
- Scientific Analysis: What errors did the AI make, and why?
- Engineering Improvement: How can we fix errors as systematically as possible?
This workflow is necessary for developing an AI model.
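As a rough illustration of the quantitative and scientific steps, a data loop usually begins with something like the evaluation sketch below: run the agent over cases with known-good outcomes, score it, and keep the failures for error analysis. run_agent(), the case format, and the exact-match metric are hypothetical placeholders.

def run_agent(task: str) -> str:
    """Hypothetical entry point to the agent under development."""
    raise NotImplementedError

def evaluate(cases: list[dict]) -> tuple[float, list[dict]]:
    # Quantitative understanding: how often did the agent do the right thing?
    failures = []
    for case in cases:
        output = run_agent(case["task"])
        if output != case["expected"]:  # exact match as a stand-in metric
            failures.append({"case": case, "output": output})
    accuracy = 1 - len(failures) / len(cases)
    # Scientific analysis starts from the failures: what errors, and why?
    return accuracy, failures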
Agents are complex systems of AI models. They are hard to build because developing AI models is hard, and an agent comprises many models and many programs.
There are some especially important problems to note:
- Hallucinations: LLMs are prone to making up facts. This hurts agents when the made-up facts are assumptions about resources under the agent’s control, or calls to tools that don’t exist. It also hampers a developer’s ability to debug the model.
- Error cascades: The internal reasoning of an agent is sequential, so one error can set the entire reasoning trajectory off-course.
- Feedback deprivation: Agent components won’t all have feedback data available for error discovery and improvement.
However, there are aspects of agents that make them easier to develop too:
- Interpretable reasoning: AI models like LLMs and RL agents execute logic via mathematical operations that are very hard to interpret. But LLM-based agents also reason in natural language, which enables any non-technical domain expert to understand, and even fix, the agent’s behavior. Below is a “reasoning trajectory” from a ReAct-based agent:
Question: Musician and satirist Allie Goertz wrote a song about the "The Simpsons" character Milhouse, who Matt Groening named after who?
Thought 1: The question simplifies to "The Simpsons" character Milhouse is named after who. I only need to search Milhouse and find who it is named after.
Action 1: Search[Milhouse]
Observation 1: Milhouse Mussolini Van Houten is a recurring character [...]
Thought 2: [...]
Each step in this reasoning trajectory is interpretable by a human. Importantly, if an error were spotted in any step, a human may intervene, correct that step, and continue running the agent to produce the correct answer. (A minimal sketch of this intervention pattern follows this list.)
- Symbolic components: Agents generally rely on databases for memory and symbolic programs for tools. This makes for interpretable error traces and gives the developer of the agent a chance to reason through them. It also bounds the lengths to which error cascades can fan out.
- Decomposability: As mentioned before, decomposing agents into parts has a number of advantages. A single component can be observed, isolated, and improved, in contrast with end-to-end AI models whose impenetrable internals are all coupled together. This makes agents much easier to debug.
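As promised above, here is a hypothetical sketch of the intervention pattern that interpretable reasoning makes possible: a reviewer corrects one step of a recorded trajectory, and the agent resumes from the repaired context. The Trajectory type and continue_agent() hook are illustrative names, not part of any particular framework.

from dataclasses import dataclass, field

@dataclass
class Trajectory:
    # Steps are plain text: "Thought ...", "Action ...", "Observation ..."
    steps: list = field(default_factory=list)

def continue_agent(trajectory: Trajectory) -> Trajectory:
    """Hypothetical hook that feeds a trajectory back into the agent loop."""
    raise NotImplementedError

def resume_with_correction(trajectory: Trajectory, index: int, corrected_step: str) -> Trajectory:
    # Keep everything before the faulty step, substitute the human correction,
    # and let the agent continue from the repaired context.
    repaired = Trajectory(steps=trajectory.steps[:index] + [corrected_step])
    return continue_agent(repaired)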
The emphasis in all three aspects is understanding. Why?
Interpretability is the ultimate unlock towards scalable, systematic understanding. It is the science of AI development.
And in the same spirit, so are symbolic components and decomposability. All three work in concert to make errors observable, understandable, isolatable, and therefore possible to intervene and prove causation. All of that capability makes it relatively easy to fix a single error, and to improve an entire collection of an agent’s operating trajectories. It’s the cascading errors problem run in reverse—fix one line of faulty logic and a great many reasoning paths are fixed too.
Importantly, interpretability enables non-scientists and even non-technical people with the right domain knowledge to contribute to improving the system in this leveraged way.
And with the right product design, that may even extend to users.
Agents As Products
Many problems for which agents look promising are under-explored. Architecting data loops will be a challenge. Measurement techniques will need refinement. And the more foundational aspects of building a great product, like UX or communicating the value proposition to the user at various stages of their journey, will constantly be in flux. Most of the learning is still ahead of us, so our analysis should be taken with that uncertainty in mind.
There are key practical restrictions on agent systems that translate to product constraints:
- Asynchronous: LLM latency is on the order of seconds for the largest models. Because agents employ LLMs sequentially across multiple components, and chiefly as the core reasoning and planning model, they are unsuitable for any job that can’t be done asynchronously and at low frequency (tens of minutes or longer).
However, this does not preclude products from jumping from async to sync or interactive applications as they build ever-stronger data loops and train smaller, faster, more specialized reasoning and planning models. In the context of agents for software engineering, this may mean agents that run in continuous integration pipelines, build tests in the background, monitor production systems, or do anything else not requiring interactivity. As they improve, they may instead run on every local commit, or even in real time inside code IDEs.
- Low Performance Bar: Recall that the foundational requirement for building AI systems is answering, “What is the right thing for the AI to do?” Given how little exploration of agentic products has been done, performance will necessarily be poor to start, largely because reasoning errors compound. Products where mistakes are tolerable, easily correctable, and of small but net-positive value are ideal.
Code copilots are a fantastic example, as are related products like bug and security-vulnerability detection, test generation, and performance engineering.
- Rich Feedback Channels: Low performance is a fact both of an under-explored product space with changing requirements and of poorly understood engineering principles. A very fast and scalable improvement curve is necessary for a product to survive, and the only way to get one is to build strong data loops. This only works with the potential for rich feedback channels. Note the use of “potential”: feedback channels must be engineered, and their existence is a combination of great UX and motivated users.
This goes beyond collecting labels that mark good or bad outcomes. We mentioned interpretable reasoning as an important positive attribute of agents. A product designed to expose reasoning chains to the user may be the great unlock: an infinitely scalable channel for very rich feedback.
For instance, if a debugging agent makes a mistake midway through a reasoning chain, a developer who sees it has both the means to fix it (the affordances to modify the reasoning and continue running) and the motivation to do so (having seen just how much work the agent was already able to do autonomously). Not only does the product work correctly for the user immediately, but the agent’s future reasoning improves for many other tasks and other users as well (again, it is the error cascades problem run in reverse). A sketch of capturing such corrections as feedback data follows this list.
- Rich Environments & Available Tools: Agents can only do things if they have the means to take actions through tools. Tools are also the primary method of retrieving external knowledge. Lacking quality APIs in a given environment is ruinous for an agent. Emphasis on quality: if an API is hard for a human to learn, it will be hard for an agent. That means more effort in detecting errors, collecting data, and fine-tuning models.
It’s worth noting the spectrum of interfaces in the digital world. Agents that must browse the web or understand a video game visually are at a disadvantage: if the interface is HTML or raw pixels, another model is required just to integrate with it, which exacerbates the cascading-errors problem and loses interpretability. While an exciting research prospect (see Gato or GPT-4’s visual capabilities), this deals a hard blow to any aspirations for a product.
Software engineers expect APIs for all the systems they work with, which makes software engineering a good category of use-cases for agents. Desktop “power tools” like Figma, Photoshop, Blender, or Excel are also good categories: these products are built for power users with heavy usage, which often leads to a need for customizability and advanced capability. That translates into rich plugin systems, SDKs, or, in the unique case of spreadsheets, a product that is inherently hackable with formulas and data referencing.
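As noted under Rich Feedback Channels, here is a hypothetical sketch of one engineered feedback channel: when a user edits a step in an exposed reasoning chain, the product records the before-and-after pair so it can later feed evaluations or fine-tuning. The field names and the JSONL file are illustrative choices.

import json
import time

def record_step_correction(task_id: str, step_index: int,
                           original_step: str, corrected_step: str,
                           path: str = "feedback.jsonl") -> None:
    # Each correction is an append-only event; downstream jobs can turn these
    # into evaluation cases or fine-tuning examples.
    event = {
        "task_id": task_id,
        "step_index": step_index,
        "original": original_step,
        "corrected": corrected_step,
        "timestamp": time.time(),
    }
    with open(path, "a") as f:
        f.write(json.dumps(event) + "\n")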
Closing Thoughts
Agents are gaining popularity as a concept because LLMs demonstrate reasoning, self-reflection, and tool-use. These combine into agent components capable of planning, memory, and action.
That LLMs are the central new agent technology has important implications for engineering agents for production and for the types of products that can be built. Human-interpretable chains of reasoning enable engineers, non-technical domain experts, and even users to participate in developing and improving agent systems. They replace the arduous and unscalable human-in-the-loop labeling and training methods of prior ML and RL systems.
While agents suffer from cascading errors and a dearth of direct feedback for internal components, the interpretability of an agent’s reasoning, its component-wise architecture, and the (generally) deterministic behavior of its tools are promising attributes for combating these issues.
In any nascent technology it is important to find sustainable paths to further development outside of research. Product use-cases that don’t require interactivity or a high performance bar, and that include accessible tools and rich feedback channels, are a good starting point for building agents that solve real problems.
Of course, it is easy to reason at a high level about how agent technology will evolve. While this primer is intended to be high-level, the hope is that it also serves as a foundation for in-depth, practical work on understanding how to engineer agent systems for useful products.
We hope you’ll join us for the work to come!
Appendix
LLM Capabilities Examples
Here we give some concrete examples of the foundational LLM capabilities.
- Reasoning Example: Chain-of-Thought. Imagine the following prompt and its response. Prompt:
I went to the market and bought 10 apples. I gave 2 apples to the neighbor and 2 to the repairman. I then went and bought 5 more apples and ate 1. How many apples did I remain with?
Response:
11 apples
That’s incorrect. But adding “Let's think step by step.” to the end of the prompt, after the question, results in both an excellent explanation and the correct answer:
First, you started with 10 apples. You gave away 2 apples to the neighbor and 2 to the repairman, so you had 6 apples left. Then you bought 5 more apples, so now you had 11 apples. Finally, you ate 1 apple, so you would remain with 10 apples.
- Self-reflection Example: Reflexion agents’ self-reflection component. Imagine the above Chain-of-Thought example was incorrect. A self-reflection model could be triggered with the following prompt:
You are an advanced reasoning agent that can improve based on self reflection. You were unsuccessful in answering the question. Diagnose a possible reason for failure and devise a new, concise, high level plan that aims to mitigate the same failure. Use complete sentences.
Previous trial:
Question: {question}
{reasoning trajectory}
Reflection:
Here, {question} would be the question asked and {reasoning trajectory} would be the reasoning steps produced by the agent. On the next round, the agent would be given this reflection along with the previous reasoning trajectory to produce an improved one.
- Tool-use Example: Toolformer is a fine-tuned LLM that learns to turn generations like this:
Joe Biden was born in Scranton, Pennsylvania
Into this:
Joe Biden was born in [QA("Where was Joe Biden born?") -> "Scranton"], [QA("In which state is Scranton?") -> "Pennsylvania"]
The [QA("...") -> portion triggers an API call to an external tool, like a facts database; the tool’s response is injected, and the closing square bracket completes the sequence.
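To make the bracketed syntax concrete, here is a hypothetical post-processor that finds [TOOL("args") -> ...] spans, dispatches to a registered tool, and splices the result back into the text. The regex, the tool registry, and lookup_fact() are illustrative; Toolformer itself interleaves the API call with generation rather than post-processing completed text.

import re

def lookup_fact(query: str) -> str:
    """Hypothetical external tool, e.g. a facts database or search API."""
    raise NotImplementedError

TOOLS = {"QA": lookup_fact}

CALL_PATTERN = re.compile(r'\[(\w+)\("([^"]*)"\)\s*->\s*[^\]]*\]')

def execute_tool_calls(text: str) -> str:
    # Replace each [TOOL("args") -> ...] span with the tool's actual answer.
    def substitute(match: re.Match) -> str:
        tool_name, args = match.group(1), match.group(2)
        return str(TOOLS[tool_name](args))
    return CALL_PATTERN.sub(substitute, text)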
Detailed Agent Architecture
Recall the architecture diagram for an agent.
[Diagram: the agent architecture from above (planning, memory, tools, and action around an LLM core), repeated for reference.]
The best way to understand this is inside-out with some simple pseudo-code.
The agent implements a control loop that determines how its components are used:
while is_not_accomplished(goal):
    # Re-plan on every iteration until the goal is met.
    plan_steps = plan(goal)
    last_output = None
    for step in plan_steps:
        # Each step consumes the previous step's output.
        last_output = step.execute(last_output)
The planner might decompose the goal into subgoals and use a self-reflection loop:
def plan(goal):
    is_sufficient = False
    while not is_sufficient:
        # Reasoning: decompose the goal into candidate steps.
        steps = decompose(goal)
        # Self-reflection: critique the plan; the feedback could be fed back
        # into the next decomposition attempt.
        is_sufficient, feedback = critique(steps)
    return steps
And the execute() method determines how a step in the plan materializes into action:
def execute(self, last_output):
    op_type = self.operation_type
    if op_type == MemoryRetrievalOp:
        return retrieve(self)   # look something up in memory (a database)
    elif op_type == ActionOp:
        return act(self)        # call an external tool
    elif op_type == PlanningOp:
        return plan(self)       # recursively plan this step as a subgoal
    ...
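For completeness, here is a hypothetical sketch of the leaf operations behind execute(): retrieve() wraps the memory store and act() wraps a tool registry. MEMORY, TOOLS, and the step fields (description, tool_name, arguments) are illustrative placeholders.

MEMORY = ...   # e.g. a database or vector store exposing a search() method
TOOLS = {}     # e.g. {"calculator": calculator_api, "calendar": calendar_api}

def retrieve(step):
    # Memory retrieval: query the store with the step's description.
    return MEMORY.search(step.description)

def act(step):
    # Action: look up the named tool and call its well-defined API.
    tool = TOOLS[step.tool_name]
    return tool(step.arguments)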