Hey, this is Lance from LangChain. You might have heard the term "context engineering" recently. It's a nice way to capture many of the different things we do when we're building agents. Of course, agents need context: they need instructions, they need external knowledge, and they use feedback from tool calls. Context engineering is just the art and science of filling the context window with just the right information at each step of an agent's trajectory. I want to talk about a few different strategies for context engineering, which we can group into writing context, selecting context, compressing context, and isolating context. I'll walk through some interesting examples of each one from popular agents that we use frequently in our day-to-day work, and I'll also talk about how LangGraph is designed to support all of these.

But first, what is context engineering, and where did the term come from? Tobi from Shopify had an interesting post saying he likes the term "context engineering." Karpathy followed up on this and offered a good definition: context engineering is the delicate art and science of filling the context window with just the right information for the next step. Karpathy has also recently highlighted an interesting analogy between LLMs and operating systems: the LLM is like a CPU, and the context window is like RAM, its working memory, and importantly it has limited capacity. So, just as an operating system curates what fits in RAM, you can think of context engineering as the discipline, the art and science, of deciding what needs to fit in context at, for example, each step of an agent's trajectory.

Now, what types of context are we talking about? You can think of this as an umbrella over a few different themes. One is instructions: you've heard a lot about prompt engineering, and that's just a subset of this. There are also things like memories, few-shot examples, and tool descriptions. There's knowledge, which could be facts or memories. And there's feedback from tools, for example from using APIs, a calculator, or other tools. So you have all of these sources of context flowing into the LLM when you're building applications.

Why is this trickier for agents in particular? Agents have at least two properties: they often handle longer-running or higher-complexity tasks, and they use tool calling. Both of these result in larger context utilization. For example, feedback from tool calls can accumulate in the context window, and very long-running tasks can accumulate lots of token usage over many turns. Picture it: on turn one you call a tool, on turn two you call another tool, and with a large number of turns that tool feedback just grows and grows.

What's the problem with that? This blog post from Drew Breunig nicely outlines a number of specific context failures: context poisoning, distraction, confusion, and clash. I encourage you to read that post. It's really interesting, but it's also fairly intuitive: as the context grows longer, there's more information for an LLM to process, and there are more opportunities for it to get confused by conflicting information, or for an injected hallucination to influence the response in an adverse way.
And so, for these reasons, context engineering is particularly critical when building agents, because they typically have to handle long context for the reasons mentioned above. Cognition highlighted this nicely in a recent blog post, saying that context engineering is effectively the number one job of engineers building AI agents.

So what can we do about this? I've looked at many of the popular agents a lot of us use today, thought about this a lot, and reflected on my own experience, and you can distill the approaches into four bins: writing context, which means saving it outside the context window to help an agent perform a task; selecting context, which means selectively pulling context into the context window to help an agent perform a task; compressing context, which means retaining only the most relevant tokens; and isolating context, which means splitting context up, again to help an agent perform a task. Now I'll talk through some examples of each of these categories.

First, writing context. Writing context means saving it outside the context window to help an agent perform a task. When humans solve tasks, we take notes, and we remember things for future related tasks. Agents can do those same two things: for note-taking, an agent can use a scratchpad, and for remembering things across tasks, it can use memory.

You can think of "scratchpad" as a term that captures the idea of persisting information while an agent is performing a task. Anthropic's recent multi-agent researcher gives a good example of this: the lead researcher begins by thinking through the approach and saving that plan to memory to persist it. And this is a great point: you want to keep the plan around. The context window might exceed the limit of 200,000 tokens, but the plan can always be retrieved and retained. It's a very intuitive example of taking a note and saving it to a scratchpad. I do want to make a subtle point: the implementation of your scratchpad can differ. In their case, they save it to a file, but you could also, for example, save it to a runtime state object, depending on which agent library you're using. The intuition is that you want to be able to write information down while the agent is solving a task, so the agent can recall that information later if it needs to.

Memories are a bit different. Sometimes we want to save information across many different sessions with an agent. Scratchpads are typically relevant only within a single agent session: the agent uses the scratchpad while it's solving a problem, and then the scratchpad isn't relevant anymore. Memories are things you want the agent to retain over time, across many sessions. There are some fun examples from the literature: Generative Agents, for example, synthesizes memories from collections of past agent feedback. You've also seen this in products: ChatGPT has a well-known memory feature, and Cursor and Windsurf will auto-generate memories based on user-agent interactions. So this pattern is certainly emerging in popular AI products. And again, the intuition is pretty clear: you have some new context, you have some existing memories, and you can update the memories with the new information dynamically as the agent interacts with the user. So that covers writing context, and a few examples of it.
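To make the scratchpad idea concrete, here's a minimal sketch in Python. It's not Anthropic's implementation, just the simplest version of the pattern: write the plan somewhere outside the context window (a file here, though a runtime state object works too), then read it back later when the agent needs it. The file name and plan text are made up for illustration.

```python
from pathlib import Path

SCRATCHPAD = Path("scratchpad.md")  # hypothetical file; could equally be a field in a state object

def write_plan(plan: str) -> None:
    """Persist the plan outside the context window so it survives a long trajectory."""
    SCRATCHPAD.write_text(plan)

def read_plan() -> str:
    """Pull the saved plan back in whenever the agent needs to re-orient."""
    return SCRATCHPAD.read_text() if SCRATCHPAD.exists() else ""

# In a real agent, an LLM would generate this plan at the start of the task.
write_plan("1. Break the research question into subtopics\n2. Search each subtopic\n3. Synthesize findings")

# Many turns later, only the plan (not the full history) needs to re-enter context.
print(read_plan())
```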
Now, let's talk about selecting context. Selecting context means pulling context into the context window to help an agent perform a task. We touched on this already with scratchpads: an agent can of course reference what it wrote previously, whether via a tool call or by reading directly from a state object.

Memories are a bit more interesting and a bit more subtle. There are different types of memories you might want to pull into context depending on the problem you're trying to solve. They could be few-shot examples that demonstrate a desired behavior, a selectively chosen prompt for a given task, or facts. These map onto different memory types: semantic memories in humans are things like facts, things I learned in school; episodic memories are more analogous to few-shot examples, past experiences; and procedural memories are like instructions, instincts, or motor skills. These are all things we may want to selectively pull into context to help an agent solve problems.

Where do these come up in the agents we work with today? Instructions, or procedural memories, are typically captured in things like rules files, or CLAUDE.md when you're working with code agents. This is typically a file with style guidelines or general instructions for the tools to use with a given project, and often these are all pulled into context. For example, when you start Claude Code, it will pull in the CLAUDE.md files you have in your project and your organization. Facts are a bit more subtle. Often we want to selectively pull facts from a large collection, and this is where it's common to use things like embedding-based similarity search or graph databases to house collections of memories, in order to better control retrieval and ensure that only the relevant memories from a large collection are pulled in at the right time.

Tools are another interesting thing we often want to pull into context. One of the problems is that agents have difficulty handling large collections of tools. The paper I link here has some interesting results on that, showing degradation after around 30 tools and complete failure at around 100 tools. So they propose using RAG over tool descriptions: basically embedding the tool descriptions and using retrieval based on semantic similarity to fetch only the relevant tools for a given task. This can improve performance significantly, so it's a nice trick for pulling only the relevant tools into context (see the sketch below).

Finally, I want to talk about knowledge. RAG is a huge topic; you can think of memory as a subset of RAG, but RAG is of course much broader. We often want to augment an LLM's knowledge base with, for example, private documents. Code agents are some of the largest-scale RAG applications today, and I thought this post from Varun, the CEO of Windsurf, was quite interesting. He talks a bit about the approaches they use for retrieval, and the real point is that it's quite non-trivial. Of course you use indexing and embedding-based similarity search as a core RAG technique, but then you have to think about chunking: they, for example, parse the code so they can chunk along semantically meaningful code boundaries rather than arbitrary blocks of code. That's point one. But he also mentions that pure embedding-based search can become unreliable, so they use a combination of techniques, like grep-based file search and knowledge graphs, with LLM-based re-ranking on top of all of it.
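To make the selection idea concrete, here's a minimal sketch of embedding-based retrieval over tool descriptions; the same pattern applies to chunks of documents or code. It assumes an OpenAI API key and the langchain-openai package, and the tool catalog is made up for illustration.

```python
from langchain_openai import OpenAIEmbeddings

# Hypothetical tool catalog; in practice these come from your actual tool definitions.
TOOLS = {
    "search_web": "Search the public web for up-to-date information on a topic.",
    "run_sql": "Execute a read-only SQL query against the analytics warehouse.",
    "send_email": "Draft and send an email to a colleague.",
    "get_weather": "Look up the current weather for a city.",
}

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Index step: embed every tool description once, up front.
names = list(TOOLS)
vectors = embeddings.embed_documents([TOOLS[n] for n in names])

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm

def select_tools(task: str, k: int = 2) -> list[str]:
    """Return only the k tools whose descriptions are most similar to the task."""
    query = embeddings.embed_query(task)
    scored = sorted(zip(names, vectors), key=lambda nv: cosine(query, nv[1]), reverse=True)
    return [name for name, _ in scored[:k]]

# Only the selected tools would be bound to the LLM for this step.
print(select_tools("What's the forecast for Paris this weekend?"))
```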
So when you look at knowledge selection in popular agents, for example code agents, it's highly non-trivial, and a huge amount of context engineering goes into it.

We've talked about writing context and selecting context. Now let's talk about compressing context. This involves retaining only the tokens required to perform a task, and a common idea here is summarization. If you've used Claude Code, you may have noticed that it calls auto-compact once the session reaches 95% of the context window (200,000 tokens for the Claude models). That's an example of applying summarization across a full agent-user trajectory. What's interesting is that you can also apply summarization more narrowly. Anthropic's recent multi-agent researcher post talked about applying summarization only to completed phases of work. Cognition's post makes a similar point, but they apply summarization at the interface between agents and sub-agents: they use summarization to compress context so that, in their example, sub-agent one receives a compressed version of the context the initial agent was using. It's a means of information handoff between, in this case, sequential sub-agents. But the principle is the same in all cases: summarization is a very useful technique for compressing context in order to manage overall token bloat when working with agents. It's also worth calling out that you can use trimming as well. You can think of this as a more mechanical removal of tokens: you can use heuristics, for example keeping only the most recent messages, which is a simple approach, or you can use learned approaches, where an LLM-based method does the trimming or context pruning.

So we've covered writing, selecting, and compressing context. Now let's talk about the final category, isolating context. Isolating context involves splitting context up to help an agent perform a task, and multi-agent is the most intuitive example. The Swarm library from OpenAI was designed around separation of concerns, where a team of agents each has its own context window, tools, and instructions. Anthropic made this a bit more explicit in their recent multi-agent researcher post: they mention that the sub-agents operate in parallel with their own context windows, exploring different aspects of the question simultaneously. One of the key points from their blog post is that multi-agent expands the number of tokens the overall system can process, because each agent has its own context window and can independently research a subtopic. That allows for richer report generation, because the system was able to process more tokens across the various sub-agents.

There are some other techniques for context isolation I want to call out. Hugging Face's Open Deep Research gives an interesting example. They use a code agent: an LLM generates executable code containing whatever tool calls the agent wants to run, and that code is executed in a sandbox, which runs the tools. Selective information can then be passed back to the LLM to reason about: return values, standard output, variable names, and so forth. The key point they make is that this sandbox can persist state over multiple turns, so you don't have to dump all of that context back to the LLM across multiple turns of the agent. The environment can house a lot of token-heavy information, like images or audio files, so it's never exposed to the LLM's context window at all. This is another nice trick for isolating token-heavy objects from the LLM's context window and only selectively passing back the things you know the LLM needs to make its next decision.
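Here's a hedged sketch of that isolation trick, without a real sandbox: tool results are parked in an environment-side store, and only a small handle plus a short description go back to the LLM. The store, the handle format, and the fake "download" are all made up for illustration and are not Hugging Face's implementation.

```python
import uuid

# Environment-side store: token-heavy objects live here, never in the LLM's context.
ARTIFACTS: dict[str, bytes] = {}

def fetch_large_file(url: str) -> dict:
    """Pretend tool: 'downloads' a big artifact and returns only a handle plus metadata."""
    data = b"\x89PNG..." * 10_000  # stand-in for a large image/audio payload
    handle = f"artifact-{uuid.uuid4().hex[:8]}"
    ARTIFACTS[handle] = data
    # Only this small dict would be appended to the agent's message history.
    return {"handle": handle, "bytes": len(data), "note": f"Downloaded from {url}; stored out of context."}

def summarize_artifact(handle: str) -> str:
    """A later tool call can operate on the stored object by handle, again returning only a summary."""
    data = ARTIFACTS[handle]
    return f"{handle} holds {len(data)} bytes; first bytes: {data[:8]!r}"

ref = fetch_large_file("https://example.com/big-image.png")
print(ref)
print(summarize_artifact(ref["handle"]))
```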
I also want to call out that a runtime state object is another fairly obvious way to perform context isolation. It's common to define a data model for your state, for example a Pydantic model, with different fields. Those fields are just different buckets you can put context into, and you can selectively decide what to fish out and pass to the LLM at a given stage of your agent. So it's another very nice and intuitive way to isolate context.

We've now covered these four categories, writing, selecting, compressing, and isolating context, and talked through a bunch of examples from popular agents. Now I want to talk about how LangGraph enables all of these.

First, as a preface: before you undertake any context engineering, it's useful to have at least two things. One is the ability to actually track token usage, which you can get through tracing and observability; LangSmith is a great way to do that. The other is, ideally, evaluation, some way to measure the effect of your context engineering effort. Here's a good example: say you're using some kind of context compression. You want to make sure you didn't actually degrade the agent's behavior, and a simple evaluation with LangSmith is a great way to check that. Those are table stakes before undertaking any context engineering effort.

So, let's talk about writing context in LangGraph. LangGraph is a low-level orchestration framework for building agents. You lay agents out as a set of nodes and the edges connecting those nodes. For example, a typical tool-calling agent has one node that makes the LLM call and one node that executes the tools, and you just bounce between the two; it's a super simple layout for a classic tool-calling agent. Now, the notion of a scratchpad is very nicely supported in LangGraph, because LangGraph is designed around the idea of a state object. In every node, this state object is accessible: you can fish anything you want out of it and write anything back to it. The state object is typically defined up front when you lay out your graph; it can be, for example, a dictionary, a TypedDict, or a Pydantic model. It's just a data model that you define, and it's accessible to you. That's perfect for the scratchpad idea: in the LLM node of your agent, the agent can take notes, those notes can be written to state, and that state is checkpointed across the lifetime of your agent, so you can access it in any node at any point within the session, for example on future turns. And that's exactly the intuition behind why a scratchpad is so useful: agents can write things down and fetch them later. For memory, LangGraph is designed with long-term memory as a first-class component. Long-term memory is accessible in every node of your graph, and you can very easily write to it.
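Here's a minimal, hedged sketch of that state-as-scratchpad idea in LangGraph. The node names, the `plan` field, and the hardcoded plan string are purely illustrative; in a real agent an LLM call would populate them.

```python
from typing import Annotated
from typing_extensions import TypedDict
from langchain_core.messages import AnyMessage, HumanMessage, AIMessage
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
from langgraph.checkpoint.memory import MemorySaver


class State(TypedDict):
    messages: Annotated[list[AnyMessage], add_messages]  # conversation shown to the LLM
    plan: str  # scratchpad field: lives in state, not necessarily in the prompt


def make_plan(state: State) -> dict:
    # In a real agent an LLM would write this; hardcoded here to keep the sketch runnable.
    return {"plan": "1. gather sources  2. take notes  3. draft answer"}


def act(state: State) -> dict:
    # Later nodes (and later turns) can read the plan back out of state.
    return {"messages": [AIMessage(content=f"Following plan: {state['plan']}")]}


builder = StateGraph(State)
builder.add_node("make_plan", make_plan)
builder.add_node("act", act)
builder.add_edge(START, "make_plan")
builder.add_edge("make_plan", "act")
builder.add_edge("act", END)

# The checkpointer persists state per thread, so the scratchpad survives across turns of a session.
graph = builder.compile(checkpointer=MemorySaver())
config = {"configurable": {"thread_id": "demo"}}
result = graph.invoke({"messages": [HumanMessage(content="Research context engineering")]}, config)
print(result["plan"])
```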
So the key point is that within a session, LangGraph has persistence via checkpointing, where agent state is accessible at every node. If you want to save things across different agent sessions, that's also achievable using LangGraph's native, built-in long-term memory.

Now, I talked a bit about this previously: how about selecting context? Within LangGraph, you can select from state in any node, and that can serve as your scratchpad. You can also retrieve from long-term memory in any node, and what's interesting is that long-term memory can store different memory types: you can store simple files, and you can also store collections and use embedding-based similarity search over them, as an example. Check out these two resources: the course from DeepLearning.AI if you want to learn a lot about the different memory types, and the recent course on ambient agents if you want a simple, crisp example of long-term memory in a long-running email assistant agent. The nice thing about that example is that the agent updates memory based on human feedback, so it really shows that human-in-the-loop feedback updating memories, and those long-term memories persisting over time to govern and improve the agent's behavior, in a kind of virtuous cycle as you give it more feedback.

The pre-built langgraph-bigtool library is a really neat example of tool selection in LangGraph. It uses exactly the principle mentioned previously, embedding-based semantic similarity search across tool descriptions, and you can see it all in this repo. It's quite an effective way to select from large collections of tools. And for RAG, LangGraph is a very low-level framework, so you can implement many different RAG techniques with it. I've linked some tutorials that we have, and we also have a lot of popular videos on building RAG workflows and agents with LangGraph.

How about context compression? LangGraph has a few useful utilities for summarizing and trimming message history when you're building agents, which can be used out of the box. But it's also, of course, a low-level framework, so you have the flexibility to define logic within each node of your agent. One thing I do frequently is add logic to post-process certain token-heavy tool calls right inside my tool node: you can look at which tool was called and kick off a little post-processing step depending on the tool the LLM selected. So it's very easy to augment your agent's logic in LangGraph to incorporate things like post-processing, because it's a low-level framework and you control all the logic within each node of your agent graph. I show a little example of this in our Open Deep Research repo.

Now, context isolation. We've actually done a lot of work on multi-agent: we have implementations for both the supervisor and swarm architectures, which are popular open-source multi-agent implementations, and a bunch of videos here that show how to get started. So multi-agent has been well supported in LangGraph for a while. LangGraph also works nicely with different environments and sandboxes. This is a cool repo by Jacob: it uses E2B within a LangGraph node to do code execution. And this video talks about using a sandbox with LangGraph, in that case with state persistence, so state is persisted within the sandbox across different turns of the agent, which is exactly what we saw in the Hugging Face example. That can be a very nice way to isolate context within an environment and prevent it from flooding back into your agent's context window.
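Circling back to the compression point for a second, here's a small sketch of the trimming idea using the `trim_messages` helper from langchain-core. This is just one way to keep only recent messages before an LLM call, not necessarily the exact utility referenced above, and the message history and limits are made up.

```python
from langchain_core.messages import AIMessage, HumanMessage, SystemMessage, trim_messages

history = [
    SystemMessage(content="You are a helpful research assistant."),
    HumanMessage(content="Find papers on context engineering."),
    AIMessage(content="Here are three papers..."),
    HumanMessage(content="Summarize the second one."),
    AIMessage(content="It proposes..."),
    HumanMessage(content="What did the first one conclude?"),
]

# Keep only the most recent messages (token_counter=len counts messages rather than tokens),
# always retaining the system message and starting the kept window on a human turn.
trimmed = trim_messages(
    history,
    strategy="last",
    token_counter=len,
    max_tokens=4,
    include_system=True,
    start_on="human",
)

for msg in trimmed:
    print(type(msg).__name__, ":", msg.content)
```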
And finally, there's the state object itself. State objects are central in LangGraph; as mentioned, they're available to you within each node of your graph, and you can read from them and write to them. You can design the state object to have a schema, for example a Pydantic model with multiple fields. So you can have one field, like messages, that's always exposed to the LLM at each turn of your agent, and other fields that hold whatever other information you want to keep around but only use at specific points, maybe toward the end of your agent's trajectory. So it's very easy to organize and isolate context in your state object just by defining a simple schema.

Just to summarize: there are at least four overall categories of context engineering that we've seen across many popular agents: writing, selecting, compressing, and isolating context. Writing context typically means saving it outside the context window to help an agent perform a task, usually so the agent can retrieve that context at a later point in time. That could be a scratchpad, writing to a state object, or writing to long-term memory. Selecting context could mean retrieving tools, retrieving information from a scratchpad the agent is using within a given session, retrieving long-term memories that provide guidance based on past interactions, or just retrieving relevant knowledge, which of course gets into the entire topic of RAG, a very deep one. Then there's compressing context: summarizing it or trimming it, basically trying to retain only the most relevant tokens required to perform a task. And there's isolating it: a great and simple way to do this is partitioning context in a state object; sandboxing is an interesting approach for isolating context from the LLM; and of course there's multi-agent, which involves splitting context up between different sub-agents that perform independent tasks but collectively increase the overall number of tokens the system can process to accomplish a task.

These are just a few of the categories. This is very much a moving and emerging field, and this is not a complete list, but at least it's my attempt to organize the space, talk about some interesting examples, and hopefully give some references on how to do each of these things in LangGraph. Thanks a lot.