Transcript for:
Exploring Context Windows in LLMs

In the context of large language models, what is a context window? Well, it's the equivalent of the model's working memory. It determines how long of a conversation the LLM can carry out without forgetting details from earlier in the exchange. And allow me to illustrate this using the scientifically recognized IBU scale, that's International Blah Units. So blah here, that represents me sending a prompt to an LLM chatbot. Now the chatbot returns with a response, blah. Right. And then we continue the conversation. So I say something else and then it responds back to me. Blah, blah, blah, blah. International Blah Units. Now, this box here represents the context window, and in this case, the entire conversation fits within it. That means that when the LLM generated this response here, this blah, it had within its working memory my prompts to the model here and here, and it also had the other response that the model had returned to me, in order to build this response. All good.

Now let's consider a longer conversation. So, more blahs. I send my prompt, blah. It then sends me a response. And now we go back and forth with more conversation. I say something, it responds to that. I say one more thing and it responds to that. So now we have a longer conversation here to deal with. And it turns out that this conversation thread is actually longer than the context window of the model. That means that the blahs from earlier in the conversation are no longer available to the model; it has no memory of them when generating new responses. Now, the LLM can do its best to infer what came earlier by looking at the conversation that is within its context window. But now the LLM is making educated guesses, and that can result in some wicked hallucinations. So understanding how the context window works is essential to getting the most out of LLMs. Let's get into a bit more detail about that now.

Now, my producer is telling me that context window size is in fact not measured in IBUs and that I made that up. We actually measure context windows in something called tokens. So let's describe tokenization, get into context window size, and talk about the challenges of long context windows.

So to start, what is a token? Well, for us humans, the smallest unit of information that we use to represent language is a single character. So something like a letter or a number or a punctuation mark, something like that. But the smallest unit of language that AI models use is called a token. Now, a token can represent a character as well, but it might also be a part of a word, or a whole word, or even a short multi-word phrase. So, for example, let's consider the different roles played by the letter A. I'm going to write some sentences and we're going to tokenize them.

Let's start with "Martin drove a car." Now, A here is an entire word, and it will be represented by a distinct token. Now, what if we try a different sentence? So, "Martin is amoral." Not sure why we would say that, but look, in this case, A is not a word, but it's an addition to "moral" that significantly changes the meaning of that word. So here, "amoral" would be represented by two distinct tokens: a token for "a" and another token for "moral". All right, one more: "Martin loves his cat." Now, the A in "cat" is simply a letter in a word. It carries no semantic meaning by itself and would therefore not be a distinct token. The token here is just "cat". Now, the tool that converts language to tokens, it's got a name: it's called a tokenizer.
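To make that concrete, here is a minimal sketch of inspecting token boundaries, assuming the open-source tiktoken library and its cl100k_base vocabulary (neither is mentioned in the video; they are just one convenient example, and the exact splits you see depend on the vocabulary, so treat the output as illustrative):

```python
# Minimal tokenization sketch using the tiktoken library (assumed installed).
# The exact splits depend on the tokenizer's vocabulary; this is for illustration.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # one example BPE vocabulary

for sentence in ["Martin drove a car.",
                 "Martin is amoral.",
                 "Martin loves his cat."]:
    token_ids = enc.encode(sentence)
    pieces = [enc.decode([tid]) for tid in token_ids]
    print(f"{sentence!r} -> {len(token_ids)} tokens: {pieces}")
```

Run against the three sentences above, you would typically see "a" come out as its own token in the first sentence, while the A in "cat" never does; whether "amoral" splits into "a" plus "moral" depends on the particular vocabulary.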
And different tokenizers might tokenize the same passage of writing differently. But a good rule of thumb is that a regular word in the English language is represented by something like 1.5 tokens by the tokenizer. So a hundred words might result in about 150 tokens.

So context windows consist of tokens, but how many tokens are we actually talking about? To answer that, we need to understand how LLMs process the tokens in a context window. Now, transformer models use something called the self-attention mechanism, and the self-attention mechanism is used to calculate the relationships and dependencies between different parts of an input, like words at the beginning and at the end of a paragraph. The self-attention mechanism computes vectors of weights, in which each weight represents how relevant a given token is to the other tokens in the sequence. So the size of the context window determines the maximum number of tokens that the model can pay attention to at any one time.

Now, context window size has been rapidly increasing. The first LLMs that I used had context windows of around 2,000 tokens. The IBM Granite 3 model today has a context window of 128,000 tokens, and other models have larger context windows still. But that almost seems like overkill, doesn't it? I would have to be conversing with a chatbot all day to fill a 128K token window. Well, actually, that's not necessarily true, because there can be a lot of things taking up space within a model's context window. So let's take a look at what some of those things could be.

Well, one of them is the user input, the blah that I sent into the model. And of course, we also have the model responses as well, the blahs that it was sending back. But a context window may also contain all sorts of other things. Most models put what is called a system prompt into the context window. Now, this is often hidden from the user, but it conditions the behavior of the model, telling it what it can and cannot do. A user may also choose to attach some documents to their context window, or they might put in some source code as well, and the LLM can refer to that in its responses. And then supplementary information drawn from external data sources for retrieval augmented generation, or RAG, might be stored within the context window during inference. So a few long documents and some snippets of source code can quickly fill up a context window.

So the bigger the context window, the better, right? Well, larger context windows do present some challenges as well. What sort of challenges? Well, I think the most obvious one would have to be compute. The compute requirements scale quadratically with the length of a sequence. What does that mean? Well, essentially, as the number of input tokens doubles, the model needs four times as much processing power to handle it. Now, remember, as the model predicts the next token in a sequence, it computes the relationships between that token and every single preceding token in the sequence. So as context length increases, more and more computation is going to be required.
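To see where that quadratic growth comes from, here is a minimal sketch of the self-attention computation in plain NumPy (my own illustrative assumption, not how any particular model is implemented). The weight matrix holds one entry for every pair of tokens in the sequence, so doubling the number of tokens quadruples its size:

```python
# Minimal scaled dot-product self-attention sketch in NumPy (illustrative only;
# real transformers use learned projections, multiple heads, and optimized kernels).
import numpy as np

def self_attention(x):
    """x: (n_tokens, d_model) array. Returns the output and the (n, n) weight matrix."""
    n, d = x.shape
    rng = np.random.default_rng(0)
    # Stand-in projections for queries, keys, and values (learned in a real model).
    Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
    Q, K, V = x @ Wq, x @ Wk, x @ Wv

    scores = Q @ K.T / np.sqrt(d)                   # (n, n): one score per pair of tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(1)
for n_tokens in (1_000, 2_000):
    _, w = self_attention(rng.standard_normal((n_tokens, 64)))
    print(f"{n_tokens:,} tokens -> attention matrix with {w.size:,} entries")
# Doubling the tokens (1,000 -> 2,000) quadruples the entries (1,000,000 -> 4,000,000).
```

That n-by-n weight matrix is the reason doubling the input length roughly quadruples the processing required.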
Now, long context windows can also negatively affect performance, specifically the performance of the model. Like people, LLMs can be overwhelmed by an abundance of extra detail. They can also get lazy and take all sorts of cognitive shortcuts. A 2023 paper found that models perform best when relevant information is towards the beginning or towards the end of the input context, and that performance degrades when the model must carefully consider information in the middle of a long context.

And then finally, we also have to be concerned with a number of safety challenges. A longer context window might have the unintended effect of presenting a larger attack surface for adversarial prompts. A long context length can increase a model's vulnerability to jailbreaking, where malicious content is embedded deep within the input, making it harder for the model's safety mechanisms to detect and filter out harmful instructions.

So no matter how you measure it, whether with IBUs or, more accurately, tokens, selecting the appropriate number of tokens for a context window involves balancing the need to supply ample information to the model's self-attention mechanism with the increasing compute demands and performance issues those additional tokens may bring.
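To put that balancing act into practice, here is a rough sketch of a context-budget check using the earlier rule of thumb of about 1.5 tokens per English word. The 128,000-token limit, the 2,000-token response reserve, and the helper names are illustrative assumptions only; a real application would count tokens with the model's own tokenizer:

```python
# Rough context-budget check using the ~1.5 tokens-per-English-word rule of thumb.
# The window size, response reserve, and helper names below are illustrative only;
# a real application would count tokens with the model's own tokenizer.

CONTEXT_WINDOW = 128_000   # e.g., a model with a 128K-token context window
TOKENS_PER_WORD = 1.5      # rough estimate, not an exact figure

def estimate_tokens(text: str) -> int:
    """Estimate the token count of a passage from its word count."""
    return int(len(text.split()) * TOKENS_PER_WORD)

def prompt_fits(system_prompt: str, documents: list[str], conversation: list[str],
                reserve_for_response: int = 2_000) -> bool:
    """Return True if the estimated prompt still leaves room for the model's response."""
    used = sum(estimate_tokens(text) for text in [system_prompt, *documents, *conversation])
    print(f"Estimated prompt tokens: {used:,} of {CONTEXT_WINDOW:,}")
    return used + reserve_for_response <= CONTEXT_WINDOW

# Hypothetical usage: a system prompt, two attached documents, and the chat so far.
print(prompt_fits("You are a helpful assistant.",
                  documents=["first attached document ...", "second attached document ..."],
                  conversation=["blah", "blah blah", "blah blah blah"]))
```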