How "lost in the middle"? Evaluating LLMs on information retrieval with long input contexts
LLMs have become increasingly powerful over the past two years. Across 2023 and 2024, we saw new models flourish — every few months, a new model would reach new heights in reasoning, retrieval, math and other benchmarks. From a practical standpoint, one of the most significant advancements has been the explosion in context length. We went from a mere 4,096 tokens for GPT-3.5 in late 2022 to a staggering 1 million for the Gemini 1.5 family of models (with Gemini 1.5 Pro boasting an even greater 2 million!).
This ability to handle extended contexts has been a game-changer for my team at Ario. We're building personal AI agents, and these longer contexts allow us to craft highly personalized responses for our users. We've been steadily enriching (and extending) our prompts with more and more personal data: memories, user profiles, conversation history, schedules, knowledge about friends and family, recent Amazon purchases, user-requested response customizations, and more. We've also been incorporating increasingly complex and nuanced instructions. (Try our app!)
Up until early 2024, we relied mostly on fairly standard techniques like semantic retrieval and re-ranking to fetch relevant user data, all within a modest context window of a few thousand tokens. However, as we packed our prompts with more data, we started noticing a troubling trend: some of our models were beginning to suffer from context saturation. Their ability to follow instructions and retrieve information was degrading as our context lengths grew, even though their advertised context windows grew even more quickly.
This led me to question: how reliable are long-context models across their entire contexts? Does their performance degrade significantly when using their key feature: long context windows? What should we be aware of when using long-context models for their long contexts?
This blog post is an attempt to shed some light on those questions. It started out as a series of simple experiments, but things got complex pretty quickly both due to the number of models and my interest in measuring performance accurately. It also focuses on a single aspect of performance: information retrieval accuracy. I will leave other aspects (e.g. instruction following) as a follow-up.
This blog post also grew quickly in length. Along the way, I realized that Substack probably isn’t the best platform for long-form technical writing — there is absolutely no support for syntax highlighting, inserting tables (or tables of contents!), adding subscripted or superscripted text, or even inserting LaTeX symbols and equations inline. That said, I’m too lazy to migrate this to another platform. I may convert this to a paper if there is enough interest.
I’d love to hear your thoughts and feedback, either here or via LinkedIn. Thank you for reading!
Introduction
In their late 2023 paper "Lost in the Middle: How Language Models Use Long Contexts", Liu et al. found that LLMs do not robustly make use of information in long input contexts. Specifically, they found that LLMs tended to demonstrate a "U-shaped" performance curve across both multi-document question answering and key-value retrieval, with generally weaker performance when retrieving data in the middle of the input context.
The term “robustness” is generally used in machine learning to describe resilience to some kind of perturbation. The authors likely meant that LLMs’ performance was often affected by changes to the position of relevant information within their (long) input contexts. In other words, LLMs did not retrieve evenly across their context windows.
The authors evaluated various models that were state-of-the-art at the time — closed models such as OpenAI’s GPT-3.5-Turbo (up to 16K context length) and Claude 1.3 (up to 100K context length), as well as open models such as MPT 30B Instruct and LongChat 13B 16K (a fine-tuned version of Llama-13B).
One year later, we have seen an explosion of models with ever-increasing context lengths. Claude 1.3’s then-revolutionary 100K context length is no longer the best. Gemini 1.5 Pro has now taken the crown at 2M tokens, with Gemini 1.5 Flash (regular and 8B variants) weighing in at 1M tokens. The Claude family (v3.5) is now at 200K tokens, while OpenAI maxes out at 128K tokens. (Bear in mind that different model families use different tokenizers, so allow for some ~30% margin when considering context lengths.)
This “context length race” evokes memories of the “megahertz race” that we saw in the late 1990s/early 2000s between Intel and AMD, where each positioned its processors as having ever-faster clock speeds. But just as we asked then whether clock speeds mattered that much, let’s ask again now: are greater context lengths always better? How do models perform across different context lengths?
Evaluating model performance for long context windows
While LLMs have many real-world applications, one particular use case comes to mind when it comes to ever-increasing context lengths: information retrieval.
In the past, when we had much shorter context windows — e.g. 4,096 for the original GPT-3.5, or even 512 for BERT! — we often needed to build much more involved architectures for information retrieval, e.g. document Q&A use cases. We paid a lot of attention to document chunking, text embeddings generation and storage, semantic matching and re-ranking, and other steps involved in RAG. There was often a trade-off between using more input tokens and reasoning/instruction-following ability — longer input contexts generally led to higher risk of context saturation. (This is still true today, since model architectures and training have not substantially changed.) And more complex RAG systems were always harder and more expensive to maintain, more flaky, and slower.
With 128K or 1M or 2M tokens, however, it is tempting to radically simplify our architecture. Do we still need complex RAG-based systems? Why not dump everything into the model’s input context, and wait for magic to happen? Now that models are also getting cheaper and faster, this becomes an especially attractive proposition.
This is theoretically possible if these long-context models work well across their entire input context windows. The big question: do they?
Existing research
There are multiple ways to evaluate model performance — general knowledge and understanding (e.g. MMLU), reasoning (e.g. HellaSwag), instruction following, math (e.g. MATH), science (e.g. GPQA, GPQA-diamond), coding (e.g. HumanEval), function calling (e.g. BFCL), etc. There are also a number of information retrieval benchmarks such as BEIR, MTEB and BERRI that focus on tasks like dense retrieval, semantic search and clustering (although not specifically for long-context use cases).
There have been several studies on the performance of long-context models, in addition to the Lost in the Middle paper.
A notable paper is “RULER: What’s the Real Context Size of Your Long-Context Language Models?” (last revised August 2024) by Hsieh et al. The authors proposed a more comprehensive framework for assessing performance, including (i) several types of retrieval, (ii) multi-hop tracing, (iii) aggregation, and (iv) question answering. They used this framework to evaluate ~30 long-context language models (mostly open-source or open-weight models) mostly in the range 32K to 1M tokens, and found that most models have a much shorter effective context window than their claimed length. Specifically, they observed that most models exhibited large degradation on tasks as context length increased. The five highest-ranked models then were Jamba-1.5 (both large and mini), Gemini 1.5 Pro, GPT-4-1106-preview, and Llama 3.1 (70B).
Also in August 2024, a team from Databricks published an article (Long Context RAG Performance of LLMs) which evaluated many current long-context LLMs on their retrieval performance. They studied how haystack size and needle count independently affected model performance at various context lengths, using four question-answer datasets (Natural Questions, HotpotQA from BEIR, FinanceBench, and a proprietary Databricks DocsQA v2). For each experiment, they included different numbers of document chunks in a given model’s input context, and evaluated the model’s ability to answer a constant set of questions from each of the four datasets.
Because the Databricks experiment is geared towards question answering, it is reasonable to assume that more input context increases the probability of having all the relevant information needed to answer each question set — so each model’s accuracy should increase monotonically until the point of context saturation, at which point performance drops.
Unfortunately, the article did not evaluate the performance of Gemini and Jamba models; it also did not consider how the Claude models performed between 125K and their maximum 200K tokens.
With this in mind, there are some gaps in current research:
- It tends to focus on open models, and does not cover all long-context models from the most popular commercial vendors — which are likely what many companies will use.
- It does not account for the cost of using each model.
- It does not study how performance varies across different models in the same family, or across different context lengths and task difficulties for the same model.
- There is little discussion of the impact of such performance on how we should design LLM-based systems in real life.
Questions
Therefore, this article seeks to ask the following questions:
- How accurately do the latest long-context LLMs (>= 128K tokens) retrieve information from across their context windows? Do LLMs retrieve better from certain parts of their context? What are the practical limits for each model, both in terms of input context length and retrieval task complexity, i.e. when do the models start getting “saturated”?
- How do LLMs compare with each other on retrieval performance — both within and across model families?
- What are some viable LLM options for use cases of various context length requirements, and which LLMs are more cost effective?
- And finally, what are the implications of these findings for LLM application design?
Setting up the "phone book" experiment
There are several pre-existing datasets widely used for information retrieval, but almost all of them are document-based. I wanted to measure “pure” retrieval performance, and was inspired by the Lost in the Middle paper which used a key-value retrieval setup featuring keys and values which were all 128-bit UUIDs.
In this case, the models were provided with k key-value pairs and asked to identify the value for a single key. This meant that there were k - 1 “distractor” key-value pairs. Experiments were conducted for three values of k (75, 140 and 300) which corresponded to ~4K, ~8K and ~16K tokens. (While many models now accommodate >=128K tokens, a context length of 16K was likely state of the art when the paper was written in late 2023.)
Generally speaking, this is an associative recall problem; specifically, the RULER paper labels this as a “multi-key Needle-in-a-Haystack (NIAH)” problem.
However, Arora et al. (2024) suggest that asking models to retrieve just a single value from a pool of many key-value pairs is not representative of how language works in practice. In the real world, we (and therefore language models, with real-world inputs) often need to perform multiple associative recalls in a single forward pass. Arora et al. formally defined this as a “Multi-Query Associative Recall” problem; the RULER paper refers to this as “Multi-queries Needle-in-a-Haystack”.
To illustrate, imagine that you are walking into a busy cafe.
The menu is essentially a list of key-value pairs, with menu items serving as keys and prices serving as values. You plan to order a latte, an espresso and a croissant, and you need to figure out how much they cost in total. To do so, you will need to look up three keys and sum up their values, ideally from a single scan of the menu (since there is a long line behind you…). This is essentially a “multi-query associative recall” problem with a single forward pass.
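To put the same idea in code terms, here is a toy illustration (the prices are made up for the example):

```python
# A toy "multi-query associative recall": several key lookups resolved in one pass.
menu = {"latte": 5.50, "espresso": 3.00, "croissant": 4.25}  # keys -> values (illustrative prices)
order = ["latte", "espresso", "croissant"]                   # multiple queries

total = sum(menu[item] for item in order)  # three associative recalls, then an aggregation
print(f"Total: ${total:.2f}")              # Total: $12.75
```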
Given the prevalence of such “multi-query associative recall” cases in real life, and given how we are increasingly applying LLMs to real life situations with ever-increasing context lengths, it is useful to evaluate how well LLMs perform in this regard.
Since I wanted to measure “pure” retrieval performance, I created a dataset modeled after phone books. Remember the Yellow Pages books?
These were essentially lists of businesses (and persons) and their corresponding phone numbers.
I created a dataset of 1M random 15-character “names” and 10-digit “phone numbers”. Here are the first ten:
heeuvnzhvevkvqn : (210) 300-4519
hjbbjzxdabxfvsf : (780) 252-8833
abbgstzdnrrtsis : (196) 540-7660
uzhcntegheuxctn : (958) 638-5717
ndklugdyihlszki : (996) 111-7671
nxkezztoahurigm : (072) 185-0440
ellmrwopzrbegax : (266) 039-5338
qtyljnnddweysbz : (911) 743-0818
bcojdpjvobcyaqm : (289) 317-9478
ggircpjdjmsyjzr : (597) 797-0899
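For reference, here is a minimal sketch of how such a phone book might be generated (an illustration, not necessarily the exact code in my repo):

```python
import random
import string

def generate_phone_book(num_entries: int, seed: int = 42) -> dict[str, str]:
    """Generate random 15-character "names" mapped to random 10-digit "phone numbers"."""
    rng = random.Random(seed)
    phone_book: dict[str, str] = {}
    while len(phone_book) < num_entries:
        name = "".join(rng.choices(string.ascii_lowercase, k=15))
        number = f"({rng.randint(0, 999):03d}) {rng.randint(0, 999):03d}-{rng.randint(0, 9999):04d}"
        phone_book[name] = number  # dict keys keep the names unique
    return phone_book

phone_book = generate_phone_book(1_000_000)
for name, number in list(phone_book.items())[:10]:
    print(f"{name} : {number}")
```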
Since the names are semantically meaningless, they are almost certainly not going to be found in LLMs’ training data. This will minimize any potential semantic “pollution” between the training data and the input contexts, so we will be starting from as blank a slate as possible.
I then provided each model with a list of names, and asked each model to retrieve the corresponding phone numbers. This is the experimental setup (a code sketch follows the list):
- Given a phone book (i.e. haystack) of size M, and a desired list of names (i.e. needles) of size N, split the phone book into N equal bins.
- For each run, pick one random name from each bin (1 through N). Check whether the model retrieved the correct phone number for each name. Perform 20 runs.
- For each bin, compute the average retrieval performance (number of correct retrievals out of 20). Performing 20 runs reduces the impact of any outliers or bad runs.
- Plot a chart of retrieval performance for all bins (i.e. across the entire input context), for the given model, for M-sized phone book and N entries retrieved.
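Here is a minimal sketch of that procedure for a single (model, M, N) configuration. `query_model` is a hypothetical helper that sends the phone book and the requested names to a given model and returns its plaintext response:

```python
import random

def run_experiment(phone_book: dict[str, str], model: str, n_needles: int, n_runs: int = 20) -> list[float]:
    """Return per-bin retrieval accuracy for one (model, haystack size, needle count) configuration."""
    entries = list(phone_book.items())
    bin_size = len(entries) // n_needles
    correct = [0] * n_needles  # correct retrievals per bin, summed over all runs

    for _ in range(n_runs):
        # Pick one random needle (name) from each of the N bins.
        needles = [random.choice(entries[i * bin_size:(i + 1) * bin_size]) for i in range(n_needles)]
        response = query_model(model, phone_book, [name for name, _ in needles])  # hypothetical helper
        retrieved = dict(line.split(" : ", 1) for line in response.splitlines() if " : " in line)
        for i, (name, number) in enumerate(needles):
            if retrieved.get(name, "").strip() == number:
                correct[i] += 1  # exact match required; anything else counts as a miss

    return [c / n_runs for c in correct]  # per-bin accuracy, plotted across the input context
```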
The three main indicators of interest are:
- Performance given context position (i.e. where was the phone number located within the input context?)
- Performance given context used (i.e. how much of the model’s available context was used?)
- Performance given retrieval count (i.e. how many phone numbers were retrieved?)
I ran these configurations:
- M: 1K, 5K, 8K, 10K, 20K, 40K, and 80K. Generally, the 128K context models maxed out at M=5K, which corresponded to ~100-120K tokens, depending on each model’s tokenization. The Claude family maxed out at 8K, and the Gemini 1.5 Flash (regular and 8B) models maxed out at 40K. Only Gemini 1.5 Pro supported M=80K.
- N: 10, 50, and 100.
I tested these models:
| Model Family | Model | Supported Context | Provider |
|---|---|---|---|
| Gemini | gemini-1.5-flash-8b | 1M | |
| Gemini | gemini-1.5-flash-002 | 1M | |
| Gemini | gemini-1.5-pro-002 | 2M | |
| OpenAI | gpt-4o-2024-08-06 | 128K | |
| OpenAI | gpt-4o-mini-2024-07-18 | 128K | |
| OpenAI | gpt-4-turbo-2024-04-09 | 128K | |
| OpenAI | o1-2024-12-17 | 200K | |
| OpenAI | o1-mini-2024-09-12 | 128K | |
| Anthropic | claude-3-5-haiku-20241022 | 200K | |
| Anthropic | claude-3-5-sonnet-20241022 | 200K | |
| Anthropic | claude-3-opus-20240229 | 200K | |
| AI21 | jamba-1-5-large | 256K | OpenRouter |
| AI21 | jamba-1-5-mini | 256K | OpenRouter |
| Qwen | Qwen2.5-72B-Instruct-Turbo | 128K | Together.ai |
| Mistral | ministral-8b | 128K | OpenRouter |
| xAI | grok-2-1212 | 131K | OpenRouter |
| Deepseek AI | DeepSeek-R1 | 164K | Together.ai |
| Deepseek AI | DeepSeek-V3 | 131K | Together.ai |
| Llama | Meta-Llama-3.1-8B-Instruct-Turbo | 128K | Together.ai |
| Llama | Meta-Llama-3.1-70B-Instruct-Turbo | 128K | Together.ai |
| Llama | llama-3.2-90b-vision-instruct | 128K | Together.ai |
| Llama | Meta-Llama-3.3-70B-Instruct-Turbo | 128K | Together.ai |
| Llama | nousresearch/hermes-3-llama-3.1-405b | 128K | OpenRouter |
| Microsoft | phi-3.5-mini-128k-instruct | 128K | OpenRouter |
The following prompt was provided to all of these models (as a system prompt for those models that support system prompts, otherwise as the first user message):
You are a phone book assistant. You can help users find phone numbers for people.
You will be given one or more name(s), and you should respond with the corresponding phone number(s).
Return each name and phone number pair on a new line, in this format:
NAME_1 : PHONE_NUMBER_1
NAME_2 : PHONE_NUMBER_2
If you don't have the phone number for a given name, respond with 'NOT_FOUND' as the phone number.
# Phone book
heeuvnzhvevkvqn : (210) 300-4519
hjbbjzxdabxfvsf : (780) 252-8833
abbgstzdnrrtsis : (196) 540-7660
uzhcntegheuxctn : (958) 638-5717
...
The user message was:
Please find phone numbers for these names: heeuvnzhvevkvqn, hjbbjzxdabxfvsf, abbgstzdnrrtsis, ...
To ensure consistency across models, I requested plaintext output from each model. All models were run with a temperature of 0 and a top_p of 0.95.
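As an illustration, a single run against the OpenAI models looked roughly like this (a sketch using the OpenAI Python SDK; `build_phone_book_prompt` is a hypothetical helper that renders the system prompt shown above):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def query_openai(model: str, phone_book: dict[str, str], names: list[str]) -> str:
    """Send the phone book prompt plus a list of names, and return the model's plaintext answer."""
    system_prompt = build_phone_book_prompt(phone_book)  # hypothetical helper: renders the prompt shown above
    user_prompt = "Please find phone numbers for these names: " + ", ".join(names)
    response = client.chat.completions.create(
        model=model,
        # For models without system-prompt support, the same text was sent as the first user message instead.
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        temperature=0,
        top_p=0.95,
    )
    return response.choices[0].message.content
```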
Overall findings: LLMs retrieve less accurately in three scenarios
Long-context LLMs do not uniformly make use of information across their input contexts. The retrieval performance of many LLMs is significantly affected by the relative position of information within their input contexts.
Different long-context model families and models are affected to different degrees and in different ways. Most models do not perform well across their advertised context lengths — do not take the advertised context lengths at face value!
In general, models perform worse in the following three scenarios:
1 - When we use more of a model’s available context
For example, gpt-4o-mini retrieves 100 of 1,000 phone book entries almost perfectly (~18K tokens) (see chart below). However, when presented with 5,000 phone book entries (~90K tokens), retrieval accuracy from the first 1,000 phone book entries drops significantly, ending up at ~70% around the 1,000th entry. This is unsurprising, as it is essentially the problem of context saturation.
2 - When we retrieve data that is located later in the context
This is especially true when using very large contexts for many models. Using gpt-4o-mini again as an example, retrieval accuracy starts out almost perfect when retrieving from the beginning of the context window regardless of context size. However, with 5,000 phone book entries (~90K tokens), accuracy drops quite linearly to ~0% when retrieving from the end of the context window.
3 - When we attempt to retrieve more items, given the same input context length
For example, gpt-4o-mini performs decently well when retrieving 10 of 5,000 phone book entries (~90+K tokens, ~80% minimum accuracy). Accuracy drops to ~70% when retrieving 50 entries, and crashes to ~20% when retrieving 100 entries (with the same input context length).
No models are completely immune to these effects, though some models and model families appear more susceptible than others, even relative to their published supported context lengths.
Practical applications
The phone book test is a very strict test, given that answers need to be exactly correct — there is zero tolerance for hallucination. It is also a very “sterile” test, because we do not need to post-process model responses at all. In comparison, most real-world applications have different — likely more forgiving — requirements. For example, retrieving information from books or document collections will generally be less taxing, because information in books is often repeated, leading to lower “information density”. Such textual retrieval is often semantic in nature; however, the internal semantic associations within an LLM’s inputs might also lead to a higher tendency to hallucinate.
We should be cautious about low retrieval accuracy not so much because the model is unable to find some information within its provided context, but because it may hallucinate and (convincingly) return completely false information. These are often difficult to detect and remedy in production applications.
Note that retrieval accuracy cannot be evaluated entirely independently from instruction following ability. If a model is unable to follow instructions well in some situations (e.g. context saturation), it won’t be able to retrieve accurately. For example, Jamba 1.5 Large often returned completely hallucinated output (even for the easiest test of M=1,000 and N=10):
Here is another notable response:
Detected as DDoS attack.
How should you structure retrieval tasks?
From the data, it is clear that we cannot blindly trust LLMs to retrieve accurately from large input contexts. When faced with such retrieval problems, we will need a more sophisticated design — the LLM is simply one (rather convenient) tool out of many possible tools in our AI engineering toolkit.
Here are some questions that you should ask:
- Volume:
- Haystack size: How much information do you need to retrieve from? If it isn’t too much, it is probably alright to use an LLM.
- Needle count: How much information do you need to retrieve each time? If you need to retrieve a lot, it is probably better to avoid using LLMs or to shard the retrieval.
- Nature of data: What does that information look like — is it mainly structured (e.g. key-value pairs like phone books, pricing data, inventory data; tabular data e.g. customer records, logs, transaction data) or unstructured (e.g. books, instruction manuals, etc)? Is it all text, or is it multi-modal in nature?
- For structured data, you are probably better off pre-processing it more and using some kind of code-generation approach — it is also easier to build a self-consistent retrieval system with structured data (i.e. being able to cross-check or validate the retrieved data).
- For unstructured data, especially textual data that have lots of semantic associations, LLMs will probably be a convenient tool.
- Note that if the information is domain-specific, it is probably not a good idea to use generally pre-trained LLMs; consider using an LLM fine-tuned with the right domain knowledge. If there are specific and important keywords in your domain, consider a hybrid system that indexes those keywords and uses BM25 or a similar algorithm for retrieval.
- Task Granularity: Can your retrieval task be decomposed into multiple smaller and simpler tasks? For example, instead of retrieving 10 key-value pairs from a corpus all at once, could you retrieve a single key-value pair from the corpus (potentially a smaller corpus) each time, ten times in parallel? (See the sketch after this list.) This would make your retrieval problem more solvable with LLMs, since LLMs retrieve better from smaller input contexts and fewer needles.
- Freshness: How frequently updated is that information? This will determine your storage and caching strategy.
- Latency: How quickly do you need to retrieve? If you have a greater time budget — e.g. offline or batched tasks that aren’t time sensitive, or cases where you know queries in advance — you can introduce more verification/error correction (e.g. with a critic, or with a re-ranker) downstream.
- Cost: What is your budget for the retrieval task? If you are extremely cost-sensitive, LLMs won’t be a good choice.
- Error tolerance: How much room is there for mistakes? Do you want to bias towards greater recall (i.e. retrieving more information in the hopes of ensuring that the correct answer is amongst the data fetched) or greater precision (all information retrieved must be as correct as possible)? Whether you can afford (time-wise and cost-wise) to use a re-ranker downstream also affects your overall design.
- Development speed: How much time do you have to build out the system? If you have no time at all and want something working quickly, you could probably survive with using LLMs to retrieve data, as a prototype, before iterating on it. On the flip side, more complex systems will take more time to build and require more maintenance in the future.
Note that this is just an initial list of questions — it is not meant to be comprehensive. A more complete treatment of this topic will certainly require a separate blog post or paper.
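As an example of the task granularity point, here is a minimal sketch of sharding a multi-needle retrieval into parallel single-needle calls, reusing the hypothetical `query_model` helper from earlier. (You could additionally shard the haystack itself into smaller chunks.)

```python
import asyncio

async def retrieve_sharded(phone_book: dict[str, str], names: list[str], model: str) -> dict[str, str]:
    """Ask for one needle per call instead of all needles at once."""

    async def retrieve_one(name: str) -> tuple[str, str]:
        # Each call carries the same haystack but only a single query,
        # which keeps the needle count (and thus needle overload) minimal.
        response = await asyncio.to_thread(query_model, model, phone_book, [name])
        line = response.strip().splitlines()[0] if response.strip() else f"{name} : NOT_FOUND"
        _, _, number = line.partition(" : ")
        return name, number.strip()

    results = await asyncio.gather(*(retrieve_one(name) for name in names))
    return dict(results)

# Usage: asyncio.run(retrieve_sharded(phone_book, ["heeuvnzhvevkvqn", "hjbbjzxdabxfvsf"], "gpt-4o-mini"))
```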
If LLMs are suitable for your retrieval task, which long-context models should you use?
Depending on your specific use case and your sensitivity to these eight factors, it may still be viable to use long-context models for retrieval.
- If you need to retrieve information from up to ~100K tokens, you have a good number of choices: gpt-4o, all of the Claude models, Gemini 1.5 Flash (regular), and Gemini 1.5 Pro. All of these scored >80% in the M=5000, N=100 experiment. Of these, Gemini 1.5 Flash is the cheapest, and gpt-4o has the best performance.
- If you need to work between ~100K and ~200K tokens, Claude Haiku will probably be the most cost-effective, at $1/1M input tokens (or $0.10/1M tokens if you use context caching, which is viable if you expect to field a large number of requests in short order with the same input tokens). If you are less price sensitive, Gemini 1.5 Pro will net you better performance, for a higher cost of $2.50/1M input tokens.
- If you need between 200K and 1M tokens, want a simple architecture and are not sensitive to cost and latency, Gemini 1.5 Pro is your only viable choice. And it isn’t a bad choice at all — performance is consistently >80% at up to 500K tokens, and it is still >50% at up to 1M tokens.
- If you need >1M tokens or if you are sensitive to cost and/or latency, I would advise you to invest in a more sophisticated RAG architecture, e.g. chunking documents and retrieving chunks by relevance (semantic or otherwise) before generation (a minimal sketch follows below).
Most production applications probably fall into the last category — being sensitive to cost and/or latency. However, if you are building a quick prototype and don’t yet need to scale or optimize for cost/latency, some large-context models are very viable options.
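For that last category, here is a bare-bones sketch of the chunk-embed-retrieve pattern. The `embed` function below is a toy stand-in; in practice you would use a real embedding model and a vector store.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy stand-in for a real embedding model: a hashed bag-of-words vector."""
    vec = np.zeros(256)
    for word in text.lower().split():
        vec[hash(word) % 256] += 1.0
    return vec

def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Split a document into overlapping character-based chunks."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

def retrieve_top_k(query: str, chunks: list[str], k: int = 5) -> list[str]:
    """Return the k chunks most similar to the query by cosine similarity."""
    index = np.stack([embed(chunk) for chunk in chunks])
    q = embed(query)
    scores = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q) + 1e-9)
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]

# Only the retrieved chunks (a few thousand tokens) are passed to the LLM for generation,
# instead of the full corpus.
```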
Potential explanations
Why do we observe poorer retrieval performance when we (1) use more of a model’s available context (context saturation), (2) retrieve data from later in the context, and (3) when we retrieve more information from the context?
The first and third observations may seem fairly obvious at first glance — after all, it is perhaps natural to expect degraded performance when we are asking more of a model. However, the second observation does not conform to this logic — whether we are retrieving information from the beginning, middle or end of a model’s input context, we are in some sense still placing the same load on the model, by virtue of having exactly the same haystack and needle count (read: input token size), and the same reasoning or instruction following requirements.
I am not an expert in LLM training — but I think that the design and training strategies of LLMs likely contribute to these phenomena: specifically, (i) attention mechanism limitations, (ii) positional embeddings, and (iii) training data bias.
Context saturation: accuracy loss with larger contexts
As we feed more tokens into a model, its retrieval performance begins to drop off — a phenomenon that can be attributed to context saturation. Attention mechanisms, which assign weights to tokens based on relevance, struggle to handle the sheer number of interactions in longer sequences. With a haystack of 10,000 tokens, for instance, the attention allocated to each individual token may be less than 0.01%. This dilution makes it increasingly difficult for the model to identify and prioritize the correct “needles,” especially when the haystack is cluttered with irrelevant information.
The problem is compounded by training biases. Most models are trained on shorter contexts due to both the natural length distribution of training datasets and the practical constraints of computational resources. This means they perform well when haystacks are compact — say, 1,000 to 2,000 tokens — but falter as we approach their maximum context lengths. Even though a model might claim to handle 128K tokens, its ability to effectively utilize that length is often much more limited in practice. The result? Retrieval accuracy plummets as we ask the model to handle larger haystacks, even when the needle count remains fixed.
Attention decay: accuracy loss later in the context
Even with a fixed haystack size, the position of the needle can drastically affect retrieval performance. Models consistently retrieve more accurately from the beginning of the input sequence, with accuracy dropping off as we move further into the context. This is partly due to positional embedding decay: the way models encode token positions becomes less precise as positions grow larger. Tokens near the 90% mark of a sequence are effectively represented with “blurry” embeddings, making it harder for the model to distinguish them accurately.
Attention mechanisms also introduce a proximal bias, where tokens closer to each other in the sequence receive disproportionately more attention. This bias is particularly pronounced in models that use sparse attention strategies, like windowed or local-global mechanisms, which prioritize closer token relationships at the expense of global ones. For example, retrieving a phone number from the last bin of a 5,000-entry phone book (~90K tokens) is much harder than retrieving one from the first bin, even though the haystack size and needle count remain constant. This bias is further exacerbated by limited training on tasks that involve retrieving from deep within a context, leaving models ill-prepared to handle information located far from the start of the sequence.
Needle overload: accuracy loss with more needles
Finally, retrieval performance takes a nosedive as the needle count increases, even when the haystack size remains unchanged. This happens because the model’s attention mechanism, already stretched thin, now has to allocate resources across more retrieval targets. For instance, when asking a model to retrieve 100 needles from a 5,000-entry phone book (~90K tokens), each needle competes for the same limited pool of attention, leading to a substantial drop in accuracy. The more needles we add, the harder it becomes for the model to find the correct ones without interference from distractors.
This issue ties back to context saturation, but with an added layer of complexity. While models are often optimized for single-query tasks (retrieving a single needle from a haystack), they struggle with high needle counts because such scenarios are rarely encountered in training. The phone book task — with its synthetic, evenly distributed needles — exposes this limitation starkly. Models aren’t just being asked to recall information; they’re being overwhelmed by the sheer density of retrieval demands, which further amplifies the challenges posed by diluted attention weights and saturating contexts.
Quantifying long-context retrieval accuracy loss
There are myriad ways to quantify the accuracy loss due to the phenomena above. Any attempt to quantify a model’s performance necessarily involves some compression — trading away some level of detail (relative to the full “performance profile”) for the ability to compare performance more easily across models.
Here are some possible metrics to do so:
- Maximum Usable Context (MUC): For a given model, this is the maximum context length where retrieval performance for 100 needles does not drop below 80% along any part of the context. Ideally, this should be equal to the model’s advertised context length.
- Context Saturation Index (CSI): How much does retrieval accuracy drop, in absolute percentage terms, when going from M=1K to the specified haystack size, for 100 needles? Scale of 0 to 1, lower is better.
- Attention Decay Index (ADI): How much does retrieval accuracy drop, in absolute percentage terms, from (typically) the beginning of the context to later in the context (typically near the end) when retrieval accuracy is the lowest, for 100 needles and the specified haystack size? Scale of 0 to 1, lower is better.
- Needle Overload Index (NOI): How much does retrieval accuracy drop, in absolute percentage terms, when we increase the no. of needles from 10 to 100, for the specified haystack size? Scale of 0 to 1, lower is better.
- Retrieval Resilience Score (RRS): 1 minus the average of CSI, ADI and NOI. Scale of 0 to 1. A higher score indicates greater resilience to context saturation, attention decay and needle overload.
- Long-Context Retrieval Dependability Score (LCRDS): Average retrieval accuracy percentage at the specified haystack size for 100 needles, multiplied by RRS. Scale of 0 to 100, higher is better.
We can measure CSI, ADI, NOI, RRS and LCRDS for each model at different haystack (context) sizes. A strong model should have CSI, ADI and NOI close to zero, which would result in an RRS close to 1 and an LCRDS close to 100 (i.e. perfect retrieval). A sketch of how these could be computed follows.
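Here is a sketch of how these metrics could be computed from per-bin accuracy curves, under my reading of the definitions above (the exact aggregation, e.g. mean vs. minimum per bin, is a judgment call):

```python
import numpy as np

def long_context_metrics(acc_baseline_n100: list[float],  # per-bin accuracy at M=1K, N=100
                         acc_target_n100: list[float],    # per-bin accuracy at the target haystack size, N=100
                         acc_target_n10: list[float]) -> dict[str, float]:
    """Compute CSI, ADI, NOI, RRS and LCRDS for one model at one haystack size."""
    base = np.array(acc_baseline_n100)
    tgt = np.array(acc_target_n100)
    tgt10 = np.array(acc_target_n10)

    csi = max(0.0, base.mean() - tgt.mean())    # drop when going from M=1K to the target haystack
    adi = max(0.0, tgt[0] - tgt.min())          # drop from the start of the context to its worst point
    noi = max(0.0, tgt10.mean() - tgt.mean())   # drop when going from 10 needles to 100 needles
    rrs = 1.0 - float(np.mean([csi, adi, noi]))
    lcrds = 100 * tgt.mean() * rrs              # average accuracy (as %) scaled by resilience
    return {"CSI": csi, "ADI": adi, "NOI": noi, "RRS": rrs, "LCRDS": lcrds}
```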
Metrics for selected models
| Model | Max supported context | Context as tested | CSI | ADI | NOI | RRS | LCRDS |
|---|---|---|---|---|---|---|---|
| xAI / grok-2-1212 | 131K | ~120K | 0.025 | 0.025 | 0.000 | 0.983 | 96.67% |
| OpenAI / gpt-4o-2024-08-06 | 128K | ~120K (max) | 0.032 | 0.032 | 0.004 | 0.977 | 96.27% |
Detailed results
OpenAI: gpt-4o-2024-08-06 and gpt-4o-mini-2024-07-18
The latest OpenAI models have a context window of 128K, and 4o performs extremely well across its entire context window. However, 4o-mini is significantly weaker.
At M=1000 and N=10, 4o retrieves everything correctly. 4o-mini performs very well too, though with significantly more variance:
At M=5000 and N=10 (which is approximately ~90+K tokens for both models, or 75% of their context window), 4o-mini starts deteriorating, hitting ~80% performance at the end.
Things get worse for 4o-mini at M=5000 and N=50:
At M=5000 and N=100, 4o-mini shows a very linear decline in performance from the beginning to the end of its context window, ending at less than 20%. Note that since M=5000 is only around 75% of its context window, there is a very good chance that performance will drop to practically zero if we exhaust all 128K tokens!
At the same time, 4o holds its ground and still manages close to perfect retrieval across its context window at M=5000 and N=100.
4o-mini is useable (>= 80% retrieval accuracy) at around 20% of its context window, which is around 25K tokens.
OpenAI: o1-2024-12-17
OpenAI's o1 series of models were released in preview in September 2024 and in full in December 2024. They promise even stronger reasoning ability than the gpt-4o models, by "thinking" before responding.
Given that o1 is much more expensive than gpt-4o (at $15 vs. $2.50 per 1M input tokens), I don't think that we will be seeing many long-context production applications of o1. However, o1 is the first OpenAI model to feature a context window larger than 128K – coming in at 200K, it matches the Claude family of models.
As expected, o1 performs well at M=1,000 – although there is a hint of variance even at this (relatively) small context.
At M=5,000, retrieval performance dips significantly – it starts out well, but drops almost linearly and ends at ~40% at the end. This is the case for all needle counts; the context saturation effect is much stronger than the "needle overload effect", as increasing the needle count from N=10 to N=100 does not cause performance to decline any further.
In fact, gpt-4o-mini performs better than o1 at M=5,000 and up to N=50!
The trend continues as we move up to M=10,000 – there is basically no retrieval ability past the 40% mark, for any needle count.
This is disappointing, for even Claude's cheapest model (haiku) performs significantly better at haiku's limit of ~M=8,000:
OpenAI: o1-mini-2024-09-12
The o1-mini model is much cheaper than o1, and costs roughly the same as 4o if we just consider input tokens – $3 for o1 vs. $2.50 for 4o for 1M input tokens. It also retains a 128K context window, instead of the 200K window that o1 has. OpenAI advertised o1-mini for coding-related tasks, and I have observed it to be really strong for this use case.
However, coding-related tasks often involve providing huge chunks of code – often entire code files – so long-context retrieval ability is quite important.
At M=1,000 (roughly ~20K tokens), performance already starts tapering off after the 60% mark. Note that there is a hint of needle overload here, as performance for N=10 is much better than N=50 or N=100, at the end of the input context.
At M=5,000, o1-mini is basically unusable, having no retrieval ability whatsoever after the ~20% mark. The chart below shows data for just N=10 and N=50, as I could not get o1-mini to run consistently for N=100 – in each set of 20 runs, I would encounter API timeouts after about 4 or 5 runs.
So tl;dr is: beware of providing large snippets of code to o1-mini! It is probably better at taking in a set of concise instructions and generating code, rather than analyzing code and suggesting changes (for example).
Claude: claude-3-5-haiku-20241022, claude-3-5-sonnet-20241022 and claude-3-opus-20240229
The Claude family of models all have a 200K context window. Opus is the most powerful, followed by Sonnet and Haiku. These are priced respectively at $15, $3 and $1 per million input tokens — which seems relatively expensive, given that the Gemini family is cheaper by an order of magnitude at $2.50, $0.15 and $0.075 for Pro, Flash and Flash 8B!
At M=1000 and N=100, all three models perform very well with just a hint of variance.
Things start to get interesting at M=5000 and N=100 (~102K tokens), with retrieval performance for all models dropping rather linearly to ~80% towards the end. Opus seems to have slightly less variance and performs just marginally better than Haiku and Sonnet, at around ~85% at the end.
At M=8000 and N=100 (~161K tokens), performance takes a further hit across all models, with Haiku and Sonnet dropping to 60+% at around the 2/3 mark, while Opus drops to ~80%. There is an interesting U-shaped performance curve, which is reminiscent of findings from the Lost in the Middle paper. The variance for all models is also huge — watch out if you use these models in production for long-context retrieval!
However, it is heartening that all three Claude models still perform decently at ~80% of their context length, and do not fail catastrophically (as, for example, 4o-mini does).
These are the performance graphs for each model:
Gemini: gemini-1.5-flash-8b, gemini-1.5-flash-002, and gemini-1.5-pro-002
The Gemini models currently have the longest context windows of any commercially available model today, topping out at 1M for both 1.5 Flash 8B and 1.5 Flash, and 2M for 1.5 Pro. Google has also priced them really competitively, and (anecdotally) many people favor 1.5 Flash for its sweet spot of price and performance. These models are also all multimodal, and long context is really useful for multimodal inputs — Google claims that 1.5 Flash (regular and 8B) support 3.6K images, 1 hour of video, or 9.5 hours of audio per prompt. I’ve not had the opportunity to send so much media in a single LLM call, but these do sound wild.
At M=1,000 and N=100, both Flash regular and Pro have more or less perfect performance, while Flash 8B performs pretty well for the first half and drops to zero for the second half. Note that this is only around 25K tokens — a mere 2-3% of Flash 8B’s maximum context window — so this is a disappointing result.
At M=5000 and N=100 (~120K tokens), both Pro and Flash regular start to falter a little towards the end, but still hold up really well. Flash 8B, on the other hand, starts off at ~70% and crashes to 0% by around 90% of input tokens (or even earlier, around the ~75% mark; a second-degree polynomial does not really fit well here).
At M=10,000 and N=100 (~240K tokens), Flash 8B’s performance decline continues — starting even lower at 60%, and hitting close to 0% by the halfway mark. Flash regular also worsens after ~40% of input context, linearly dropping to ~10% at the end. Pro holds up amazingly well here.
At M=20,000 and N=100 (~480K tokens), the story continues for both Flash regular and Flash 8B — both basically have no ability to retrieve properly after ~60% and ~40% of input tokens respectively. Pro linearly degrades to around 80% at the end, which is still really impressive.
At M=40,000 and N=100 (~1M tokens), Flash regular drops precipitously to zero at around 30% of the input context, while Pro drops to around 50% at the end. I didn't see any point in evaluating Flash 8B here, since its retrieval accuracy is guaranteed to be meaningless.
As a stress test, let’s see how Pro handles M=80,000 and N=100, which is just under its maximum context window of 2M tokens. (Note that I only did 10 runs for this, as sending 2M * 20 = 40M tokens would have been quite expensive at $2.50 * 2 * 20 = $100!)
Pro performs very similarly with 10 needles instead of 100 at M=80,000 – indicating that context saturation is the dominant effect at play here.
There is still a little retrieval ability past the halfway mark (which would be 1M tokens in), and this remains the case till the end of the context window. I would not consider this useable. If we define 50% retrieval performance as the cut-off point, the useable context is probably around 1M tokens.
In summary, these are the retrieval accuracy curves for Gemini 1.5 Pro (up to M=80K, approximately 2M tokens):
And this is the same graph for Gemini 1.5 Flash, up to M=40K (approximately 1M tokens). Please ignore the terrible fit for the M=40K curve…
Finally, this is the graph for Gemini 1.5 Flash 8B (up to M=20K, approximately 0.5M tokens):
Gemini: Comparing 2.0 Flash (experimental) vs. 1.5 Flash
For M=5,000 and N=100, Gemini 1.5 Flash seems to have more consistent performance overall, and has better retrieval accuracy from ~40% of the context window till the end. Gemini 2.0 Flash (experimental) has a very slight edge (which may not be statistically significant) in the first 40% of the context window, but falls dramatically after that. It is likely that Google optimized 2.0 differently from 1.5.
Also, Gemini 2.0 Flash had a tendency to refuse to generate text, often throwing errors such as:
ValueError: Invalid operation: The `response.text` quick accessor requires the response to contain a valid `Part`, but none were returned. The candidate's [finish_reason](https://ai.google.dev/api/generate-content#finishreason) is 4. Meaning that the model was reciting from copyrighted material.
I had not noticed such errors with Gemini 1.5; perhaps this is due to over-aggressive moderation by Google. Hopefully this is corrected by the time the full version of Gemini 2.0 is released – based on Google's assessment, 2.0 Flash is stronger at most benchmarks than even 1.5 Pro (and certainly 1.5 Flash).
Notably, 2.0 Flash scored slightly lower in the Multi-Round Co-reference (MRCR) long-context benchmark than even 1.5 Flash, at 1M tokens. This is consistent with our findings above, although the difference is not as stark. The MRCR might highlight the effects of context saturation slightly less than this phone book experiment – it essentially presents models with a long conversation, and contains two confusingly similar needles (key-value pairs) amongst a huge amount of distractor tokens. Models are expected to retrieve the value for one of the two keys. Scoring is typically done with a string similarity check.
Llama family of models
I tested a variety of Llama models across different generations, all of which supported ~128K context windows. These included:
- Llama 3.1 8B Instruct Turbo
- Llama 3.1 70B Instruct Turbo
- Hermes 3 by Nous Research, which is essentially a fine-tuned version of Llama 3.1 405B
- Llama 3.2 90B Vision Instruct
- Llama 3.3 70B Instruct Turbo
At M=1,000 and N=100, retrieval performance is generally decent, hovering at the ~90% mark. There is a lot of variation however.
At M=5,000 and N=50, retrieval performance is significantly degraded and even more scattered.
At M=5,000 and N=100, performance continues dropping – Llama 3.1 8B has practically zero retrieval accuracy. I was unable to complete this benchmark for Llama 3.2 90B Vision due to perpetual server timeouts.
In a nutshell, the Llama models aren't standouts for long-context retrieval accuracy, especially in complex multiple query / multiple needle cases.
Grok 2
The grok-2-1212 model performed remarkably well, notching nearly perfect retrieval near its context limit.
There is one unfortunate caveat: Grok 2 appears to require about ~25% more tokens for the same amount of text as most other models – a haystack size of M=4,000 required ~120K tokens, while most other models would hit the same amount of tokens with M=5,000.
That said, Grok 2 could be a good model to use for long-context retrieval. Just bear in mind that at $2/1M input tokens, it costs ~27x as much as Flash 1.5 (regular), which sits at $0.075/1M tokens (for prompts <= 128K tokens)! Flash’s performance is only marginally lower than Grok 2’s.
AI21 Jamba: 1.5 Large and 1.5 Mini
I had high hopes for Jamba 1.5, because of its generous 256K context window. However, both Jamba 1.5 large and Jamba 1.5 mini did not perform well at all, contrary to their top-3 rankings in the RULER paper.
At M=1000 and N=10, performance was already around 50% for the mini model and around 80% for the large model:
For M=1000 and N=100, the large model struggled significantly:
Things became consistently worse at M=5,000:
Performance was similar at M=10,000, but only because there wasn't much room to get worse. (Note that the colors representing each model are different in the following chart.)
Amazon Nova
Nova models are promising, with great performance benchmarks and a 300K context window. However in practice I’ve had problems running benchmarks for M >= 800 — responses have come back empty. I’ve tried running Nova on both Bedrock and OpenRouter, with similar results.
Deepseek: R1 and V3
Deepseek R1 has garnered lots of attention upon its release. So much has been said everywhere and I won't bother distracting everyone with my thoughts here (though perhaps in another article...).
R1's thinking is really evident. While its thinking has been much lauded and provides a certain level of transparency and self-assessment – generally leading to improved responses – such thinking may also be its Achilles' heel. My rough estimate is that its thinking may increase the number of tokens generated by around 5x. This is generally not great for any time-sensitive production application, although it may not be a problem for offline or batched use cases. The providers that I tried Deepseek on generally had slow inference – perhaps as a consequence of such thinking, or perhaps due to the extreme level of interest in Deepseek.
As an example, for M=5,000 and N=10, inference took anywhere from 5 minutes and up – timing out in most cases. All of my attempts at M=5,000 and N=100 timed out, so I was unable to derive any metrics from R1.
Here is a sample response for M=1,000 and N=10:
<think>
Okay, let's tackle this phone number lookup. The user provided ten names, and I need to find each one's phone number from the given phone book. Let me start by going through each name one by one.
First name: dcemraddoyuqcqb. I'll scan through the list. Hmm, found it! The entry is "dcemraddoyuqcqb : (148) 463-2717". Got that.
Next up: knmhjacfkwdfknl. Scrolling down... There it is: "knmhjacfkwdfknl : (508) 096-5261". Noted.
Third name: hbbjanaojscznxm. Let me look... Found: "hbbjanaojscznxm : (230) 201-8214". Okay.
Fourth: ekrywmfrfgfkrof. Searching... Here's the entry: "ekrywmfrfgfkrof : (447) 316-1962". Got it.
Fifth: vgegiumkgwwnyvm. Scanning... Found: "vgegiumkgwwnyvm : (633) 982-3386". Alright.
Sixth: qnitwcbasqbupoo. Looking through the list... There it is: "qnitwcbasqbupoo : (822) 760-7089". Wait, no, that's not right. Let me check again. Oh, here's the correct entry: "qnitwcbasqbupoo : (822) 760-7089". Wait, no, actually, looking again, maybe I misread. Let me confirm. The entry for qnitwcbasqbupoo is actually "qnitwcbasqbupoo : (822) 760-7089". Hmm, maybe I need to double-check. Wait, the user might have a typo, but assuming the name is correct, I'll go with what's listed.
Seventh: kquxuknaflfysml. Searching... Found: "kquxuknaflfysml : (814) 757-1852". Okay.
Eighth: khfyvmreaovowcx. Let me find that... Here: "khfyvmreaovowcx : (780) 942-1549". Got it.
Ninth: ffstlwilmmipusk. Scanning... Found: "ffstlwilmmipusk : (387) 705-3857". Alright.
Tenth: jkhknzbgtfjkrbe. Finally, looking for this... Here it is: "jkhknzbgtfjkrbe : (437) 889-6692". That's all ten.
Wait, let me cross-verify each entry again to ensure accuracy. Sometimes similar names can be confusing. Let me check each one again quickly. Yes, all entries match the names provided. Looks like all are present except maybe qnitwcbasqbupoo. Wait, in the phone book, the entry is "qnitwcbasqbupoo : (822) 760-7089". Yes, that's correct. Alright, I think that's all. All ten names have their corresponding numbers listed.
</think>
dcemraddoyuqcqb : (148) 463-2717
knmhjacfkwdfknl : (508) 096-5261
hbbjanaojscznxm : (230) 201-8214
ekrywmfrfgfkrof : (447) 316-1962
vgegiumkgwwnyvm : (633) 982-3386
qnitwcbasqbupoo : (822) 760-7089
kquxuknaflfysml : (814) 757-1852
khfyvmreaovowcx : (780) 942-1549
ffstlwilmmipusk : (387) 705-3857
jkhknzbgtfjkrbe : (437) 889-6692
For M=1,000 and N=100 (which was the most strenuous case I could reliably run), R1 performed almost perfectly:
Given the sheer "output token multiple" for inference, I would not recommend R1 for production retrieval use cases with multiple queries/needles. My intuition (which is very likely wrong) is that R1's strong reasoning does not necessarily shield it from the effects of context saturation, attention decay and needle overload.
Deepseek V3 performed just slightly worse than R1 for M=1,000 and N=100 – its retrieval accuracy dropped by ~5% at around the 80% mark:
I was unable to complete 20 runs for M=5,000 and N=100 due to timeouts, but I did manage to get a chart for M=5,000 and N=50:
Overall, performance is good but not great here. Retrieval accuracy would almost certainly drop further with N=100, likely beyond the point of usability.
Last Thoughts
Follow-ups
It would be interesting to see how various prompt changes might affect the retrieval performance of various models. Some possible changes:
- Repeat instructions at the very end of the input context.
- Provide some number of examples, introduce chain-of-thought, etc.
- Specify the phone book and needles in the user prompt instead of the system prompt.
In this blog post, I have also not assessed the effects of temperature, top_p, and other variables.
Code
The source code for everything in this blog post is available on this Github repo. (Note that the code isn't the tidiest; the repo began as a tiny experiment in October 2024 and grew rather organically.)
Credits
Thank you to Kiran, Ary, Toby and others for feedback and comments on earlier drafts of this blog post.
And thank you to Ario for sponsoring all of the LLM calls performed for this blog post!