Your AI Gets Dumber the Longer You Use It in a Single Session
Chroma tested 18 frontier models on long-context tasks. Every single one degraded. Here is what that means for how you work.
Something happens in most AI work sessions around the 30 to 40 minute mark. The output quality slips. Responses get vaguer. The model starts missing things that were stated clearly earlier, or contradicts something it said 10 messages ago. Most people re-prompt a few times and eventually open a new chat.
The cause has a name: context rot.
Chroma's research team spent 2025 running the most systematic test of long-context AI performance published to date. They tested 18 frontier models, including GPT-4.1, Claude Opus 4, Gemini 2.5, and Qwen3. Their finding: “models do not use their context uniformly; instead, their performance grows increasingly unreliable as input length grows.” Every model. Every length increment tested.
For operators running AI-assisted workflows daily, this is the mechanism behind quality problems that usually get blamed on prompting, model choice, or user error. The model is working as designed. What is breaking is how sessions are structured.
Why the quality drops
When you send text to a language model, it does not read linearly. It computes relationships between every token and every other token simultaneously. At 10,000 tokens, that is 100 million pairwise computations. At 100,000 tokens, where a 40-minute AI session often lands, it is 10 billion.
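The scaling is easy to check yourself. A quick back-of-the-envelope in plain Python (the pair count ignores heads and layers, but the square is the point):

```python
# Self-attention relates every token to every other token, so the
# number of pairwise computations grows with the square of the input.
for tokens in [1_000, 10_000, 100_000, 1_000_000]:
    pairs = tokens ** 2
    print(f"{tokens:>9,} tokens -> {pairs:>19,} pairwise computations")

# 1,000 tokens -> 1,000,000
# 10,000 tokens -> 100,000,000 (100 million)
# 100,000 tokens -> 10,000,000,000 (10 billion)
# 1,000,000 tokens -> 1,000,000,000,000 (1 trillion)
```

Ten times the input means a hundred times the work, and the model's attention budget is spread across all of it.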
As context grows, attention dilutes. Each relevant piece of input receives proportionally less computational weight against all the other tokens in the window. The model still processes everything. It just cannot attend to any of it as precisely.
Researchers at Stanford documented a second mechanism in a 2024 paper: the “lost-in-the-middle” effect. Models follow a U-shaped attention pattern across long inputs. They are reliable at the start, reliable at the end, and measurably weaker in the middle. In multi-document tests with 20 documents, accuracy dropped by more than 30 percent when relevant information was placed in middle positions versus the beginning or end of the input. Same model. Same information. Different position. Thirty-plus percent accuracy gap.
The practical implication: anything you add mid-session is processed at a fraction of the quality it would receive at position 1. Your system prompt at the top gets solid attention. A key piece of context you paste at message 15 competes with everything that came before it.
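You can watch this happen in your own stack. The sketch below sweeps a single fact across positions in a stack of filler text and checks whether the model retrieves it. It assumes the OpenAI Python SDK; the fact, filler, and model name are placeholders, not recommendations:

```python
# Minimal position-sweep sketch. Assumes the OpenAI Python SDK
# (pip install openai) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()
FACT = "The project codename is BLUE HERON."
QUESTION = "What is the project codename? Reply with the codename only."
# 20 blocks of irrelevant filler, each a few hundred tokens long.
FILLER = ["Unrelated background discussion about quarterly logistics. " * 40] * 20

def ask_with_fact_at(position: int) -> str:
    docs = list(FILLER)
    docs.insert(position, FACT)  # bury the fact at the chosen depth
    prompt = "\n\n".join(docs) + "\n\n" + QUESTION
    resp = client.chat.completions.create(
        model="gpt-4.1-mini",  # placeholder; use whatever model you run
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

for pos in (0, 10, 20):  # start, middle, end
    answer = ask_with_fact_at(pos)
    print(pos, "hit" if "BLUE HERON" in answer.upper() else "miss", "-", answer)
```

Single runs are noisy, so treat any one sweep as indicative rather than proof; run enough trials and the middle positions show the drop the Stanford paper describes.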
Chroma also documented a third mechanism called distractor interference. Semantically similar but irrelevant content degrades model performance beyond what context length alone explains. Every time you paste in background material that is related to the topic but not the specific task, you are adding noise the model has to compete against. The Chroma team found this effect compounds: four semantically close distractors hurt more than one, and the damage does not scale in a straight line.
The spec sheet number is the wrong number
Every frontier model advertises a context window. That number is the technical ceiling: the maximum the API accepts without error. Almost no lab publishes the effective context length: the range over which the model still reasons reliably, not merely the number of tokens it can ingest.
A May 2026 benchmark analysis tested multi-needle retrieval tasks (the kind that resembles real business work: finding and integrating multiple pieces of information across a long document) and found: “the gap between what a model accepts and what a model can reliably use is enormous, and almost nobody in marketing pages is being honest about it.”
The numbers from those benchmarks: Claude Opus 4.6 at 128K tokens with 8 retrieval targets scores 93 percent. At 1 million tokens, it drops to 76 percent. That is the category leader. Most other models sit in the twenties and thirties at 1 million tokens.
Llama 4 Scout launched in April 2025 with a 10 million token context window as its headline feature. On long-context reasoning benchmarks, it scores 15.6 percent. The model with the largest advertised context window posted the worst long-context reasoning numbers.
The honest translation of most context window spec sheets: the effective working range is roughly 50 to 70 percent of the advertised number, with continuous quality degradation starting from the first token, well before you near the ceiling.
For three years I have watched operators select models based primarily on advertised context window size, treating it as a proxy for capability. It is a marketing specification. The effective context is what you are actually working with, and almost no one publishes it in any useful way.
The 35-minute wall
Research on long-running AI agents identified a consistent threshold: agent success rates drop after approximately 35 minutes of continuous operation. The failure relationship is non-linear. Double the session length and the failure rate quadruples, because context rot is self-reinforcing.
The loop works like this: a longer session means more accumulated context. More accumulated context means worse output quality. Worse output quality means corrections and re-prompting. Corrections add more context. The cycle accelerates. Sessions tend not to fail gradually. They hold up reasonably well for a while, then drop.
Cognition measured that in long-running AI agent tasks, over 60 percent of the first turn is spent just retrieving context, not reasoning, not producing output. Retrieval. Every search result, every file read, every exploration that turns out to be a dead end stays in the context window for the rest of the session, accumulating like sediment.
For operators using AI for proposals, research, long-form content, or multi-step client deliverables, the 35-minute threshold is a real operational constraint. The sessions that feel like they went off the rails usually hit this point. The model held up as long as it could given what was in the context. The session structure created the conditions for failure.
The assumption that makes this worse
The pattern I keep seeing with operators who use AI every day is a belief that more context makes for a smarter session. Feed the model the full background document, prior conversation history, everything potentially relevant, and the model has more to work with. That feels right.
What actually happens: every piece of context you add lowers the signal-to-noise ratio on the input that matters. A crisp system prompt at the start of a session competes with 15,000 tokens of accumulated exchange for the model’s attention by message 25. The noise floor rises. The model is getting more distracted as the session continues, not smarter.
Anthropic's engineering team has published on what they call “context engineering,” specifically the discipline of curating the optimal token set during inference rather than maximizing what gets loaded in. Their research shows 80 percent of performance variance in long-context tasks comes down to how well the context is managed, not which model is being used.
That reframing is worth sitting with. The capability gap operators try to solve by switching models or paying for larger context windows is often a context management problem. The model has the capability. The session design is limiting what the model can do with it.
Where this legitimately breaks down
Context rot is architectural in current transformer models. Chroma ran their tests specifically to find exceptions and found none across 18 models. This will likely improve as AI architecture evolves, but there is no model on the market today that is immune to it.
RAG (retrieval-augmented generation, where you pull only the relevant documents rather than loading everything into context at once) helps substantially. A well-built RAG setup means the model gets a small, precise slice of relevant information rather than a haystack it has to search through. The catch: most SMB operators are not running RAG pipelines. That requires developer setup and ongoing maintenance, and the gap between what the vendor marketing implies and what it actually takes to implement is real. It is not the quick fix it is often described as.
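For a sense of what the retrieval step actually involves, here is a minimal sketch using embedding similarity. It assumes the OpenAI Python SDK and numpy; the documents are placeholders, and a production pipeline adds chunking, indexing, and refresh logic, which is where the real maintenance cost lives:

```python
# Sketch: retrieve only the most relevant chunks instead of pasting
# everything into context. Assumes the OpenAI Python SDK and numpy.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# In practice this is your chunked knowledge base, not three strings.
documents = [
    "Q3 pricing changes for the enterprise tier ...",
    "Onboarding checklist for new client engagements ...",
    "Refund policy and the exception approval process ...",
]

doc_vecs = embed(documents)
query = "What is our refund policy for annual plans?"
q_vec = embed([query])[0]

# Cosine similarity between the query and every chunk; keep the top 2.
scores = doc_vecs @ q_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec))
top = np.argsort(scores)[::-1][:2]

context = "\n\n".join(documents[i] for i in top)
print(f"{context}\n\nAnswer using only the context above:\n{query}")
```

The model sees two relevant chunks instead of the whole knowledge base, which is the entire point: a small, precise slice instead of a haystack.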
Starting fresh sessions resets the rot but creates a continuity cost. The model loses the prior conversation. Sometimes that means re-establishing context that took time to build. Short sessions solve one problem and introduce another. There is a real trade-off and no workaround that fully eliminates it in today’s tools.
There is also no single number from a model card you can trust for your specific use case. Effective context length varies by task type, input structure, and what is actually in the context window. Until labs publish multi-needle retrieval benchmark scores at multiple context lengths as a standard part of their model documentation (most do not), you are working with incomplete information on a variable that directly affects your output quality.
Three adjustments worth making
Work in shorter sessions. Thirty minutes of focused AI work in a clean session, carrying only the output into the next session rather than the full conversation, consistently produces better results than one long session where context accumulates. This is the highest-leverage adjustment most operators can make without changing tools, prompting strategy, or workflow.
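A minimal sketch of that handoff, assuming the OpenAI Python SDK; the model name and prompts are illustrative:

```python
# Sketch: end a session by distilling it into a short brief, then seed
# a clean session with only the brief instead of the full transcript.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4.1-mini"  # placeholder

def chat(content: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": content}]
    )
    return resp.choices[0].message.content

# Session 1: do the work, then distill it before closing the session.
draft = chat("Draft the outline for the Q3 client proposal ...")
brief = chat(
    "Summarize the decisions and open items below in under 200 words, "
    f"so a fresh session can pick up the work:\n\n{draft}"
)

# Session 2: clean context, seeded only with the brief.
print(chat(f"Context from the previous session:\n{brief}\n\n"
           "Now draft section 2 of the proposal."))
```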
Front-load what matters. Critical constraints, the specific task, key requirements: those go at the top of the prompt every time. The lost-in-the-middle research is precise on this. Whatever you place in the middle of a long context is processed with a fraction of the attention it would get at position 1. If something matters to the quality of the output, it belongs at the top, not buried after background context.
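One simple pattern that follows from the U-shape: task and constraints first, background in the middle, and the ask restated at the end, where attention recovers. A sketch; the section labels are a convention I find readable, not a standard:

```python
# Sketch: order a prompt so the critical pieces sit where attention is
# strongest (the start and the end), and background sits in the middle.
def build_prompt(task: str, constraints: list[str], background: str) -> str:
    rules = "\n".join(f"- {c}" for c in constraints)
    return (
        f"TASK: {task}\n\n"
        f"HARD CONSTRAINTS (all must be satisfied):\n{rules}\n\n"
        f"BACKGROUND (reference only):\n{background}\n\n"
        f"REMINDER: {task}"  # restate the ask at the end of the context
    )

print(build_prompt(
    task="Write a 300-word summary of the report below for a CFO.",
    constraints=["No jargon", "Cite only figures from the background", "Under 300 words"],
    background="(long report text goes here)",
))
```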
Narrow the context deliberately before each session. Before you start, ask what the model actually needs to know for this specific task. Paste in that and nothing else. Leave out the background material that merely might be useful. Every additional document or prior message you add competes with the signal that actually matters for the task at hand.
If you are building anything on AI APIs (customer-facing agents, internal tools, automated outbound) and you are not currently measuring output quality as session length or context size increases, that is the first metric worth adding. Build a simple test with your actual output type and run it at 10 minutes, 20 minutes, and 35 minutes. See where your specific workflow starts to slip.
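A minimal version of that test, again assuming the OpenAI Python SDK: pad one representative task with growing amounts of noise and score the output the same way each run. The keyword scoring here is a crude stand-in; substitute a rubric that matches your real output:

```python
# Sketch: measure output quality as context size grows, to find where
# your workflow starts to slip. Task, padding, and scoring are placeholders.
from openai import OpenAI

client = OpenAI()
NOTES = ("Action item: send the revised proposal. "
         "Action item: confirm the kickoff date. "
         "Action item: share the budget sheet. ")
TASK = "List the three action items from the meeting notes above."
PADDING = "Unrelated background discussion about other projects. " * 200

def score(answer: str) -> int:
    return sum(t in answer.lower() for t in ("proposal", "kickoff", "budget"))

for blocks in (1, 10, 50, 100):  # roughly simulates a growing session
    context = PADDING * blocks + NOTES
    resp = client.chat.completions.create(
        model="gpt-4.1-mini",  # placeholder
        messages=[{"role": "user", "content": f"{context}\n\n{TASK}"}],
    )
    print(f"{blocks:>4} noise blocks -> score {score(resp.choices[0].message.content)}/3")
```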
If any of this maps onto how you are using AI in your business and you want to work through your specific workflows, that is what I do in an AI Clarity Call at muddventures.com/book. And if you want to go deeper with a community of operators working through the same problems every day, whop.com/abra-ai is where that conversation lives.
Andrew

