
Why 10 million tokens won’t save your AI agent (and what will)

Read Time: 10 min
Posted on: May 8, 2026
By: Joaquin Bonifacino

New models can process larger inputs, and confuse themselves in the process. Context management techniques can solve the problem.

We are currently living through the AI equivalent of a horsepower war.

Not so long ago, major labs were announcing a model with an astronomically larger context window every few months. We went from 8,000 tokens to 100,000, then to a million. Today, you can find experimental models like Meta’s Llama 4 Scout boasting a 10-million-token capacity.

The underlying assumption is simple, if somewhat brute-force: if we just give the AI a big enough brain to hold every piece of documentation, every line of code, and every chat history simultaneously, it will finally be able to execute complex, long-running tasks autonomously.

But if you’ve actually tried to build an agent that runs for days, weeks, or months, you already know the dirty secret of the AI industry: giving an agent a massive context window doesn't make it a genius. It makes it a digital hoarder.


Stop the context rot

When you stuff a prompt to the gills, you trigger a phenomenon we call “context rot”.

As the context window fills up, the model’s performance actively degrades. It starts hallucinating. It loses the plot. It forgets the original objective you gave it three days ago, stops working, and nervously asks you: "Should I keep going?"

Thanks to the “needle in a haystack” benchmark, which measures how reliably an LLM can retrieve a single fact from a long context, the community has realized that simply throwing more tokens at the problem is a dead end.

The future of autonomous AI isn't about building a bigger disembodied brain. It’s about building a better office for that brain to work in. We call this the "harness."

Stop managing the prompt, start engineering the environment

Think about how you - a human - execute a project that takes three weeks.

You don’t try to hold the entire codebase, all your Jira tickets, and every Slack message in your active working memory at the exact same time. You’d lose your mind.

Instead, you use your environment. You write things down. You put files in folders. You leave yourself sticky notes. You delegate tasks to coworkers.

We need to stop treating AI agents like isolated brains trapped in a chat box and start giving them the same environmental affordances we rely on.

The "harness" surrounds the LLM with the environmental tools - like file systems and memory backends - needed to execute long-horizon tasks.


It is the environment where the agent lives. It is the scaffolding, the tools, the permissions, and the sandboxes we wrap around the foundational model.

If we want an agent to autonomously crawl the web, write code, run tests, and fix its own bugs over a month-long horizon, we have to teach it how to ruthlessly manage its own context.

Here is how we actually make that happen.

The power of context offloading

The golden rule of long-running agents is simple: Do not load what you do not immediately need.

Imagine your agent needs to find a specific event date buried in a massive HTML file. The amateur approach is to dump the entire raw HTML into the context window. Congratulations, you’ve just burned 8,000 tokens, cluttered the agent’s working memory, and invited hallucinations.

The professional approach is “context offloading”. Instead of giving the agent the HTML, you give the agent a secure sandbox - a temporary, isolated workspace where it can read, write, and execute code. You give it a goal; it decides what to do and in what order, like downloading the HTML to a file in that sandbox and running a simple terminal command (like grep) to search for the date.

The agent gets the exact answer it needs. It uses four tokens instead of 8,000. And its mental whiteboard remains perfectly clean.
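As a minimal sketch of this pattern in Python (the sandbox layout and the `offload_and_search` helper are illustrative assumptions, not part of any real agent framework):

```python
import subprocess
import tempfile
from pathlib import Path

def offload_and_search(html: str, pattern: str) -> list[str]:
    """Write bulky HTML to a sandbox file, then grep it so that only
    the matching lines - not the whole page - enter the agent's context."""
    sandbox = Path(tempfile.mkdtemp(prefix="agent-sandbox-"))
    page = sandbox / "page.html"
    page.write_text(html)

    # grep -oE prints only the matched substrings; exit code 1 just
    # means "no match", so we don't raise on it
    result = subprocess.run(
        ["grep", "-oE", pattern, str(page)],
        capture_output=True, text=True,
    )
    return result.stdout.splitlines()

matches = offload_and_search(
    "<html><body>Concert on <b>2026-05-08</b> in Dublin</body></html>",
    r"[0-9]{4}-[0-9]{2}-[0-9]{2}",
)
# A handful of tokens go back to the model instead of the full page
```

The same idea scales to any bulky tool output: persist it to disk first, then query it with cheap, precise tools.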

This sandbox approach changes everything. For agents that run for weeks, they can use their file system to save their own state. If an agent realizes its context window is hitting 80% capacity, it can proactively write a summary of its progress to a text file. When that specific instance of the agent hits its limit and dies, the next agent in the relay simply reads the summary file and picks up exactly where the last one left off. It’s an approach that has been popularized by Anthropic in Claude.
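The checkpoint-and-relay pattern can be sketched like this (the 80% threshold, the `progress.md` filename, and the naive word-count tokenizer are all assumptions for illustration):

```python
from pathlib import Path

CONTEXT_LIMIT = 100  # tokens; tiny on purpose, for illustration
CHECKPOINT = Path("progress.md")

def tokens_used(messages: list[str]) -> int:
    # Naive stand-in for a real tokenizer
    return sum(len(m.split()) for m in messages)

def maybe_checkpoint(messages: list[str], summary: str) -> bool:
    """At 80% of the context budget, persist a progress summary to disk
    so the next agent instance in the relay can resume from it."""
    if tokens_used(messages) >= 0.8 * CONTEXT_LIMIT:
        CHECKPOINT.write_text(summary)
        return True
    return False

def resume() -> str:
    """A freshly spawned agent starts by reading the last checkpoint."""
    return CHECKPOINT.read_text() if CHECKPOINT.exists() else "No prior state."
```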

Mental housekeeping: Compaction and summarization

Developers are inherently lazy. When we use AI coding assistants, we rarely start fresh, clean chat threads. We just keep iterating in the same window.

For an agent, this is fatal. If you asked an agent to read five files 20 minutes ago, those files - and the long tool-call outputs associated with them - are still sitting in its context window, silently degrading its ability to reason about your current request and costing users time and money.

To keep agents running indefinitely, the harness must perform mental housekeeping. We do this through compaction and summarization.

[Figure: context window length over time for a long-running agent]

Compaction is the act of automatically trimming the fat. Every few turns, the harness quietly reaches into the agent's context and deletes old, bulky tool responses, replacing them with a tiny note that says: "If you need this result again, it is saved in [File X]."

Notice how the context length drops sharply during regular “compaction” phases, and resets entirely during a “summarization”.
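A minimal sketch of compaction (the message format and the `store` dict are illustrative assumptions, not a real harness API):

```python
MAX_TOOL_OUTPUT = 200  # characters kept inline; longer outputs get offloaded

def compact(history: list[dict], store: dict) -> list[dict]:
    """Swap bulky tool responses for a pointer note, keeping the full
    output retrievable from the store if the agent needs it again."""
    compacted = []
    for i, msg in enumerate(history):
        if msg["role"] == "tool" and len(msg["content"]) > MAX_TOOL_OUTPUT:
            key = f"tool-output-{i}"
            store[key] = msg["content"]
            msg = {
                "role": "tool",
                "content": f"[Trimmed. If you need this result again, it is saved in {key}]",
            }
        compacted.append(msg)
    return compacted

store: dict = {}
history = [
    {"role": "user", "content": "Scrape the pricing page"},
    {"role": "tool", "content": "<html>" + "x" * 5000 + "</html>"},
]
history = compact(history, store)
```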

When compaction isn't enough and the context window inevitably fills up, we trigger summarization. The harness pauses the agent, hands its entire messy context to a secondary, cheaper model, and says something like: "Summarize what has been done and what needs to happen next." The harness then wipes the main agent’s memory entirely, inserts only that brief summary, and lets it start fresh.
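And summarization, sketched with a stand-in for the cheaper model (any real LLM client would take its place; the prompt and message shapes are assumptions):

```python
def summarize_and_reset(history: list[dict], cheap_model) -> list[dict]:
    """Hand the full messy context to a cheaper model, then restart the
    main agent with only the resulting summary."""
    transcript = "\n".join(m["content"] for m in history)
    summary = cheap_model(
        "Summarize what has been done and what needs to happen next:\n" + transcript
    )
    return [{"role": "system", "content": summary}]

# Stand-in for a real LLM call
fake_cheap_model = lambda prompt: "Crawled 3 sites; next: parse the product pages."

history = [{"role": "user", "content": "..."}] * 50
history = summarize_and_reset(history, fake_cheap_model)
```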

Defending the context layer with subagents

A good manager doesn't get bogged down in the weeds of a menial task; they delegate it. Deep agents should do the same.

When a long-running agent encounters a massive, complex sub-task - like running an exploration algorithm or testing a new piece of code - it shouldn't do that work in its main reasoning loop. That will cause massive context rot.


Instead, the harness should allow the main agent to spawn a temporary copy of itself.

The main agent writes a specific prompt for this "subagent," spins it up in an isolated environment, and waits. The subagent does the heavy lifting, burns through thousands of tokens, finds the answer, returns only the final result to the main agent, and then terminates.

This is how you defend the primary context layer. You keep the main agent's mind clear, focused solely on high-level orchestration, while disposable subagents take the cognitive hit of the dirty work.
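A toy sketch of this delegate-and-discard pattern (the `model` callable and the prompts are placeholders for a real agent loop):

```python
def run_subagent(prompt: str, model) -> str:
    """Spin up an isolated context for the sub-task, let the subagent
    burn tokens in it, and return only the final answer."""
    sub_context = [{"role": "user", "content": prompt}]
    # ...in a real harness the subagent would loop, call tools, and
    # grow sub_context by thousands of tokens here...
    answer = model(sub_context)
    return answer  # everything else dies with the subagent

def main_agent_step(task: str, model) -> str:
    """The orchestrator delegates the heavy lifting and keeps only the result."""
    result = run_subagent(f"Test this code and report the outcome: {task}", model)
    return f"Subagent reports: {result}"

# Stand-in for a real LLM call
fake_model = lambda context: "all 42 tests pass"
```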

Build less, understand more

Recently, I was talking to a developer who worked on Manus AI (an autonomous system recently acquired by Meta). He summarized the future of this space perfectly: "Build less, understand more".

For a long time, the instinct in AI engineering has been to micromanage the models. We write labyrinthine, 5,000-word system prompts outlining every possible edge case, hoping to control the agent's behavior.

But long-running autonomy doesn't come from a perfectly engineered prompt. It comes from the harness.

If you want an agent that can work for 30 days straight, stop trying to shove the entire world into its context window. Give it a file system. Give it a terminal. Give it the ability to delegate to subagents, summarize its own thoughts, and offload its memory to a sandbox.

If you give an AI the right environment, you won't need 10 million tokens for it to change the world.

© Zyte Group Limited 2026