
Why 10 million tokens won’t save your AI agent (and what will)

Read Time: 10 min
Posted on: May 8, 2026
By: Joaquin Bonifacino

New models can process larger inputs, and confuse themselves in the process. Context management techniques can solve the problem.

We are currently living through the AI equivalent of a horsepower war.

Not so long ago, major labs were announcing a model with an astronomically larger context window every few months. We went from 8,000 tokens to 100,000, then to a million. Today, you can find experimental models like Meta’s Llama 4 Scout boasting a 10-million-token capacity.

The underlying assumption is simple, if somewhat brute-force: if we just give the AI a big enough brain to hold every piece of documentation, every line of code, and every chat history simultaneously, it will finally be able to execute complex, long-running tasks autonomously.

But if you’ve actually tried to build an agent that runs for days, weeks, or months, you already know the dirty secret of the AI industry: giving an agent a massive context window doesn't make it a genius. It makes it a digital hoarder.


Stop the context rot

When you stuff a prompt to the gills, you trigger a phenomenon we call “context rot”.

As the context window fills up, the model’s performance actively degrades. It starts hallucinating. It loses the plot. It forgets the original objective you gave it three days ago, stops working, and nervously asks you: "Should I keep going?"

Thanks to the “needle in a haystack” benchmark, which measures how reliably an LLM can retrieve a single fact from a long context, the community has realized that simply throwing more tokens at the problem is a dead end.

The future of autonomous AI isn't about building a bigger disembodied brain. It’s about building a better office for that brain to work in. We call this the "harness."

Stop managing the prompt, start engineering the environment

Think about how you - a human - execute a project that takes three weeks.

You don’t try to hold the entire codebase, all your Jira tickets, and every Slack message in your active working memory at the exact same time. You’d lose your mind.

Instead, you use your environment. You write things down. You put files in folders. You leave yourself sticky notes. You delegate tasks to coworkers.

We need to stop treating AI agents like isolated brains trapped in a chat box and start giving them the same environmental affordances we rely on.

The "harness" surrounds the LLM with the environmental tools - like file systems and memory backends - needed to execute long-horizon tasks.


It is the environment where the agent lives. It is the scaffolding, the tools, the permissions, and the sandboxes we wrap around the foundational model.

If we want an agent to autonomously crawl the web, write code, run tests, and fix its own bugs over a month-long horizon, we have to teach it how to ruthlessly manage its own context.

Here is how we actually make that happen.

The power of context offloading

The golden rule of long-running agents is simple: Do not load what you do not immediately need.

Imagine your agent needs to find a specific event date buried in a massive HTML file. The amateur approach is to dump the entire raw HTML into the context window. Congratulations, you’ve just burned 8,000 tokens, cluttered the agent’s working memory, and invited hallucinations.

The professional approach is “context offloading”. Instead of giving the agent the HTML, you give the agent a secure sandbox - a temporary, isolated workspace where it can read, write, and execute code. You give it a goal; it decides what to do and in what order, like downloading the HTML to a file in that sandbox and running a simple terminal command (like grep) to search for the date.

The agent gets the exact answer it needs. It uses four tokens instead of 8,000. And its mental whiteboard remains perfectly clean.
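As a minimal sketch of this pattern in Python (the sandbox layout and the `offload_and_search` helper are illustrative assumptions, not part of any real agent framework):

```python
import subprocess
import tempfile
from pathlib import Path

def offload_and_search(html: str, pattern: str) -> list[str]:
    """Write bulky HTML to a sandbox file, then grep it so that only
    the matching lines - not the whole page - enter the agent's context."""
    sandbox = Path(tempfile.mkdtemp(prefix="agent-sandbox-"))
    page = sandbox / "page.html"
    page.write_text(html)

    # grep -oE prints only the matched substrings; exit code 1 just
    # means "no match", so we don't raise on it
    result = subprocess.run(
        ["grep", "-oE", pattern, str(page)],
        capture_output=True, text=True,
    )
    return result.stdout.splitlines()

matches = offload_and_search(
    "<html><body>Concert on <b>2026-05-08</b> in Dublin</body></html>",
    r"[0-9]{4}-[0-9]{2}-[0-9]{2}",
)
# A handful of tokens go back to the model instead of the full page
```

The same idea scales to any bulky tool output: persist it to disk first, then query it with cheap, precise tools.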

This sandbox approach changes everything. For agents that run for weeks, they can use their file system to save their own state. If an agent realizes its context window is hitting 80% capacity, it can proactively write a summary of its progress to a text file. When that specific instance of the agent hits its limit and dies, the next agent in the relay simply reads the summary file and picks up exactly where the last one left off. It’s an approach that has been popularized by Anthropic in Claude.
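The checkpoint-and-relay pattern can be sketched like this (the 80% threshold, the `progress.md` filename, and the naive word-count tokenizer are all assumptions for illustration):

```python
from pathlib import Path

CONTEXT_LIMIT = 100  # tokens; tiny on purpose, for illustration
CHECKPOINT = Path("progress.md")

def tokens_used(messages: list[str]) -> int:
    # Naive stand-in for a real tokenizer
    return sum(len(m.split()) for m in messages)

def maybe_checkpoint(messages: list[str], summary: str) -> bool:
    """At 80% of the context budget, persist a progress summary to disk
    so the next agent instance in the relay can resume from it."""
    if tokens_used(messages) >= 0.8 * CONTEXT_LIMIT:
        CHECKPOINT.write_text(summary)
        return True
    return False

def resume() -> str:
    """A freshly spawned agent starts by reading the last checkpoint."""
    return CHECKPOINT.read_text() if CHECKPOINT.exists() else "No prior state."
```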

Mental housekeeping: Compaction and summarization

Developers are inherently lazy. When we use AI coding assistants, we rarely start fresh, clean chat threads. We just keep iterating in the same window.

For an agent, this is fatal. If you asked an agent to read five files 20 minutes ago, those files - and the long tool-call outputs associated with them - are still sitting in its context window, silently degrading its ability to reason about your current request and costing users time and money.

To keep agents running indefinitely, the harness must perform mental housekeeping. We do this through compaction and summarization.

[Figure: context window length over time for a long-running agent]

Compaction is the act of automatically trimming the fat. Every few turns, the harness quietly reaches into the agent's context and deletes old, bulky tool responses, replacing them with a tiny note that says: "If you need this result again, it is saved in [File X]."

Notice how the context length drops sharply during regular “compaction” phases, and resets entirely during a “summarization”.
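A minimal sketch of compaction (the message format and the `store` dict are illustrative assumptions, not a real harness API):

```python
MAX_TOOL_OUTPUT = 200  # characters kept inline; longer outputs get offloaded

def compact(history: list[dict], store: dict) -> list[dict]:
    """Swap bulky tool responses for a pointer note, keeping the full
    output retrievable from the store if the agent needs it again."""
    compacted = []
    for i, msg in enumerate(history):
        if msg["role"] == "tool" and len(msg["content"]) > MAX_TOOL_OUTPUT:
            key = f"tool-output-{i}"
            store[key] = msg["content"]
            msg = {
                "role": "tool",
                "content": f"[Trimmed. If you need this result again, it is saved in {key}]",
            }
        compacted.append(msg)
    return compacted

store: dict = {}
history = [
    {"role": "user", "content": "Scrape the pricing page"},
    {"role": "tool", "content": "<html>" + "x" * 5000 + "</html>"},
]
history = compact(history, store)
```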

When compaction isn't enough and the context window inevitably fills up, we trigger summarization. The harness pauses the agent, hands its entire messy context to a secondary, cheaper model, and says something like: "Summarize what has been done and what needs to happen next." The harness then wipes the main agent’s memory entirely, inserts only that brief summary, and lets it start fresh.
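And summarization, sketched with a stand-in for the cheaper model (any real LLM client would take its place; the prompt and message shapes are assumptions):

```python
def summarize_and_reset(history: list[dict], cheap_model) -> list[dict]:
    """Hand the full messy context to a cheaper model, then restart the
    main agent with only the resulting summary."""
    transcript = "\n".join(m["content"] for m in history)
    summary = cheap_model(
        "Summarize what has been done and what needs to happen next:\n" + transcript
    )
    return [{"role": "system", "content": summary}]

# Stand-in for a real LLM call
fake_cheap_model = lambda prompt: "Crawled 3 sites; next: parse the product pages."

history = [{"role": "user", "content": "..."}] * 50
history = summarize_and_reset(history, fake_cheap_model)
```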

Defending the context layer with subagents

A good manager doesn't get bogged down in the weeds of a menial task; they delegate it. Deep agents should do the same.

When a long-running agent encounters a massive, complex sub-task - like running an exploration algorithm or testing a new piece of code - it shouldn't do that work in its main reasoning loop. That will cause massive context rot.


Instead, the harness should allow the main agent to spawn a temporary copy of itself.

The main agent writes a specific prompt for this "subagent," spins it up in an isolated environment, and waits. The subagent does the heavy lifting, burns through thousands of tokens, finds the answer, returns only the final result to the main agent, and then terminates.

This is how you defend the primary context layer. You keep the main agent's mind clear, focused solely on high-level orchestration, while disposable subagents take the cognitive hit of the dirty work.
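A toy sketch of this delegate-and-discard pattern (the `model` callable and the prompts are placeholders for a real agent loop):

```python
def run_subagent(prompt: str, model) -> str:
    """Spin up an isolated context for the sub-task, let the subagent
    burn tokens in it, and return only the final answer."""
    sub_context = [{"role": "user", "content": prompt}]
    # ...in a real harness the subagent would loop, call tools, and
    # grow sub_context by thousands of tokens here...
    answer = model(sub_context)
    return answer  # everything else dies with the subagent

def main_agent_step(task: str, model) -> str:
    """The orchestrator delegates the heavy lifting and keeps only the result."""
    result = run_subagent(f"Test this code and report the outcome: {task}", model)
    return f"Subagent reports: {result}"

# Stand-in for a real LLM call
fake_model = lambda context: "all 42 tests pass"
```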

Build less, understand more

Recently, I was talking to a developer who worked on Manus AI (an autonomous system recently acquired by Meta). He summarized the future of this space perfectly: "Build less, understand more".

For a long time, the instinct in AI engineering has been to micromanage the models. We write labyrinthine, 5,000-word system prompts outlining every possible edge case, hoping to control the agent's behavior.

But long-running autonomy doesn't come from a perfectly engineered prompt. It comes from the harness.

If you want an agent that can work for 30 days straight, stop trying to shove the entire world into its context window. Give it a file system. Give it a terminal. Give it the ability to delegate to subagents, summarize its own thoughts, and offload its memory to a sandbox.

If you give an AI the right environment, you won't need 10 million tokens for it to change the world.

© Zyte Group Limited 2026