PINGDOM_CHECK

#ExtractSummit2026 The world's largest web scraping conference returns. Austin Oct 7–8 · Dublin Nov 10–11.

Register now
Data Services
Pricing
Login
Try Zyte APIContact Sales
  • Unblocking and Extraction

    Zyte API

    The ultimate API for web scraping. Avoid website bans and access a headless browser or AI Parsing

    Ban Handling

    Headless Browser

    AI Extraction

    SERP

    Enterprise

    DocumentationSupport

    Hosting and Deployment

    Scrapy Cloud

    Run, monitor, and control your Scrapy spiders however you want to.

    Coding Agent Add-Ons

    Agentic Web Data

    Plugins that give coding agents the context to build production Scrapy projects. Starts with Claude Code.

  • Data Services
  • Pricing
  • Browse

    • BlogArticles, podcasts, videos
    • Case studiesCustomer outcomes
    • White papersIn-depth reports
    • EventsConferences, webinars, recordings

    Subscribe

    • NewsletterSwiftly delivered
    • Discord communityExtract Data community
  • Product and E-commerce

    From e-commerce and online marketplaces

    Data for AI

    Collect and structure web data to feed AI

    Job Posting

    From job boards and recruitment websites

    Real Estate

    From Listings portals and specialist websites

    News and Article

    From online publishers and news websites

    Search

    Search engine results page data (SERP)

    Social Media

    From social media platforms online

  • Meet Zyte

    Our story, people and values

    Contact us

    Get in touch

    Support

    Knowledge base and raise support tickets

    Terms and Policies

    Accept our terms and policies

    Open Source

    Our open source projects and contributions

    Web Data Compliance

    Guidelines and resources for compliant web data collection

    Join the team building the future of web data
    We're Hiring
    Trust Center
    Security, compliance & certifications
Login
Try Zyte APIContact Sales

Zyte Developers

Coding tools & hacks straight to your inbox

Become part of the community and receive a bi-weekly dosage of all things code.

Join us
    • Zyte Data
    • News & Articles
    • Search
    • Social Media
    • Product
    • Data for AI
    • Job Posting
    • Real Estate
    • Zyte API - Ban Handling
    • Zyte API - Headless Browser
    • Zyte API - AI Extraction
    • Web Scraping Copilot
    • Zyte API Enterprise
    • Scrapy Cloud
    • Solution Overview
    • Blog
    • Webinars
    • Case Studies
    • White Papers
    • Documentation
    • Web Scraping Maturity Self-Assesment
    • Web Data compliance
    • Meet Zyte
    • Jobs
    • Terms and Policies
    • Trust Center
    • Support
    • Contact us
    • Pricing
    • Do not sell
    • Cookie settings
    • Sign up
    • Talk to us
    • Cost estimator
All articles
AI-assisted data extraction28, 28 articles
Data gathering for AI6, 6 articles
Large Language Models (LLMs)24, 24 articles
Tool-assisted coding3, 3 articles
Developer interest143, 143 articles
Integration13, 13 articles
Open-source96, 96 articles
Scraping practice59, 59 articles
Scraping strategy46, 46 articles
Anti-ban35, 35 articles
Traffic6, 6 articles
Web data application25, 25 articles
Web data collection358, 358 articles
Web data collection ethics3, 3 articles
Web data collection legality16, 16 articles
Web scraping APIs63, 63 articles
Zyte API59, 59 articles
Scrapy48, 48 articles
Scrapy Cloud10, 10 articles
Web Scraping Copilot12, 12 articles
AI & Machine Learning1, 1 articles
Automotive2, 2 articles
E-commerce & retail26, 26 articles
Entertainment & Streaming2, 2 articles
Financial Services8, 8 articles
Government2, 2 articles
Market Research & Intelligence3, 3 articles
Media & publishing8, 8 articles
Real Estate2, 2 articles
Recruitment & HR3, 3 articles
Transportation & Logistics2, 2 articles
Travel & hospitality2, 2 articles
Extract Summit25, 25 articles
PyCon1, 1 articles

Appearance

Discord Community
BlogAI-assisted data extractionWhy AI agents struggle with web scraping (and how to help them)
ArticleTutorial / How-toAI-assisted data extraction

Why AI agents struggle with web scraping (and how to help them)

Explore the key challenges AI agents face in web scraping and how Zyte’s Web Scraping Copilot boosts automation, accuracy, and developer productivity.

I

Ivan Sanchez

10 min read · November 17, 2025

Why AI agents struggle with web scraping (and how to help them)

Industry analysts have placed AI agents right at the peak of the technology Hype Cycle, with concerns emerging about inflated expectations and a forecast that almost 40% of agentic projects will get scrapped - but they are not a passing trend.

An agent is simply a software entity that observes its environment, reasons on that observation and acts upon the environment.

Ten years ago, that "reasoning" step might have been just some hard-coded software rules. The game-changer today is that agents now have the powerful brain of Large Language Models (LLMs) to handle the reasoning part. This is what is bringing agents to a whole new level.

The adoption rate is staggering. The 2025 Stack Overflow Developer Survey shows that 84% of respondents are using or planning to use AI tools in their development process, an increase from 76% the previous year. At this rate, we'll soon be at a point where nearly every developer is using AI tools. This is a massive time saver and something that companies shouldn't be underestimating.

Web scraping is still hard for agents

But, while AI agents have proven to be a massive boon for general software development, many are asking if that impact translates directly to the world of web scraping. The simple answer, right now, is “no”.

Web scraping presents a set of unique and difficult challenges that push current AI agents to their limits. The complex, dynamic, and intricate nature of the web makes it a struggle.

Here are some of the core problems:

Challenge

Long HTML

Access restrictions

Long "Trial and Error" Loops

Adaptability

Reproducibility

Scalability

Human-in-the-Loop

Safety

Why it's a problem for Agents

The sheer size of modern web pages is a major hurdle. With many modern web pages containing hundreds of thousands of tokens of HTML, and some exceeding a million, target HTML content often exceeds the context window of even the most powerful LLMs. An agent can't reason about what it can't see.

What do you do when you want to scrape something? You open a browser, go to the page, and—bam—you get a CAPTCHA. For an agent, that's game over. Overcoming bans, CAPTCHAs, and geo-limitations is a fundamental requirement that stumps many autonomous systems.

Without a clear path, agents can get stuck in long, repetitive loops of trying something, failing, and trying again. This is not only inefficient but also quickly exhausts the agent’s limited context, causing it to lose track of its original goal.

Websites change constantly. One day, your spider works perfectly; the next, a small front-end update has broken all your selectors. We humans can adapt to these changes almost instantly, but an agent needs a robust monitoring and adaptation mechanism to handle this dynamism.

With non-deterministic elements like A/B tests, client-side rendering, and lazy-loading, the same URL can present different HTML on subsequent requests. This lack of consistency makes it incredibly difficult for an agent to create a reliable, reproducible scraping process.

It's one thing to use an agent to help write a single spider. But how do you scale that to write 1,000 custom spiders? An interactive, one-at-a-time process doesn't scale.

Given the issues above, some human oversight is often necessary. But how often? Finding the right balance between automation and manual intervention is a non-trivial trade-off between flexibility and efficiency.

How do you ensure an agent doesn't do something it's not supposed to? The risk of prompt injection attacks and other security vulnerabilities is a serious concern when granting an agent autonomous capabilities.

Helping agents scrape: Zyte’s research and experience

At Zyte, we've been tackling these challenges head-on. In late 2024, we started by creating our own "mini-agent" to automatically write data extraction code. The workflow went like this:

This API was used internally by our developers, who found it would save significant time. But, with standalone agents becoming more powerful, we asked ourselves: “Is this ‘mini-agent’ app still worth it? Couldn’t we just ask a powerful agent like GitHub Copilot to do the whole of the scraping writing?”

We tried it and found that even the most powerful agent with such big instruction files is not able to even come close to what real end-to-end fully autonomous scraping should look like. For example, the agent would often ignore the specific data schema format we provided.

This led us to a better approach: delegation.

All of these issues are solved if we allow an agent through GitHub Copilot to delegate the hard, specific parts of the task—like code generation—to our standalone "mini-agent".

But then the question was: “How do we make that tool available to the main agent?” Putting it behind an API and requiring users to sign up and use another key felt clunky. We wanted the code editor to have direct access to the code as it was being written.

MCP sampling with local efficiency

The solution we found was MCP sampling.

The idea is simple but powerful. Previously, our server was making its own (expensive) calls to an LLM provider like OpenAI.

Instead, we package the logic of our "mini-agent" so it can run locally on the user's machine. The tool then routes its requests through the LLM a user’s GitHub Copilot already has access to. The client "samples" the prompts from our tool, gets the responses, and returns them. This makes the process cheaper and more transparent, as all the code is running locally.

Of course, this approach has limitations. The LLM completions provided by tools like VS Code have different context limits than the native models. For example, GPT-5 in VS Code Copilot has a max context of 128k tokens, whereas the OpenAI API version has 400k.

This is why our internal process of HTML simplification is so critical to making the workflow successful.

Introducing Web Scraping Copilot

This work has led us to release Web Scraping Copilot, a free VS Code extension that helps you generate web scraping code with GitHub Copilot.

It streamlines working with Scrapy projects and includes optional integration with Scrapy Cloud, making it easier to deploy and monitor your web scraping jobs.

For us, at Zyte, getting to “simple” was hard. We had to refactor, for Visual Studio Code, a complete project that was already being used locally. We had to face the need to integrate at the same time as adopting.

Now, however, Web Scraping Copilot bundles all our above work, including HTML simplification.

Zyte test developers have reported a huge increase in productivity.

"In the past, when I needed to create a spider from scratch using a standard product schema, it took me eight hours. Now, in two hours I'm done. [Web Scraping Copilot API] makes a massive difference."

Test user, Web Scraping Copilot

Going beyond: The future of agents in web scraping

This is just the start. At Zyte, we are constantly researching and developing to take advantage with the incredible speed of AI innovation. This includes:

  • Dynamic crawling planner: We're working on a system that can create a plan for crawling complex websites based on a simple user query. This would be a great starting point for agents to then create the spiders.

  • Improving Web Scraping Copilot: We continue to refine our tool to produce higher quality, more customizable code that adapts to a user's existing codebase and needs.

  • Adapting our existing tools: We are adapting our powerful open-source tools, like web-poet, to be more usable by LLMs. For example, Zyte's standard data schemas now have an llmHint parameter to provide better guidance to agents.

  • Alternative approaches: We're exploring alternative ways to structure tools and MCPs with different scopes, allowing for more advanced capabilities like automatically finding the best browser configuration for a given website.

Research drives relevance

Web scraping is hard, even for agents. The challenges are numerous and complex. However, delegating these hard, long tasks to specialized sub-agents or tools is a very effective strategy. We found that trying to make one monolithic agent do everything was less successful than a modular approach.

This field is evolving very, very quickly. New capabilities are possible with every update. It's critical for anyone in this space to stay tuned, play around with new features, and not get left behind.

At Zyte, we are working closely with these new technologies. Our goal is to apply them to automate fully customizable web scraping, tackling the hard problems so you don't have to.

Try Zyte API

Build your first scraper in minutes

Free trial, no credit card. From a single request to production in an afternoon.

Get started
AI-assisted data extraction
I

Ivan Sanchez

More from this author

In this article

  • Web scraping is still hard for agents
  • Challenge
  • Why it's a problem for Agents
  • Helping agents scrape: Zyte’s research and experience
  • MCP sampling with local efficiency
  • Introducing Web Scraping Copilot
  • Going beyond: The future of agents in web scraping
  • Research drives relevance

Follow

Get the latest

Zyte and the data web in your inbox — or wherever you already are.

Subscribe

Or follow elsewhere

The Community · Newsletter

The best of Zyte and the data web, in your inbox.

One curated edition — new articles, product updates, and the stories shaping the data web. No noise.

G2.com

Capterra.com

Proxyway.com

EWDCI logoMost loved workplace certificateZyte rewardISO 27001 iconG2 rewardG2 rewardG2 reward

© Zyte Group Limited 2026