
Agentic web scraping: Hype, reality and what happens next

Read Time
2 mins
Posted on
September 9, 2025
From broken parsers to context limits, today’s AI agents have real challenges—but with the right tools and orchestration, they could reshape how we extract web data.
By
Konstantin Lopukhin

Three years after they first burst onto the scene, we are now in a world where Large Language Models (LLMs) aren’t just for chatting anymore.


“Agents” promise to think, plan and act on our behalf in the real world, and to verify their own work – whether that means writing and running code, or handling everyday tasks.


Instead of the user laboriously pasting errors back and forth, an agent can iterate by itself – writing code, running tests, checking outputs, and improving its solution in a feedback loop.


When I first started playing with AI agents, I wasn’t thinking about replacing developers, or about the hype cycles that tend to follow new AI tools. I was thinking about something much simpler: could these systems actually help us do web scraping better?

Browser agents’ growing pains


At first glance, web scraping should be a natural fit for agents: it’s a process that’s tedious to code by hand, highly repeatable, and benefits from automation. But in its first wave, agentic web scraping faces several challenges.


Bots block bots


OpenAI made waves with the release of its consumer-grade Agent tool, which can use a virtual computer and browser by itself.


Our team reviewed its potential for web data collection, but found that it was blocked by many of the sites we sent it to.


That is because it uses a remote web browser that is easy to detect and cannot currently handle the sophisticated anti-bot technologies that a growing number of sites now deploy.


The trouble with scale


Access is not agents’ only challenge when it comes to scraping. Scale matters, too.


If you only need to gather a few bicycle prices for comparison, getting an agent to browse on your behalf makes perfect sense.


But most teams doing web data extraction need to gather many fields across many sites, refreshed regularly. Pointing an LLM-driven browser at every page, every time, is slow, brittle, and costly. When it comes to scraping, you actually don’t want your computer to mimic a human.

Can coding agents write scrapers?


Where browser-operating agents fall short, LLMs’ growing software development skills open the door to a more promising approach: having them generate the scraping code itself.


General-purpose coding agents like OpenAI’s Codex and Anthropic’s Claude Code can be genuinely useful for certain kinds of tasks—especially those that don’t fall directly in the developer’s comfort zone.


I may be skilled at machine learning engineering in Python, but if I need to do UI development in Angular, the agent lets me develop and test changes in a way I could not have done on my own.


Extra help required


Could agents help start web scraping projects by generating relevant crawlers, extractors, and tests?


We tried giving Codex and other agents real scraping tasks, and found meaningful improvements when we equipped them with specialized tools: proprietary scraping libraries, context compaction strategies, and document handling split out into external tool calls.


So far, we are not using a wholly agentic approach but, rather, specific workflows and a system of orchestrating multiple LLMs to generate extraction code, to squeeze the most out of each model.


Token trouble


When you ask an LLM to extract content from a web page, the process burns “tokens” on both input and output ends – the AI needs to ingest the entire web page, analyze it and re-generate the desired output. LLM calls are typically the primary cost driver in an agentic workflow.


But HTML documents can run to megabytes in size—well beyond the context window of even the best LLMs. And, even when HTML documents fit into the context, generating robust web scraping code requires analysis of multiple web pages, so an agent inevitably runs out of context.
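
To get a feel for the numbers, you can measure a page’s token footprint directly. Below is a small sketch using the tiktoken library; the cl100k_base encoding and the 128,000-token window are assumptions, since tokenizers and context limits vary by model, but at roughly four characters per token a one-megabyte page already runs into the hundreds of thousands of tokens.

```python
# A small sketch of measuring how much of a model's context window a single
# raw HTML page would consume. The encoding name and the 128,000-token
# window are assumptions; actual limits vary by model.
import tiktoken

def context_share(raw_html: str, window: int = 128_000) -> float:
    enc = tiktoken.get_encoding("cl100k_base")
    # disallowed_special=() keeps encode() from raising if the HTML happens
    # to contain strings that look like special tokens.
    tokens = len(enc.encode(raw_html, disallowed_special=()))
    return tokens / window
```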


While an agent might use tools like "grep" to selectively read chunks of input documents, such tools are poorly suited to extracting data from HTML, which requires a global understanding of a page’s structure and the ability to traverse the document’s non-local tree structure.


At Zyte, we have resorted to radically simplifying the processed HTML, removing extraneous mark-up to make it much smaller. And since we're orchestrating multiple LLMs, we can ensure each of them receives just the context it needs to solve its task.
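
As an illustration only – Zyte’s actual cleaning rules aren’t public – a minimal sketch of this kind of simplification with lxml might look like the following, dropping non-content nodes and keeping just the attributes that are useful as selector hooks:

```python
# A minimal sketch of HTML simplification: drop nodes that inflate token
# counts without carrying extractable content, and strip most attributes.
# The attribute whitelist is an illustrative choice, not Zyte's.
from lxml import etree, html

def simplify_html(raw_html: str) -> str:
    tree = html.fromstring(raw_html)

    # Remove scripts, styles, comments and similar non-content nodes.
    etree.strip_elements(tree, etree.Comment, "script", "style", "noscript",
                         "svg", with_tail=False)

    # Keep only attributes that help when crafting selectors.
    keep = {"id", "class", "href", "src", "datetime"}
    for el in tree.iter(etree.Element):
        for attr in [a for a in el.attrib if a not in keep]:
            del el.attrib[attr]

    return html.tostring(tree, encoding="unicode")
```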


Domain (expertise) not found


The finer points of end-to-end scraping remain a tall order for general-purpose coding agents.


An experienced web scraping engineer knows how to design the most efficient strategy for data collection. They would inspect network activity, search for direct API opportunities, craft the most reliable selectors, spot embedded JSON, keep session management in mind, and invoke browser rendering only when necessary.
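
To make one of those points concrete: “spotting embedded JSON” often means reading structured data straight out of a script tag instead of scraping rendered markup. Here is a minimal sketch using parsel, the selector library behind Scrapy; the Product type check and field names are illustrative assumptions.

```python
# A minimal sketch of reading embedded JSON-LD rather than scraping rendered
# markup. The "Product" type check and the field names are illustrative.
import json
from parsel import Selector

def product_from_jsonld(raw_html: str) -> dict | None:
    sel = Selector(text=raw_html)
    for blob in sel.css('script[type="application/ld+json"]::text').getall():
        try:
            data = json.loads(blob)
        except json.JSONDecodeError:
            continue
        if isinstance(data, dict) and data.get("@type") == "Product":
            offers = data.get("offers") or {}
            return {
                "name": data.get("name"),
                "price": offers.get("price") if isinstance(offers, dict) else None,
            }
    return None
```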


The most advanced coding agents aren’t yet equipped with that domain-specific knowledge or toolset.

Zyte’s way ahead


When it comes to scraping at scale, agentic effectiveness may be just around the corner. Right now, these tools are still best seen as a supplement, not a replacement for engineering skill.


Their outputs demand review. It’s rare—though not unheard of—for an agent to deliver a perfect, end-to-end solution unaided.


We are intensely interested in the potential for agentic scraping, which could be a huge accelerant for our customers.


Should we push ahead with fully agentic workflows, accepting that the quality will sometimes fall short, or should we maintain human-supervised control, guaranteeing quality but slowing down the march toward autonomy?


The trade-offs are real. Customers with strict accuracy requirements won’t tolerate faulty data.


We have decided to start with quality, then scale through autonomy.


We lean on specialized orchestration: simplifying HTML while preserving key elements, offloading distinct scraping tasks to separate MCP tools, and ensuring the main agent’s context remains unpolluted. This lets us control exactly what each model sees and how it’s processed.
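
In spirit, that orchestration looks less like a free-roaming agent and more like a fixed pipeline of narrow LLM calls. The sketch below is hypothetical – call_llm is a placeholder rather than a real Zyte API, and the prompts and steps are illustrative – but it shows the key idea: each call sees only the small, task-specific context it needs.

```python
# A hypothetical sketch of orchestrated, single-purpose LLM calls. `call_llm`
# stands in for whatever model client is in use; the prompts, step order and
# the "LGTM" convention are illustrative, not Zyte's actual pipeline.

def call_llm(prompt: str, context: str) -> str:
    raise NotImplementedError("placeholder for a real model client")

def build_extractor(simplified_pages: list[str], fields: list[str]) -> str:
    # Step 1: propose selectors from a single simplified page.
    selectors = call_llm(
        f"Propose CSS selectors for these fields: {fields}",
        context=simplified_pages[0],
    )
    # Step 2: turn the selectors into extraction code, without re-reading HTML.
    code = call_llm("Write a parse function using these selectors.",
                    context=selectors)
    # Step 3: review the code against a second simplified page; only this
    # call sees both the code and that page.
    review = call_llm("Review this parser against the page. Reply LGTM or list fixes.",
                      context=code + "\n\n" + simplified_pages[1])
    if review.strip() == "LGTM":
        return code
    # Step 4: apply fixes with only the code and the review in context.
    return call_llm("Apply these fixes to the parser.",
                    context=code + "\n\n" + review)
```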


It’s a more “hard-coded” approach today—but one we can gradually open up to more agentic control as the technology matures, without lowering the quality bar our customers depend on.

The road to agentic scraping


The future many are dreaming of involves pointing your agent at the data you need and sitting back while it retrieves it efficiently, at scale, without you worrying about bans or site changes.


Getting there will require:


  1. Equipping agents with robust, domain-specific scraping MCP tools and training them to orchestrate with minimal human intervention.

  2. Implementing context engineering to ensure agents know which “memory” matters.

  3. Building tools and interfaces that go beyond chat—combining code inspection with live previews of extracted data, visualising navigation paths, and offering quick controls to adjust strategies and selectors.


Agents can deliver real value for web scraping when you play to their current strengths and evolve the architecture step by step.


With targeted engineering, robust quality guardrails, and a nimble approach to adjusting autonomy, they can grow into a powerful force for extracting web data.


This is the vision Zyte is building toward: making web data collection faster, more flexible, and more reliable than ever—whether the work is done by a human, an agent, or both working in tandem.
