Can coding agents write scrapers?
Where browser-operating agents fall short, LLMs’ growing software development skills open the door to a more promising approach: having them generate the scraping code itself.
General-purpose coding agents like OpenAI’s Codex and Anthropic’s Claude Code can be genuinely useful for certain kinds of tasks—especially those that don’t fall directly in the developer’s comfort zone.
I may be skilled at machine learning engineering in Python, but if I need to do UI development in Angular, an agent lets me develop and test changes in a way I otherwise couldn’t.
Extra help required
Could agents help start web scraping projects by generating relevant crawlers, extractors, and tests?
We tried giving Codex and other agents real scraping tasks, and we’ve found meaningful improvements from equipping them with specialized support: proprietary scraping libraries, context-compaction strategies, and external tool calls for document handling.
So far, we are not using a wholly agentic approach; rather, we use specific workflows and a system that orchestrates multiple LLMs to generate extraction code, squeezing the most out of each model.
Token trouble
When you ask an LLM to extract content from a web page, the process burns “tokens” at both ends: the model must ingest the entire web page, analyze it, and generate the desired output. These LLM calls are typically the primary cost driver in an agentic workflow.
But HTML documents can run to megabytes in size, well beyond the context window of even the best LLMs. And even when a single document does fit, generating robust web scraping code requires analyzing multiple web pages, so an agent inevitably runs out of context.
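For a rough sense of scale, here is a minimal sketch that counts the tokens a raw HTML page would consume before any analysis happens. It assumes the open-source tiktoken tokenizer, and the URL is a placeholder:

```python
import urllib.request

import tiktoken  # open-source tokenizer; the encoding below is one common choice

# Placeholder URL for illustration; any large listing or product page will do.
url = "https://example.com/some-product-page"
html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")

# cl100k_base is the encoding used by several recent OpenAI models.
enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode(html)

print(f"{len(html):,} characters -> {len(tokens):,} tokens")
# A multi-megabyte page yields hundreds of thousands of tokens, so a
# handful of pages exhausts even a large context window.
```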
While an agent might use tools like "grep" to selectively read chunks of input documents, such tools are poorly suited to extracting data from HTML: extraction requires a global understanding of a page’s structure and the ability to traverse the document’s non-local tree structure.
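A contrived toy example of why line-oriented search falls short: a value split across nested elements is invisible to grep-style matching but trivial for tree-aware traversal. This sketch uses lxml:

```python
from lxml import html

snippet = """
<div class="price">
  <span>$</span>
  <span>19</span><sup>.99</sup>
</div>
"""

# A literal, line-oriented search finds nothing: the value is split
# across nested elements and lines.
print("$19.99" in snippet)  # False

# Tree-aware traversal reassembles it from the subtree's text nodes.
tree = html.fromstring(snippet)
print(tree.xpath('normalize-space(//div[@class="price"])'))  # $ 19.99
```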
At Zyte, we have resorted to radically simplifying the processed HTML, removing extraneous mark-up to make it much smaller. And since we're orchestrating multiple LLMs, we can ensure each of them receives just the context it needs to solve its task.
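To make that concrete, here is a minimal sketch of this kind of HTML simplification using lxml; the specific tags and attributes kept or dropped are illustrative assumptions, not Zyte's actual configuration:

```python
from lxml import etree, html

# Tags that rarely carry extractable data (an illustrative list).
NOISE_TAGS = ["script", "style", "noscript", "svg", "iframe", "template"]

# Attributes worth keeping because selectors and links depend on them.
KEEP_ATTRS = {"id", "class", "href", "src", "datetime"}

def simplify_html(raw: str) -> str:
    tree = html.fromstring(raw)
    # Drop comments and the entire subtrees of noise tags.
    etree.strip_elements(tree, etree.Comment, *NOISE_TAGS, with_tail=False)
    # Strip every attribute that doesn't help locate data
    # (inline styles, event handlers, tracking attributes, ...).
    for el in tree.iter():
        for attr in [a for a in el.attrib if a not in KEEP_ATTRS]:
            del el.attrib[attr]
    return etree.tostring(tree, encoding="unicode", pretty_print=True)
```

Pruning along these lines can often shrink a page dramatically while preserving the structure that selectors rely on.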
Domain (expertise) not found
The finer points of end-to-end scraping remain a tall order for general-purpose coding agents.
An experienced web scraping engineer knows how to design the most efficient strategy for data collection. They would inspect network activity, search for direct API opportunities, craft the most reliable selectors, spot embedded JSON, keep session management in mind, and invoke browser rendering only when necessary.
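To illustrate just one of those habits, spotting embedded JSON: many modern sites ship a page’s data as JSON-LD blocks or framework payloads such as Next.js’s __NEXT_DATA__, which an engineer pulls out directly instead of parsing rendered markup. A minimal sketch, assuming lxml and pages that follow these common patterns:

```python
import json

from lxml import html

def find_embedded_json(raw: str) -> list:
    """Collect JSON payloads embedded in a page's <script> tags."""
    tree = html.fromstring(raw)
    payloads = []
    queries = [
        # schema.org JSON-LD, common on product and article pages.
        '//script[@type="application/ld+json"]/text()',
        # Next.js serializes the full page state into this tag.
        '//script[@id="__NEXT_DATA__"]/text()',
    ]
    for query in queries:
        for node in tree.xpath(query):
            try:
                payloads.append(json.loads(node))
            except json.JSONDecodeError:
                pass  # malformed blocks are common in the wild; skip them
    return payloads
```

When such a payload exists, extraction code can read structured fields instead of brittle selectors, and it tends to survive cosmetic redesigns.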
The most advanced coding agents aren’t yet equipped with that domain-specific knowledge or toolset.