In the web data-gathering industry, the relationship between scrapers and website operators is paramount.
Over the last couple of years, both parties have benefitted from improved tooling.
So, what is the state of that relationship right now, and how is it likely to evolve?
The rise of anti-bot technology
These days, many websites use sophisticated tools to detect and discourage web data extraction by analyzing key signals.
The network fingerprint: Hard to spoof, easy to detect
A key battleground is the network fingerprint.
Identifying network fingerprints is easy. With a wealth of open-source libraries, anyone can analyze IP addresses, TCP/UDP patterns, and TLS handshakes.
But on the scrapers’ side, it's a completely different story.
While obtaining clean IP addresses is no longer a major hurdle for data extraction professionals, thanks to affordable residential proxies, shaping the lower-level protocol fingerprint is another matter. Customizing TCP/UDP stacks requires writing complex, kernel-level code in languages like C++, demanding deep systems programming expertise. It’s not impossible, but it requires a lot of resources. The same is true for the TLS, HTTP/2, and HTTP/3 layers.
Mimicking a legitimate network fingerprint is no mean feat, which is why anti-bot systems currently have the upper hand at this layer.
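To see why the TLS layer is such a reliable signal, consider JA3, a widely used fingerprinting scheme: it reduces a client’s TLS ClientHello to an MD5 hash of the protocol version, cipher suites, extensions, curves, and point formats it offers. A minimal sketch (the ClientHello values below are illustrative, not taken from any real client):

```python
import hashlib

def ja3_fingerprint(version, ciphers, extensions, curves, point_formats):
    """Build a JA3-style fingerprint: an MD5 hash over the fields a client
    offers in its TLS ClientHello. Detectors compare this hash against
    known-browser hashes; any deviation marks the client as unusual."""
    ja3_string = ",".join([
        str(version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
        "-".join(map(str, curves)),
        "-".join(map(str, point_formats)),
    ])
    return hashlib.md5(ja3_string.encode()).hexdigest()

# Illustrative ClientHello values only (771 is the TLS 1.2 wire version).
browser = ja3_fingerprint(771, [4865, 4866, 4867], [0, 23, 65281], [29, 23], [0])
# A scraping stack that offers even one extra cipher hashes differently.
scraper = ja3_fingerprint(771, [4865, 4866, 4867, 255], [0, 23, 65281], [29, 23], [0])
print(browser != scraper)  # True: a single differing code changes the hash
```

This is why a clean residential IP alone isn’t enough: the fingerprint is computed from what the client’s networking stack sends, which is exactly the part that is hard to customize.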
The browser battlefield
Years ago, anti-bot systems began using JavaScript to collect hundreds of browser-based signals. In response, powerful open-source tools like Camoufox emerged. By modifying Firefox at the C++ level, it allows users to directly customize browser fingerprints like Canvas and WebGL signatures.
Because Camoufox is open source, anti-bot companies have already analyzed its signatures and now work against them. For extractors, the optimal solution may be to build their own custom browser - but this would require a team of experts and a massive investment.
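To make the canvas signal concrete: a canvas fingerprint is, at its core, a hash over the pixel buffer the browser renders, and even a one-unit anti-aliasing difference between GPU/driver stacks yields a completely different hash. A toy simulation with synthetic pixel data (no real rendering involved):

```python
import hashlib

def canvas_hash(pixels: bytes) -> str:
    """Anti-bot scripts render text and shapes to a hidden canvas, read the
    pixel buffer back, and hash it. Identical drawing code on different
    GPU/driver stacks produces subtly different pixels - and so different
    hashes - which is what makes the signal useful for identification."""
    return hashlib.sha256(pixels).hexdigest()

# Two simulated renders of the same scene: one pixel differs by a single
# unit, standing in for a real hardware/driver rendering difference.
render_a = bytes([120, 130, 140] * 100)
render_b = bytes([120, 130, 141] + [120, 130, 140] * 99)

print(canvas_hash(render_a) == canvas_hash(render_b))  # False
```

This also shows why spoofing is delicate: a tool like Camoufox must produce pixel output that is both internally consistent and unlike any published open-source signature.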
Meanwhile, anti-bot tooling continues to innovate with new techniques like audio fingerprinting, which analyzes unique hardware signatures from a user’s audio stack.
In other words, the complexity and cost of effective browser customization are giving the defense systems a clear advantage.
Bots learn new tricks
Anti-bot technologies have become increasingly effective stewards of their operators’ websites. But artificial intelligence tooling is also giving smart scrapers an effective new way to continue scraping.
The CAPTCHA problem
CAPTCHA puzzles frustrate crawlers even more than they do casual users. Today, however, scrapers that combine LLMs with sophisticated browser automation can handle most CAPTCHAs effectively.
Anti-bot systems are upping the ante:
- Some sites are experimenting with more outlandish CAPTCHA variants, like webcam-based gestures such as hand-waving - though these come at a clear cost to user experience.
- By analyzing engagement time, researcher Elisa Chiapponi is even experimenting with ways to detect “CAPTCHA farms”: services that employ large pools of low-paid human workers to solve CAPTCHAs manually on behalf of malicious systems.
However, novel defenses such as these are far from ready for prime time in a market that has grown used to handling CAPTCHAs effectively.
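As a rough illustration of how an engagement-time signal could work - this is my own toy heuristic, not Chiapponi’s actual method - a farmed session often shows near-zero dwell time before the challenge, followed by a long pause while the puzzle is relayed to a remote human solver:

```python
def looks_like_farm(page_dwell_s: float, solve_delay_s: float) -> bool:
    """Toy heuristic: flag sessions that spend almost no time on the page
    before the challenge appears (a bot jumped straight to it) yet take a
    long time to solve it (the puzzle was relayed to a human worker).
    Thresholds are invented for illustration."""
    return page_dwell_s < 2.0 and solve_delay_s > 8.0

sessions = [
    {"dwell": 14.2, "delay": 3.1},   # plausible organic visitor
    {"dwell": 0.3,  "delay": 21.7},  # relay-to-a-human pattern
    {"dwell": 9.8,  "delay": 4.4},   # plausible organic visitor
]
flags = [looks_like_farm(s["dwell"], s["delay"]) for s in sessions]
print(flags)  # [False, True, False]
```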
Our new super-hero: AI-powered scraping and maintenance
Websites evolve constantly, changing designs and restructuring HTML. This creates a maintenance nightmare, made worse by A/B testing that results in multiple versions of the same site.
But generative AI is scrapers’ new super-hero. Instead of just asking an LLM to directly extract data from a single page - which is too slow and expensive - we can ask it to write and, more importantly, debug scraper code for us.
For instance, I've developed an MCP (Model Context Protocol) server called scrapy-inspector that creates an automated debugging loop for Scrapy spiders:
- Run crawler: The agent runs a broken spider and confirms it fails.
- Capture context: A middleware records all requests, responses, and headers from the failed crawl.
- Evaluate and diagnose: The LLM uses this recorded data to test selectors and expressions, pinpointing the error without re-running the entire crawl.
- Fix crawler: The LLM automatically modifies the spider's code to fix the issue.
This agent iterates through the loop until the spider is fully functional. I don’t have to touch a single line of code. A task that might have taken me weeks in the past can now be completed in minutes. That’s an efficiency game-changer.
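The loop above can be sketched in a few lines. Everything here is illustrative - the function names and data shapes are stand-ins, not the real scrapy-inspector API - but the control flow is the point: diagnose against the captured response, never against the live site.

```python
# All names below are illustrative stand-ins, not the real scrapy-inspector API.

def run_spider(selector: str, captured_page: set) -> bool:
    """Stand-in for step 1: a 'crawl' succeeds iff the spider's selector
    matches something in the recorded page."""
    return selector in captured_page

def diagnose_and_fix(selector: str, captured_page: set) -> str:
    """Stand-in for steps 2-4: the LLM tests candidate selectors against
    the captured response, pinpointing a fix without a live re-crawl."""
    for candidate in ("h1.title", "h1.product-name", "h2.title"):
        if candidate in captured_page:
            return candidate
    return selector

def repair_loop(selector: str, captured_page: set, max_iters: int = 5) -> str:
    """Iterate run -> capture -> diagnose -> fix until the spider works."""
    for _ in range(max_iters):
        if run_spider(selector, captured_page):
            return selector  # fully functional: stop touching the code
        selector = diagnose_and_fix(selector, captured_page)
    return selector

# The site renamed its heading class, so the old selector no longer matches.
captured_page = {"h1.product-name", "div.price"}
fixed = repair_loop("h1.title", captured_page)
print(fixed)  # h1.product-name
```

The design choice that makes this fast is the middleware recording: selector experiments run against saved responses, so each loop iteration costs milliseconds instead of a full crawl.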
The state of play
So, what does the relationship between website owners, data gatherers and their technologies look like right now, and where are we going?
Cost intensification
First, the growing sophistication at play here, on both sides, means building a production-ready scraping stack now requires a massive investment in custom browsers, clean proxies, and expert teams. If you want to gather data at scale under your own steam, that initial investment is becoming punitive.
The barrier to entry is becoming impossibly high for smaller players. Only those with deep pockets can truly compete at scale. No wonder businesses are turning to dedicated scraping vendors.
New terms of engagement
Second, we may be moving toward a kind of closed internet. Companies that own data understand that AI providers are not only a challenge but also an opportunity.
Instead of playing endless cat-and-mouse games to block them, many are now trying a different approach: they want to make deals. By building authentication systems, they create the mechanism to turn data access into a valuable economic relationship.
Major websites may become accessible to the most important AI agents that have brokered deals, but others could be denied admission.
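Mechanically, a deal-gated access check could be as simple as matching an agent’s credential against a registry of commercial agreements. A hypothetical sketch - all tokens and partner names here are invented:

```python
from typing import Optional, Tuple

# Hypothetical registry: tokens issued to AI agents that have signed deals.
PARTNER_TOKENS = {
    "tok-agent-alpha": "SearchCo crawler (licensed)",
    "tok-agent-beta": "AssistantCo agent (licensed)",
}

def authorize(bearer_token: Optional[str]) -> Tuple[int, str]:
    """Return an (HTTP status, message) pair for an incoming agent request:
    known partners get data access, everyone else gets a refusal."""
    if bearer_token in PARTNER_TOKENS:
        return 200, f"access granted to {PARTNER_TOKENS[bearer_token]}"
    return 403, "no data-access agreement on file"

print(authorize("tok-agent-alpha")[0])  # 200
print(authorize(None)[0])               # 403
```

Real systems would layer cryptographic signing and rate limits on top, but the economics are visible even in this sketch: the token is the deal.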
For all sides of the industry, the tooling improved greatly in 2025. Let's see what 2026 brings us.