Claude Sonnet 4.6, released by Anthropic yesterday, is widely seen as a strong, balanced all-rounder for coding and reasoning.
But when it comes to one very specific job, generating robust web scraping extractors, it beats the models that were considered state-of-the-art at programming just a week ago - even Opus 4.6.
When we last ran our Web Scraping Copilot benchmark in November, Gemini 3 Pro was the clear winner, combining best-in-class scraping code quality with low complexity.
Since then, Anthropic has released two new models. So, when Sonnet 4.6 dropped on February 17, Zyte’s R&D team re-ran our test with an expanded dataset and updated scoring.
The verdict: A new state-of-the-art
In our updated evaluation, Claude Sonnet 4.6 produced the best overall data extraction quality of all models we used in Web Scraping Copilot, Zyte’s new Visual Studio Code extension designed to help data engineers use AI to build extractors faster.

The gap is not huge; Gemini 3 Pro is essentially neck-and-neck. But Sonnet 4.6 now edges it out on our main metric, output quality, and basically matches it on code complexity.
But there is another factor where Sonnet 4.6 beats Gemini 3 Pro hands down: it is 3.5x faster at producing the code, which is critical for sessions in an IDE - after all, there are only so many coffee breaks one can take in a day!
The surprise: Sonnet beats Opus
What surprised us most: Sonnet 4.6 performs substantially better than Claude Opus 4.6, launched on February 5, even though Opus is often considered the strongest Claude model for coding, and Opus 4.6 in particular is widely regarded as a game-changer in the industry.
That's because most of the improvements in recent models have come in agentic coding, while our benchmark stress-tests HTML understanding and long-context handling, as well as general coding skill.

Sonnet 4.6 also makes a big jump over its predecessor Sonnet 4.5, despite being only one version newer.
This suggests the gains are not just incremental tuning. Something meaningful changed for this particular class of tasks.
The benchmarks: How we measured it
To see which model truly writes the best scraping code, we measured the models across three key engineering metrics (a rough sketch of how the first two can be computed follows the list):
ROUGE-1 F1 (adjusted): Our main metric for the quality of the code generated inside Web Scraping Copilot. Higher is better. Specifically, we generate code that extracts data according to Zyte API's product and productNavigation schemas.
SLOC (Source Lines of Code): A measure of verbosity. We calculate how much executable code is generated per field. In scraping, concise code is generally more robust and easier to read and maintain. Lower is better.
Time per extractor (seconds): How much time it takes to generate and test the full extractor. Lower is better.
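As promised above, here is a rough sketch of how the first two metrics can be computed. It is illustrative rather than our exact harness: the flat field-to-value data layout and the naive line counting are simplifications, and the only methodological detail carried over from this post is that the adjusted score weighs every tested extractor equally.

```python
# Rough sketch of the two code metrics; illustrative only, not our exact harness.
# Assumes each extractor's output and its reference data are flat dicts that
# map field name -> string value (real extractors return richer structures).
from rouge_score import rouge_scorer  # pip install rouge-score

scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=False)


def extractor_rouge1_f1(extracted: dict, reference: dict) -> float:
    """Mean ROUGE-1 F1 across the reference fields of a single extractor."""
    scores = [
        scorer.score(ref_value, str(extracted.get(field, "")))["rouge1"].fmeasure
        for field, ref_value in reference.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0


def rouge1_f1_adj(results: list) -> float:
    """Adjusted score: every tested extractor contributes equally to the mean,
    no matter how many fields or pages it covers."""
    per_extractor = [extractor_rouge1_f1(extracted, reference) for extracted, reference in results]
    return sum(per_extractor) / len(per_extractor)


def sloc_per_field(source_code: str, n_fields: int) -> float:
    """Naive SLOC per field: non-blank, non-comment lines divided by field count."""
    sloc = sum(
        1
        for line in source_code.splitlines()
        if line.strip() and not line.strip().startswith("#")
    )
    return sloc / n_fields
```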
Here are the headline results from the current run:
| Model | SLOC per attribute | ROUGE-1 F1 (adjusted) | Time per extractor (secs) |
|---|---|---|---|
| claude-sonnet-4.6 | 19.59 | 0.8348 | 171 |
| gemini-3-pro | 19.16 | 0.8330 | 611 |
| claude-opus-4.6 | 19.97 | 0.8202 | 170 |
| gemini-3-flash | 23.45 | 0.8165 | 214 |
| gpt-5.2-codex | 26.28 | 0.7746 | 154 |
| gpt-5-mini | 46.82 | 0.7467 | 173 |
| claude-sonnet-4.5 | 18.95 | 0.7024 | 153 |
A note on comparability: this evaluation is not directly comparable to the one we published previously, because we expanded the dataset we used to test LLM scraping and we adjusted the metric calculation to weigh all tested extractors equally.
What are we really testing?
To be clear: you don't get any of these results by using the bare LLM models in a plain code editor. None of them achieves this on its own, because no major LLM is tuned for the unique demands of web scraping.
Rather, these results are achieved by using the models in Web Scraping Copilot, Zyte’s free Visual Studio Code extension for building and managing Scrapy spiders.
Web Scraping Copilot integrates with GitHub Copilot and includes Zyte’s secret sauce - specialist scraping know-how that guides LLMs to generate the kind of extraction code that data engineers actually need, including auto-generating parsing code for target pages.
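To make that concrete, here is a hypothetical example of the kind of concise, per-field parsing code the benchmark rewards. The selectors and page structure are invented, and the field names only loosely follow Zyte API's product schema; real Copilot output depends entirely on the target site.

```python
# Hypothetical product-page parser: the selectors and page structure are
# invented, and the field names only loosely follow Zyte API's product schema.
from parsel import Selector  # the selector library used by Scrapy


def parse_product(html: str, url: str) -> dict:
    sel = Selector(text=html)
    return {
        "url": url,
        "name": (sel.css("h1.product-title::text").get() or "").strip(),
        "price": sel.css("span.price::text").re_first(r"[\d.,]+"),
        "currency": sel.css("span.price .currency::text").get(),
        "sku": sel.css("[itemprop=sku]::text").get(),
        "brand": sel.css("[itemprop=brand]::text").get(),
    }
```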
Which model should you pick?
Which model should you use inside Web Scraping Copilot? The answer today is clear:
If you want the absolute best extraction quality today, Claude Sonnet 4.6 is our top pick based on this benchmark: it delivers the best quality of extracted values, concise code, and fast generation.
If you want to save on costs, consider Gemini 3 Flash through its API or GPT-5-mini through your GitHub Copilot subscription - but both sacrifice some quality.
New models come out every week. Web Scraping Copilot is model-agnostic, so you are not locked into a single LLM: you can switch models depending on whether you are optimizing for quality, simplicity, or cost.
The cost equation
Based on these results, what we cannot recommend for this particular use case is Claude Opus 4.6. It does not outperform Sonnet 4.6, and it is more expensive for those who use Web Scraping Copilot with their own LLM API keys after exceeding their GitHub Copilot usage.
On cost, Claude Sonnet 4.6 and Gemini 3 Pro land in a similar range. If you are bringing your own API key, Sonnet 4.6 is cheaper per domain in this benchmark run.
Costs in your actual usage will depend heavily on the complexity of your data schema - our dataset is mostly based on a complex product schema.
| Model | API cost per domain ($) | Premium requests per domain |
|---|---|---|
| gpt-5-mini | $0.39 | 0 |
| gpt-5.2-codex | $2.35 | 55.2 |
| gemini-3-flash | $0.95 | 51.3 |
| gemini-3-pro | $5.67 | 55.7 |
| claude-sonnet-4.5 | $3.49 | 50.5 |
| claude-sonnet-4.6 | $3.94 | 53.2 |
| claude-opus-4.6 | $5.74 | 148.9 |
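To put those per-domain figures in perspective, here is the back-of-the-envelope arithmetic for a hypothetical 50-domain project, using the API costs from the table above.

```python
# Back-of-the-envelope projection: per-domain costs come from the table above,
# while the 50-domain project size is a made-up example.
per_domain_cost = {
    "claude-sonnet-4.6": 3.94,
    "gemini-3-pro": 5.67,
    "claude-opus-4.6": 5.74,
}
domains = 50
for model, cost in per_domain_cost.items():
    print(f"{model}: ~${cost * domains:,.2f} for {domains} domains")
# claude-sonnet-4.6: ~$197.00, gemini-3-pro: ~$283.50, claude-opus-4.6: ~$287.00
```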
Next up, we are looking forward to testing OpenAI’s GPT-5.3-Codex when it's released in the OpenAI API.
Get Web Scraping Copilot
If you haven’t yet discovered Web Scraping Copilot, install it now: