Web Data Extract Summit 2024: What Did You Miss?

What did you miss from the sixth Web Data Extract Summit in Austin, TX? Find out business and technical insights from experts in web data extraction.

Theresia Tanzil · Content Writer

11 min read · January 3, 2025

Web Data Extract Summit 2024 Recap

The 2024 Web Data Extract Summit celebrated both its debut in Austin, USA and its sixth year since launching in 2019.

The two-day event began with a day of hands-on technical workshops, followed by an action-packed second day of curated sessions:

  • Four talks explored AI-related applications of web data,

  • Two talks addressed business strategies for leveraging web data,

  • Three sessions focused on the infrastructure driving web scraping operations, and

  • Two sessions delved into legal, ethical, and compliance considerations.

Before we move on, let’s acknowledge that you may be exhausted from reading and hearing about AI at this point. However, its potential to impact your bottom line is difficult to ignore, especially if you’re leveraging web data extraction in any capacity.

What are others doing with their data extraction practices? What opportunities could you be missing? And most importantly, what should you be doing differently to stay ahead?

We hear you, and we get the AI fatigue. That’s why our team at Zyte made sure to cover all the bases, spanning the technical, business, and legal aspects of web scraping, when we curated this year’s Extract Summit lineup of 11 talks.

Each technical talk offered a unique perspective and complemented the others well. However, we noticed several recurring themes. Keep reading to find out what those are.

To provide a clear overview, we've divided this recap into two main sections: one focusing on technical insights and the other on business implications.

Technical Insights for Developers Doing Web Data Extraction

Here are the five recurring topics across the technical talks:

  1. Infrastructure as a service: We now have increasingly intelligent and distributed computational infrastructure, all available on demand. Three talks covered this topic, spanning proxies, browsers, and distributed compute.

  2. Tapping into AI: We get a view of how AI changes the economics of build vs buy at Zyte and how Neelabh Pant’s team at Walmart uses AI agents to streamline their data pipeline orchestration.

  3. Managing LLM costs: At this moment, LLMs + HTML = pricey. How to best manage this? How does this change with multimodal models? These are some of the motivating questions that Iván Sánchez from Zyte and Asim Shrestha from Reworkd unpacked in their talks.

  4. Domain-specificity in prompt engineering and evals: To unlock the potential of LLMs for your data, you still need people with domain knowledge. Neelabh shared how designing domain-specific prompts and leveraging AI agents landed his team at a sweet spot, and Asim from Reworkd highlighted the importance of writing domain-specific evals to simplify the problem space.

  5. Retrieval-augmented generation (RAG): Neelabh highlighted how RAG helped his team in identifying top similar products, stressing the need for careful experimentation with the number of items retrieved to maintain contextual accuracy and relevance. Jan from Apify then positioned RAG as a game-changer for commercial LLM applications. He demonstrated a website content crawler that is integrated with RAG pipelines and a vector database backend, Pinecone.
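To make the retrieval step in item 5 a little more concrete, here is a minimal sketch of top-k similarity retrieval, the part of a RAG pipeline where the number of retrieved items gets tuned. It is not the code from either talk: the toy bag-of-words `embed` function, the `top_k` helper, and the sample catalogue are all assumptions made up for illustration. A production pipeline would more likely use learned embeddings and a vector database such as Pinecone.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_k(query: str, documents: list[str], k: int) -> list[str]:
    """Return the k documents most similar to the query, best first."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


# Hypothetical catalogue entries, invented for this example.
catalogue = [
    "red cotton t-shirt",
    "blue cotton t-shirt",
    "stainless steel water bottle",
    "red wool sweater",
]

# Varying k changes how much context would be passed on to the LLM.
for k in (1, 2, 3):
    print(k, top_k("red cotton shirt", catalogue, k))
```

Raising `k` pulls in progressively weaker matches, which is one way to see why the number of retrieved items needs careful experimentation.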

Ready to zoom in? Let’s—but not without a map.

If we think of the web data extraction stack as a layered structure, then we can map the different sessions onto its key components. The breakdown below shows how these sessions align.

You can also watch the respective sessions for each topic on demand here.

Topic: Infrastructure for ban handling > Proxy management

Session: The Future of Proxy Technology: Trends and Innovations in Residential, Mobile & DataCenter Proxies - Panelists

Watch if you’re interested in hearing about:

  • Trends and challenges in proxy usage

  • How IP reliability is managed and how fraud scoring works

  • How geolocation databases are used for geolocation and IP management

  • The implications that unethical proxy service providers have for the market, and what you need to be aware of as a user of these services

Topic: Infrastructure for ban handling > Proxy management

Session: Cache, Cookies, Reconnects: Accelerate scrapes with session management - Joel Griffith, Browserless

Watch if you’re interested in hearing about:

  • Three techniques for managing browser automation (cache, cookies, and process management)

  • A poor man’s way to scale using Chrome

  • Good use cases for using a cache in web applications

  • The advantages of using cookies compared to user data directories

  • What to consider when choosing a caching strategy

  • Which tool Joel used for load balancing and routing requests

Topic: Infrastructure for ban handling > Proxy management

Session: Distributed Intelligence for Distributed Data - Matthew Blumberg, Charity Engine

Watch if you’re interested in hearing about:

  • Some of the unique features of Charity Engine’s large-scale distributed computing resources

  • The flexible development options available to developers

  • How Docker and WebAssembly files play into this

  • Who can donate compute and who can purchase the compute

  • How Charity Engine ensures data privacy and security when using volunteered computing resources

  • What systems and processes are put in place to detect and prevent misuse such as DoS attacks

Topic: Data pipeline and workflow

Session: Harnessing the Power of Large Language Models for Advanced Data Engineering and Data Science - Neelabh Pant, Walmart

Watch if you’re interested in hearing about:

  • The limitations of traditional data processing methods

  • How Neelabh’s team uses LLMs to address common issues found in data engineering for retail datasets, such as missing categories and descriptions

  • How category-specific prompts are constructed for feature extraction

  • The ethical considerations when using LLMs to generate or impute missing data, especially in sensitive industries

Topic: Data crawling and extraction

Session: Advanced techniques and innovations for extracting specific data attributes from diverse sources - Iván Sánchez, Zyte

Watch if you’re interested in hearing about:

  • Why you might consider using LLMs for data extraction when traditional scraping methods exist

  • How Iván optimized for token usage and how it affects cost and performance

  • What the ROUGE metric is

  • Findings regarding fine-tuning versus in-context learning for LLMs

  • Why quality data trumps larger quantities of noisy data for LLM training

  • The current limitations of running LLMs at scale

  • How Iván mitigated the hallucination problem in LLM-based web data extraction

Topic: Rendering and interacting

Session: Enabling Large Language Models (LLMs) agents to understand the web - Asim Shrestha, Reworkd

Watch if you’re interested in hearing about:

  • The limitations of using raw HTML for webpage parsing

  • The importance of visual cues in web pages and the limitations of HTML alone

  • Reworkd’s implementation of a 2D rendering algorithm that leverages OCR to transform web pages into structured strings (and it’s open source!)

  • The importance of evals for iterating and improving on web tasks, and why Reworkd AI developed and released Bananalyzer to tackle this issue

Topic: Data processing

Session: How to feed Large Language Models (LLMs) with data from the web - Jan Čurn, Apify

Watch if you’re interested in hearing about:

  • The value of retrieval-augmented generation (RAG) for LLM applications

  • The process of building a customer support agent with LLMs and RAG

  • How to use Apify’s website content crawler and Pinecone integration to scrape data and store it in a vector database for use in RAG pipelines

  • Why it is important to strip down HTML content when scraping
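Several of the sessions above touch on why raw HTML is costly to feed to an LLM. As a rough sketch of what stripping down HTML can look like, the example below keeps only visible text, using Python's standard-library `html.parser`. The `TextOnly` class, the `strip_html` helper, and the sample page are hypothetical and not taken from any of the talks, and character counts are only a crude proxy for token counts.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def strip_html(html: str) -> str:
    """Return the visible text of an HTML document as one string."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical page fragment, invented for this example.
page = """
<html><head><style>.price{color:red}</style>
<script>trackPageView();</script></head>
<body><div class="product"><h1>Steel Bottle</h1>
<span class="price">$12.00</span></div></body></html>
"""

text = strip_html(page)
print(text)
print(len(page), "characters of HTML ->", len(text), "characters of text")
```

Dropping markup this aggressively also discards structure and visual cues, which is the trade-off the Reworkd session explores.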

In his session, A Practical Demonstration of How to Responsibly Use Big Data to Train LLMs, Joachim Asare offered three actionable recommendations for anyone involved in implementing a web data extraction stack.

  1. AI-Assisted Data Cleaning: Use LLMs to assist in cleaning data by identifying and removing sensitive information like names and phone numbers.

  2. Privacy, Anonymisation, and Bias Mitigation: Prioritize filtering out Personally Identifiable Information (PII) during the data collection stage. This involves not just removing usernames but also thoroughly examining the content to ensure no sensitive information is inadvertently included. Be aware of potential biases in scraped data, such as demographic overrepresentation. Techniques like word clouds can help identify biases.

  3. Data Security and Privacy Practices: Use techniques like differential privacy and human-in-the-loop systems to improve data handling processes.
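As one hedged illustration of the PII filtering in recommendation 2, a rule-based redaction pass can serve as a first filter at collection time. The patterns below are simplified assumptions that will miss many real-world formats, the `redact_pii` helper is hypothetical, and catching names would need a model-based step such as the LLM-assisted cleaning in recommendation 1.

```python
import re

# Simplified patterns, assumed for illustration; real PII detection needs
# broader rules plus model-based checks for names and free-text identifiers.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text


sample = "Contact jane.doe@example.com or call +1 (512) 555-0100 for details."
print(redact_pii(sample))
# prints: Contact [EMAIL] or call [PHONE] for details.
```

Replacing matches with typed placeholders rather than deleting them keeps the sentence readable for downstream processing while removing the identifying value.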

He also delved into the challenges of using publicly scraped data, particularly the risk of a model memorising specific data points, which could compromise privacy and violate ethical guidelines. He highlighted key considerations for deploying models in low-resource contexts, where constraints like limited computational power and sparse training data demand creative and efficient solutions.

These thought-provoking questions were also addressed during the talk:

  • Should companies behind LLMs make their models open source to foster transparency and community collaboration?

  • Can niche models be improved by retrofitting context using datasets with similar themes, thereby enhancing their applicability to specialised tasks?

  • How can overfitting be mitigated when incorporating human-in-the-loop feedback, ensuring the model remains generalisable while benefiting from nuanced corrections?

Business Insights For Business Leaders Buying Web Data

If you're a business leader working with web data, these sessions are a must.

Session: Web Data Extraction Mastery: Real-World Implementations and ROI-Driven Success Stories

Speaker(s): John Fraser | Founder at Parts ASAP

Watch if you’re interested in hearing about:

  • The challenges of in-house web data extraction

  • What factors John considers important when looking for a partner for web scraping

  • Why PartsASAP employs human verification in the process of matching product details

  • What a streamlined process adopting a standardised schema allowed PartsASAP to do

  • What contributed to PartsASAP’s 20% YoY growth

  • Why John advocated for a “collect now, analyse later” approach

  • What John means by “consistency over reactivity”

Session: How We Transformed Zyte's Data Business with Cutting-Edge AI Technology

Speaker(s): Iain Lennon | Chief Product Officer at Zyte

Watch if you’re interested in hearing about:

  • How the cost of data extraction changed over time

  • The different cost structures of generative AI and custom code in handling less structured, “wicked” problems

  • The tradeoff between freedom and schema in AI

  • How to control the costs associated with using large language models

  • The overarching problem that Zyte aims to solve with composite AI

  • What impact the high setup costs of data products have on customers

Session: The Future of Proxy Technology: Trends and Innovations in Residential, Mobile & DataCenter Proxies (Panel)

Speaker(s): Jason Grad (Massive), Neil Emeigh (Rayobyte), Ovidiu Dragusin (Servers Factory), Shane Evans (Zyte), Tal Klinger (The Social Proxy), and Vlad Harmanescu (Pubconcierge)

Watch if you’re interested in hearing about:

  • The impact and implications of unethical proxy services for customers and the industry

  • Methods used by ethical IP companies to source, monitor, and ensure compliance

  • Trends and challenges in proxy usage, e.g. success rates of data center IPs, residential, and ISP proxies

  • The complexities of managing IP geolocation

Session: Distributed Intelligence for Distributed Data

Speaker(s): Matthew Blumberg | Co-founder at Charity Engine

Watch if you’re interested in hearing about:

  • Ideas for how web scraping and infrastructure can be used for good, from genomic sequence screening to drug candidate identification

  • Details of Charity Engine’s socially conscious business model

  • How Charity Engine ensures data privacy and security when using volunteered computing resources

  • The market potential of distributed computing services in AI

Session: A Practical Demonstration of How to Responsibly Use Big Data to Train LLMs

Speaker(s): Joachim Asare | AI/ML Engineer & Master’s in Design Engineering at Harvard University

Watch if you’re interested in hearing about:

  • What you should consider when deploying a model in a low-resource context

  • Why data buyers should demand transparency and ethical data handling from their providers

  • The implications of using pre-trained models for businesses and how AI models may require additional fine-tuning and security checks to meet specific use case requirements

  • How success from using synthetic data to train models varies by use case

Session: Navigating the Legal Landscape of Web Data Extraction (Panel)

Speaker(s): Sanaea Daruwalla (Zyte), Hope Skibitsky (Quinn Emanuel), Stacey Brandenburg (ZwillGen), and Don D'Amico (Glacier Network and Neudata)

Watch if you’re interested in hearing about:

  • The legality of web scraping operations and the nuances of what, how, and why data is being scraped

  • How multiple types of statutes and legal theories play into the legal landscape of web scraping

  • The distinction between browse wrap and click wrap terms of service and their enforceability

  • The panelists’ perspective on the X vs. Bright Data case, focusing on allegations about the impact of scraping on target servers

  • Discussions on the rising number of legal cases in the AI world, and the significance of the “monkey photo” case in the AI copyright discussions

This session was highly informative and well worth blocking out one hour to digest the nuanced discussion.

Closing Thoughts

We hope this snapshot piqued your curiosity and gave you a good foundation to start surfacing the rich insights from each talk. You can find the full playlist of the day-two sessions here.

You can also watch the on-demand talks from the past six years of Extract Summit here to gain a deeper perspective of how the landscape has evolved throughout the years.

If you didn’t manage to attend in 2024, here is your chance to register for the 2025 event. Do consider applying for a speaking slot!
