6 Key Takeaways from Extract Summit 2022

We are thrilled with the success of this year’s Extract Summit, which took place last week in London, UK.

Finally, we were able to go back to the original in-person format.

It was very gratifying to see the web data community, industry experts and top professionals connecting face-to-face once again.

The event drew 200 attendees from around the world and featured an incredible line-up of 13 speakers, whose insightful talks covered all aspects of web data extraction. It also had a good mix of panels hosted by industry experts from different organizations and Zyte executives.

Real stories, experiences and business experiments were shared to help demonstrate how to “work smarter, not harder” by leveraging open web data.

This article will show you the highlights and key takeaways of the event.

You can also watch all the video recordings here.

The changing world of web data extraction - Teams, Technology and Investment

Zyte CEO, Shane Evans, started the day with a talk that explored the trends and challenges of the web data extraction industry.

The two big use cases he sees for web data are gaining strategic insights and improving operations.

Build versus buy is also still top of mind for leadership. Spending on web data stood at about $5bn in 2022, with less than 25% of it spent externally. That is set to change, with external spend expected to grow to 40% by 2025.

Web data spending stats:

54% in alternative data
23% in web data
Total spend close to $5 billion
External spending increase of 40%

He also took us through the common legal misconceptions around web data, which drew plenty of nods from the audience.

The bridge between academia and business - a new approach to web scraping

One of the most surprising things about the event was learning how web scraping is a valuable tool for academia. And yes, researchers face similar challenges when extracting web data, just like any web scraper does when trying to get data from a website.

Hannes Datta, Professor of Marketing at Tilburg University, gave an academic perspective on what quality data means to academic researchers.

A key takeaway here was how the academic and commercial perspectives on web scraping compare. They are very different, but at the end of the day both need quality data.

Bringing together academia and business could help find a fresh approach to web scraping and uncover creative ways to solve data quality challenges.

What’s next for no-code web scraping?

No-code web scraping tools are in the spotlight as a new approach to extracting data at scale.

As these become ever more popular, many companies have no-code products in their pipeline and already deploy internal procedures based on no-code web scraping. As a result, no-code platforms are opening up new possibilities for data extraction, and are likely to play a major role in the future of web scraping.

Victor Bolu, CEO of WebAutomation, had valuable insight on this. His talk provided a behind-the-scenes look at how to build a successful no-code web scraping tool, along with an analysis of the no-code market for data extraction.

No-code may seem like a new player in the data extraction landscape, but it is definitely one to look out for in the near future.

Brushing up on legal for web scraping

Just like in most industries, GDPR, CFAA, contracts, copyright laws and more also apply to web data extraction. Not everyone realizes that, before starting a data extraction project, you first need to understand how to keep it legally and ethically compliant.

Sanaea Daruwalla, Chief Legal Officer at Zyte, discussed data from a legal perspective, explaining the legality but also giving us some great insights on the ethics of web data extraction.

So before you jump into extracting web data, it is important to check the status and legality of the data (among many other things).

Those who attended this talk are now definitely more aware of how to be compliant.

Have you checked your data maturity level?

This was probably the most elaborate and in-depth talk of the day.

A Web Data Maturity Model helps organizations understand their level of expertise in processing and handling web data.

Organizations have become more interested in open web data as the ways they can leverage it have become clearer.

The “why” and “how” behind Zyte’s Web Data Maturity Model

Researching and showcasing our Web Data Maturity Model was our way of giving back to our community.

James Kehoe, Product Manager at Zyte, gave a detailed step-by-step explanation of why and how this model was created.

Zyte’s Web Data Maturity Model resulted from extracting 13 billion pages/month for 5,000 customers and conducting interviews with 40+ industry experts.

To keep things simple, there are two important aspects to consider: the “sequence” levels and the “usage” levels.

Figure: Zyte’s Web Data Maturity Model

Horizontal: The 5 levels of data maturity sequence
Vertical: The 4 levels of data maturity usage

These can give you a deeper understanding of where your organization sits on the data maturity model and help you avoid common pitfalls. You will also be better prepared to take your next steps in web data extraction and create a roadmap to increase your data maturity.

A new standard for intuitive and reliable data extraction

By the end of the event, it was clear that websites have become more difficult to scrape over recent years.

Our Head of Product, Iain Lennon, gave us a sneak peek at our newest innovation to tackle this problem: Zyte API.

With Zyte API you can forget about proxies, bans and maintenance. It is one API that automatically uses the leanest setup to reliably return HTML from any website at the lowest cost, so you can forget about the tech and focus on the data.

Enter a URL, get data. It’s that simple.
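
To make the "enter a URL, get data" idea concrete, here is a minimal Python sketch of what such a call could look like. It was not shown in the talk. The endpoint URL, the `httpResponseBody` field, the base64-encoded response and the use of the API key as the Basic auth username are all assumptions about the public API, and the helper names are hypothetical, so check the official documentation before relying on any of it.

```python
import base64
import json
import urllib.request

# Assumed endpoint and field names; they are not given in this article.
ZYTE_API_ENDPOINT = "https://api.zyte.com/v1/extract"


def build_request(url: str, api_key: str) -> urllib.request.Request:
    """Build a POST request asking for the raw response body of `url`."""
    payload = json.dumps({"url": url, "httpResponseBody": True}).encode("utf-8")
    # HTTP Basic auth with the API key as username and an empty password.
    token = base64.b64encode(f"{api_key}:".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        ZYTE_API_ENDPOINT,
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


def decode_body(api_response: dict) -> str:
    """Decode the base64-encoded page body from the parsed JSON response."""
    raw = base64.b64decode(api_response["httpResponseBody"])
    return raw.decode("utf-8", errors="replace")


def fetch_html(url: str, api_key: str) -> str:
    """Send the request and return the page HTML (needs network and a valid key)."""
    with urllib.request.urlopen(build_request(url, api_key), timeout=60) as resp:
        return decode_body(json.load(resp))
```

With a valid key, `fetch_html("https://example.com", "YOUR_API_KEY")` would return the page HTML as a string. Building the request and decoding the response are kept in separate functions so they can be checked without a network connection.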

It releases on October 27th, so keep an eye out for updates on how you can start your free trial.

A brief demonstration showed a simplified version of the typical crawl engineer workflow for solving bans with the most cost-effective solution.

And that’s not all: we expect to have an integrated extract and crawl solution with maintained spiders and machine learning in 2023.