We held the 2022 Web Data Extraction Summit three weeks ago. I wanted to extend a huge thank you to everyone who came, especially our guest speakers, who shared some great insights throughout the day.
The summit has changed a lot over the last few years, so I thought I’d take some time to reflect and talk about some of my favorite moments. Don’t worry if you couldn’t make it along. All of our recordings are now live, so if you’re interested in anything I mention, you can learn more using the link below.
Despite months of planning, I was still a little nervous as the day began. Between turnout, technical glitches, and cancellations, there was plenty that could’ve gone wrong, but in the end it was all worth it. We had a good split of in-person and virtual guests, and nearly everyone stayed until the end.
In addition to the learning aspect, the summit had a great social atmosphere. It was encouraging to see attendees get along, engage with the speakers and come along to the after-party. I even met some attendees in a coffee shop the day after and we caught up before my flight home.
"Very professional and inclusive. Thought provoking content. Very good use of my time."
This year, we experimented with a different format from last year’s. Instead of parallel events, we delivered a more linear agenda which, I think, meant that there was plenty of variety throughout the day. One minute there was a live code demonstration, the next, our Chief Legal Officer, Sanaea Daruwalla, was walking people through the regulations and ethics of web scraping.
Overall, I think we managed to exhibit a healthy cross-section of current and future issues within data extraction. We’re still at a stage where knowledge transmission is slow, so I think attendees particularly enjoyed finding they shared the same struggles and learning new approaches.
With things being back in person this year, I was completely blown away by the effort that attendees made to come along. We had people join us from every corner of the globe, including the US, Europe, India, and other parts of Asia. We always wanted the Web Data Extraction Summit to be ‘the event’ for our industry, so I’m always taken aback at just how popular it’s become.
The web data extraction industry feels so much larger nowadays and I think that growth is reflected in the types of people you meet and projects you hear about. Like previous years, we had lots of software engineers, developers and the like come along. However, one thing that stood out to me this time round was that attendees were more senior. I spoke to managers, team leads, CEOs, and even whole data teams.
I also had the chance to speak with a university professor who attended the summit. They mentioned that they’d previously been teaching web data extraction to post-grads but had recently delivered sessions to undergraduate students.
Altogether, I think this suggests we’re entering a new era of web data extraction where our tools and methods will become more mainstream and accessible, something that Victor Bolu highlighted in his talk, ‘The future of no-code web scraping’.
A lot of the talks were reflective of the trends I’d identified in my state of the industry address and it was interesting to see the different approaches to each topic. For example, Neil Emeigh’s session explored the use of web proxies to combat bans and how developers need to be mindful of ethical sourcing throughout.
This built on some of the compliance and scaling issues I had alluded to earlier and segued well into Sanaea Daruwalla’s discussion on the legal do’s and don’ts of web data extraction. As web extraction grows, I believe it’s important to champion these standards, especially in areas where we’re outpacing existing regulations.
Later, Glen De Cauwsemaecker gave an excellent presentation on maintaining data quality while growing your data feeds. The topic of scaling comes up each year, so it was refreshing to see how he balanced the need for actionable insights with his growth aspirations. Glen walked us through lessons from building extraction infrastructure over the last decade, much of which overlapped with James Kehoe’s session on the data maturity model.
Of course, it wouldn’t be a conference on web data extraction without numerous talks on crawling methods. We were joined by Peter Bray and Guillaume Pitel, each of whom explored how machine learning could hasten and enhance the data extraction and categorization processes. I think we’ll begin to see more of these tools as web data extraction is applied to new use cases and research questions, so make sure you catch their talks at some point.
Finally, two sessions really stood out for me personally for using web data extraction in truly novel ways. The first was Alexander Lebedev’s session on data mining amid the war in Ukraine. He gave an intimate look into the conflict and showed how web data extraction could help him navigate fundamental questions like when to sleep in a war zone. The second was Hannes Datta’s talk on the use of web data in academic research. Similar to Alexander’s talk, Hannes outlined how web data extraction could help researchers understand human behavior and online trends.
I’m incredibly grateful to everyone who came along to our 2022 summit, so I’d like to share one last thank you; we hope you enjoyed this year’s event as much as we did. We’re already making plans for next year, so if you’d like to stay in the loop and hear updates, you can register for early access tickets.
Alternatively, if you weren’t able to make it along or you’d like to watch some of the sessions again, I’m pleased to announce that all of our recordings are now live! You can watch them all for free as many times as you like using the link here.