Voy Zeglin
10 mins
June 10, 2024

Voy Zeglin, DataMiners CEO: Pioneering web scraping technologies and building a developer community, Part 2

In the second part of our in-depth interview with Voy Zeglin, we delve into the nitty-gritty of web scraping. Voy shares his expertise on how his company, DataMiners, deals with anti-scraping measures and maintains high data quality. We also discuss post-extraction processes, the legal and compliance issues intrinsic to web scraping, and look ahead at future trends in the industry. Voy offers advice to aspiring developers and data analysts and gives us his take on the Zyte technology stack.

Dealing with Anti-Scraping Measures:


How do you handle website anti-scraping measures like CAPTCHA, IP bans, or rate limits?


Right. Now we're getting to the nitty-gritty. Some time ago CAPTCHAs could be addressed, and still are, by using services known as CAPTCHA solvers, which combine state-of-the-art algorithms for the easier challenges with human-in-the-loop (HITL) work for the more difficult cases. That simply means someone, somewhere, physically solves your CAPTCHA for you. Quite inefficient compared to what AI can offer nowadays.


We are actually at a tipping point where CAPTCHAs will probably become obsolete and be replaced by another technology, like digital IDs. This is simply a derivative effect of AI competing with itself.


IP bans and rate limits can easily be addressed by taking advantage of large, commercial-scale IP pools from established proxy providers. There's no need to reinvent the wheel; building a proxy farm from the ground up is overkill, and I know something about that. You can build a whole business around that one service alone.
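
To make that concrete, here is a minimal sketch of pool-based rotation using the requests library. The proxy URLs and credentials are placeholders standing in for whatever your provider gives you, not real endpoints.

```python
import random

import requests

# Placeholder endpoints from a hypothetical commercial proxy provider.
PROXY_POOL = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
    "http://user:pass@proxy3.example.com:8000",
]

def fetch(url: str) -> requests.Response:
    """Fetch a URL through a randomly chosen proxy from the pool."""
    proxy = random.choice(PROXY_POOL)
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)

response = fetch("https://example.com")
print(response.status_code)
```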


Are there any ethical considerations you keep in mind while circumventing these measures?


Sure, ethics are always important, and in my opinion data extraction can be done ethically. There is a thin line between the ethical aspects of web scraping and the legal environment. If we're talking about ethics, it's crucial to remember that we don't exist in a vacuum and that our actions directly impact other people's lives and businesses. This is so conveniently ignored when we're sitting in our basement, just hitting the keyboard for hours. There is a human aspect to our digital activities, and the two are inseparable. Keeping the moral compass pointing in the right direction is crucial, and something I'd like to practice more often myself.


Data Quality and Monitoring:


What strategies do you use for data extraction, storage, and management?


We do use various cloud services extensively, but sometimes we resort to more custom solutions and private servers. Some clients have their own hardware infrastructure that we use. SQL and NoSQL databases are the most obvious choices when it comes to database architecture, along with a plethora of management applications.
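
As a toy illustration of the relational route, this sketch persists scraped records into SQLite with Python's standard library. The table and field names are made up for the example.

```python
import sqlite3

# Hypothetical scraped records; the schema is purely illustrative.
records = [
    {"url": "https://example.com/p/1", "title": "Product 1", "price": 9.99},
    {"url": "https://example.com/p/2", "title": "Product 2", "price": 19.99},
]

conn = sqlite3.connect("scraped.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS products (url TEXT PRIMARY KEY, title TEXT, price REAL)"
)
# Upsert by URL so re-scraping the same page refreshes the stored row.
conn.executemany(
    "INSERT OR REPLACE INTO products (url, title, price) VALUES (:url, :title, :price)",
    records,
)
conn.commit()
conn.close()
```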


How do you ensure the quality and accuracy of the scraped data?


To maintain data quality, it's crucial to have clearly defined requirements, efficient quality assurance processes, and continuous monitoring. Automation plays a significant role here. Additionally, a multi-layered approach to validation ensures the data is thoroughly checked before it is delivered to clients.
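
As a sketch of what one such validation layer might look like, here is a simple record checker; the required fields and rules are assumptions for the sake of the example, not DataMiners' actual checks.

```python
def validate(record: dict) -> list[str]:
    """Return a list of problems found in a scraped record (empty if clean)."""
    problems = []
    if not record.get("url", "").startswith("http"):
        problems.append("missing or malformed url")
    if not record.get("title"):
        problems.append("empty title")
    price = record.get("price")
    if not isinstance(price, (int, float)) or price < 0:
        problems.append("missing or negative price")
    return problems

record = {"url": "https://example.com/p/1", "title": "", "price": -5}
issues = validate(record)
if issues:
    print("rejected:", ", ".join(issues))  # rejected: empty title, missing or negative price
```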


How do you monitor large-scale web scraping projects?


We usually use our in-house solutions for data monitoring, or customized versions of commercially available tools. Monitoring large-scale web scraping projects involves a comprehensive approach to proxy management and circumvention strategies. Firstly, understanding the traffic profile of the project is crucial: the targeted websites, the request volume, and the geographic locations for which data is needed. Secondly, you need a robust proxy pool tailored to that traffic profile, considering factors such as the number of proxies required, their locations, and their type (datacenter or residential). Finally, efficient proxy management is important for long-term scalability, involving intelligent proxy rotation, for example.
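
To illustrate what "intelligent proxy rotation" can mean in practice, here is a toy rotator that tracks per-proxy failures and retires proxies that misbehave. The thresholds and structure are assumptions for the example, not a description of DataMiners' tooling.

```python
import random
from collections import defaultdict

class ProxyRotator:
    """Rotate proxies, retiring any whose failure rate grows too high."""

    def __init__(self, proxies, max_failure_rate=0.5, min_attempts=10):
        self.active = list(proxies)
        self.stats = defaultdict(lambda: {"ok": 0, "fail": 0})
        self.max_failure_rate = max_failure_rate
        self.min_attempts = min_attempts

    def pick(self) -> str:
        """Choose a proxy at random from the active pool."""
        return random.choice(self.active)

    def report(self, proxy: str, success: bool) -> None:
        """Record a request outcome and retire the proxy if it fails too often."""
        stats = self.stats[proxy]
        stats["ok" if success else "fail"] += 1
        attempts = stats["ok"] + stats["fail"]
        failure_rate = stats["fail"] / attempts
        if attempts >= self.min_attempts and failure_rate > self.max_failure_rate:
            if proxy in self.active and len(self.active) > 1:
                self.active.remove(proxy)
```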


What Happens After Extraction:


What happens after the data is extracted? How do you deliver it to customers?


We have several methods of delivering the data to our customers. Depending on the complexity of the project and some other factors, we can offer a whole range of approaches. Sometimes even a simple email will work, and sometimes it has to be more complex. If our clients have a decentralized array of receiving microservices on their end, we design our delivery mechanisms accordingly.
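
For the more complex end of that range, one common pattern is pushing batches of newline-delimited JSON to an HTTP endpoint on the client's side. The endpoint below is purely a placeholder.

```python
import json

import requests

# Placeholder URL for a client-side receiving service.
CLIENT_SINK = "https://client.example.com/ingest"

def deliver(records: list[dict]) -> None:
    """POST a batch of records as newline-delimited JSON."""
    payload = "\n".join(json.dumps(record) for record in records)
    response = requests.post(
        CLIENT_SINK,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/x-ndjson"},
        timeout=30,
    )
    response.raise_for_status()  # fail loudly if the sink rejects the batch

deliver([{"url": "https://example.com/p/1", "title": "Product 1"}])
```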


Do customers expect you to do the analytics and build dashboards for the product as well?


Yes, sometimes we also deliver "the packaging", so to speak. Our customers do not always have the resources to create complex data analytics systems, and we are always glad to help. We've built a few of these already.


What does the data pipeline look like?


Data pipelines are elements of more sophisticated data delivery processes. They automate the flow of data from our infrastructure to so-called "sinks", the destination systems on the client's side. Along the way, data is picked up by a processing engine and various operations are performed to produce the final, expected result, like a pretty pie chart on the director's computer!
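
In the simplest terms, such a pipeline can be sketched as a chain of stages, each feeding the next; the stages below are stand-ins for real scraping, cleaning, and delivery steps.

```python
def extract():
    """Stand-in for the scraping stage: yield raw records."""
    yield {"title": "  Product 1 ", "price": "9.99"}
    yield {"title": "Product 2", "price": "19.99"}

def transform(records):
    """Clean and normalize each record."""
    for record in records:
        yield {"title": record["title"].strip(), "price": float(record["price"])}

def load(records):
    """Stand-in for the sink: print here; a real sink would persist or forward."""
    for record in records:
        print(record)

load(transform(extract()))
```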


Legal and Compliance Aspects:


How do you navigate the legal and compliance aspects of web scraping?


From a service provider's point of view, I could not imagine operating in this environment without a dedicated legal team, especially when you start working with corporate clients. Depending on your jurisdiction, different laws will apply, so you always need to keep your finger on the pulse. Legal aspects are often brushed off, so it's great what you do here at Zyte to spread awareness by addressing these issues. It helps a lot of people.


What advice would you give to someone new to web scraping regarding legal considerations?


Basically, make sure to always follow your local laws, don't break the internet, and don't do anything illegal, like stealing data or breaking into systems. If you send too many requests too fast, that can lead to a denial of service. Remember that we have a thing called copyright, which is enforced almost everywhere in the world. You can't always "download something from the internet" and use it freely, for example a random photo you find on Google Images. If you are not sure about something, seek legal advice! Ask around; most likely you will find someone who has encountered your issue before.
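
Two easy habits help with the denial-of-service point: respect robots.txt and throttle your requests. A minimal sketch with Python's standard library plus requests, against a placeholder site:

```python
import time
from urllib.robotparser import RobotFileParser

import requests

BASE = "https://example.com"  # placeholder target

# Fetch and parse the site's robots.txt once up front.
robots = RobotFileParser(BASE + "/robots.txt")
robots.read()

for url in [BASE + "/page/1", BASE + "/page/2"]:
    if not robots.can_fetch("MyScraperBot", url):
        print(f"skipping {url}: disallowed by robots.txt")
        continue
    requests.get(url, headers={"User-Agent": "MyScraperBot"}, timeout=30)
    time.sleep(2)  # throttle: a couple of seconds between requests
```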


Future Trends and Advice:


Where do you see the future of web scraping heading in the next few years?


Oh boy. Nothing lasts forever, and the same applies to technology, perhaps especially in this day and age. Advancements are moving at a much faster pace than ever before. I believe in the open source ideology and in cooperation between developers, who are constantly creating new services that aggregate distributed pieces of code. I am sure this landscape will evolve, but only time will tell how. Maybe we won't need proxies? Maybe CAPTCHAs will go away? Or maybe, on the contrary, if we start using digital IDs, web scraping will become more difficult? I'm curious how the internet will look ten years from now and what kind of devices we will use to access it. That will shape our industry's future.


What advice would you offer to someone looking to start or improve their web scraping projects?


Good advice is something a man gives when he is too old to set a bad example. But jokes aside, if you are just starting out, remember that learning this skill is something of a career choice. It's a competence you will have to nurture, spending many months or even years on it if you want to become proficient (unless you get a chip in your brain). It's quite important to join a few online communities as soon as possible and find a mentor; even a YouTube channel will do, as there are many brilliant teachers over there.


It's also super important to choose your clients carefully. There are so many temptations to just work on any project that comes our way. The competition is fierce, and we are inclined to throw ourselves at anything that pops up, no matter the side costs and collateral damage. I understand that our livelihoods are often at stake and it may be very difficult to turn down an offer that comes along, but let's all try to do the right thing, for our collective sake.


Thoughts on Zyte Tech Stack:


Share your experience using the Zyte Tech Stack: Scrapy, Zyte API, Scrapy Cloud.


In general, getting started with the Zyte Tech Stack was a breeze. Scrapy provided a solid foundation for crafting web crawlers effortlessly, and Scrapy Cloud is like a command center for deploying and managing the spiders, offering convenient scheduling and scaling options. Smooth sailing all around.
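
For anyone who has not seen it, this is roughly what a minimal Scrapy spider looks like, crawling the quotes.toscrape.com sandbox site:

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    """Minimal spider: yield one item per quote on the page."""
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
```

Run it locally with `scrapy crawl quotes -o quotes.json` from inside a Scrapy project, and deploying the same spider to Scrapy Cloud is a `shub deploy` away.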


With the Zyte API in play, it’s possible to elevate your scraping game, tapping into features not readily available through standard methods. I don’t have any particular feature requests at this point, keeping fingers crossed for you guys!


Final Thoughts:


Thank you for the opportunity to talk to you. I am very passionate about this topic, and the last several years of my professional career have been focused on data extraction.


Do you have any resources, tools, or communities you would recommend?


Actually, yes. I encourage everyone interested to visit our Facebook group called “Web Scraping World” (look for a magenta DataMiners logo), where you can chat with our community members, find some projects to work on and get some more guidance. 


Personally, I am also a big fan of Pierluigi Vinciguerra from The Web Scraping Club. He's one of the best teachers of web scraping nowadays. On Discord, join groups like "Web Scraping & Data Extraction", "Extract Data Community", "Web Scraping and Automation", and the aforementioned "The Web Scraping Club".


Lastly, if anyone would like to reach out to me directly, you can find me as Voy Zeglin on LinkedIn or as @zeglin on Twitter.