This is the first issue of the Zyte Developers Community newsletter. Enjoy!
In this issue:
- Scrapy plays well with Playwright
- The latest Python security fix might have broken your spider
- IP rotation open source project in Kotlin
- Beautifulsoup vs Selenium vs Scrapy
- Scrapy 2.5 is in the works
- Advanced Python web scraping: best practices & workarounds
Scrapy plays well with Playwright
Playwright is a popular tool for testing and automating Chromium, Firefox and WebKit. And, as is always the case with browser automation tools, it has become a popular choice among web scraping devs as well.
And yes, there's a Scrapy integration (not that well known yet, so give it some love!).
So if you like the convenience of Scrapy, but need to act like a real browser when scraping in the wild, check out scrapy-playwright.
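To give a rough idea of what the integration involves, here is a minimal sketch of the settings that scrapy-playwright's README describes, assuming the package is installed next to Scrapy. The playwright_meta helper is a hypothetical convenience made up for this example and is not part of either library.

```python
# settings.py fragment (a sketch): send HTTP(S) downloads through scrapy-playwright.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# scrapy-playwright relies on the asyncio-based Twisted reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"


def playwright_meta(**extra):
    """Hypothetical helper: build a Request.meta dict that opts a request into Playwright."""
    return {"playwright": True, **extra}
```

In a spider you would then yield something like scrapy.Request(url, meta=playwright_meta()). Requests without the "playwright" key should still go through Scrapy's regular downloader, so you can mix browser-rendered and plain requests in the same spider.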
The latest Python security fix might have broken your spider
There has been an update to Python's URL parsing module (urlparse, known as urllib.parse in Python 3) which might have caused some spiders to break. To be fair, most developers are probably not affected, and no issue has been raised about it yet, which might mean it's not that big of a problem. But it's good to know.
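The item above doesn't say which change it means. One candidate, and this is an assumption, is the security fix after which parse_qs and parse_qsl stopped treating ";" as a query-string separator by default. If that's the one, a quick check like the sketch below shows the difference on a patched interpreter (the separator keyword only exists on patched versions).

```python
from urllib.parse import parse_qs

query = "a=1;b=2&c=3"

# On patched interpreters only "&" separates pairs by default,
# so ";b=2" stays part of the value of "a".
default_split = parse_qs(query)

# Explicitly asking for ";" as the separator restores the semicolon split.
semicolon_split = parse_qs("a=1;b=2", separator=";")

print(default_split)
print(semicolon_split)
```

If your spider parses query strings that use ";" between parameters, that's the place to look first.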
IP rotation open source project in Kotlin
Luka Spahija recently open sourced his IP rotation tool written in Kotlin. It's called Torchestrator. Why the name? Because it spins up Tor containers with different IP addresses. Check it out on GitHub.
Scrapy 2.5 is in the works
...and it will support HTTP/2 (experimental). Check the pending work here. It will be released soon.
Beautifulsoup vs Selenium vs Scrapy
I'm glad to see more and more web scraping video creators popping up in my feed on YT. I'd like to highlight one video that caught my eye recently:
- Beautifulsoup vs Selenium vs Scrapy - Which tool for web scraping in 2021?
John gives a good, simple explanation of the differences between these tools and when to use each one - very beginner friendly.
Advanced Python Web Scraping: Best Practices & Workarounds
This is a fairly long read, but it covers some good tips and tricks that a developer should know when scraping the web. It starts from the basics and moves on to advanced topics like header inspection, honeypots, captchas, etc.