Scaling and Maintaining Crawling and Extracting Solutions

4. Scaling and maintaining crawling and extracting solutions

A robust, responsive and reliable web scraping tech stack is key to scaling. Maintenance, reusability, portability and scalability are the challenges teams face when scaling web scraping operations.

Planning for never-ending maintenance

Web scraping would be easy if websites were static entities. Unfortunately, they are not, and they change often. User interfaces are periodically updated to change layouts, navigation structures or the user experience. Structural changes to a website's underlying HTML and CSS can alter tags, class names and IDs. On JavaScript-heavy websites, scripts or methods can change. Even updates to the content management system can trigger changes. The biggest maintenance burden when scaling web scraping projects, however, is websites with anti-bot mechanisms designed to deter scraping.

These changes break web crawlers and extractors. Your critical data feeds can be broken by changes at any time, impacting downstream systems. It’s a game of whack-a-mole that needs an effective, automated and detailed monitoring and alerting system. You can’t scale unless you’re using automation extensively in your maintenance routines.
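One simple form of automated monitoring is to check how often each expected field is actually populated in a batch of extracted records, and to raise an alert when that share drops. The sketch below is illustrative only: the field names, the thresholds and the helper names (field_coverage, check_batch, FieldAlert) are assumptions made for this example, not part of any particular scraping framework.

```python
from dataclasses import dataclass


@dataclass
class FieldAlert:
    field: str
    coverage: float   # fraction of records with a non-empty value
    threshold: float  # minimum acceptable coverage (hypothetical setting)


def field_coverage(records, fields):
    """Return the fraction of records with a non-empty value for each field."""
    total = len(records)
    coverage = {}
    for field in fields:
        filled = sum(
            1 for record in records
            if record.get(field) not in (None, "", [], {})
        )
        # An empty batch counts as zero coverage so that it triggers alerts.
        coverage[field] = filled / total if total else 0.0
    return coverage


def check_batch(records, thresholds):
    """Return one FieldAlert per field whose coverage is below its threshold."""
    coverage = field_coverage(records, thresholds.keys())
    return [
        FieldAlert(field, coverage[field], minimum)
        for field, minimum in thresholds.items()
        if coverage[field] < minimum
    ]


if __name__ == "__main__":
    # Hypothetical batch: the price selector has silently stopped matching.
    batch = [
        {"name": "Widget", "price": "9.99"},
        {"name": "Gadget", "price": ""},
        {"name": "Gizmo", "price": None},
        {"name": "Doohickey"},
    ]
    for alert in check_batch(batch, {"name": 0.95, "price": 0.90}):
        print(f"ALERT {alert.field}: {alert.coverage:.0%} < {alert.threshold:.0%}")
```

In practice a check like this would run after every crawl and forward its alerts to whatever paging or chat system the team already uses. A coverage threshold catches the common case where a layout change leaves a selector returning nothing without the crawler itself failing.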

Another aspect of maintenance to remember is infrastructure. Custom infrastructure, with its hosting and compute servers, proxy waterfalls and many custom integrations, will all need maintenance by an expert team.
