Note: Portia is no longer available for new users. It has been disabled for all the new organisations from August 20, 2018 onward.
As with everything in software, we started out by investigating our requirements and what others had done in this situation. We were looking for a solution that was reliable and would allow for reproducible interaction with web pages.
Reliability: A solution that could render the pages in the same way during spider creation and crawling.
Interaction: A system that would allow us to record the user's actions so that they could be replayed while crawling.
The investigation produced some interesting and some crazy ideas; here are the ones we probed further:
We rejected options 7 and 8 because they would raise the barrier to entry and make Portia more difficult to use. Import.io uses this approach for its spider creation tool.
Options 1 and 2 were rejected because it would be hard to fit the whole Portia UI into an add-on in the way we'd prefer, though we may revisit them in the future. ParseHub and Kimono use these methods to great effect.
Options 3 and 4 were investigated further, inspired by the work done by LibreOffice for their Android document editor. In the end, though, it was clunky, and we could achieve better performance by sending DOM updates rather than image tiles.
The solution we have now built is a combination of options 5 and 6. The most important aspect is the server-side browser. This browser provides a tab for each user, allowing the page to be loaded and interacted with in a controlled manner.
We looked at existing solutions including Selenium, PhantomJS and Splash, all of which wrap a browser engine and add domain-specific functionality. We chose Splash not because it is a Scrapinghub technology, but because it is designed for web crawling rather than automated testing, making it a better fit for our requirements.
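To make this concrete, here is a minimal sketch of talking to a Splash instance over its HTTP API. The `render.html` endpoint is part of Splash, but the host, port, page URL and `wait` value below are placeholders for illustration, not Portia's actual configuration:

```javascript
// Build a URL that asks a Splash instance to load a page in its
// server-side browser and return the rendered HTML.
function buildRenderUrl(splashHost, pageUrl, waitSeconds) {
  const params = new URLSearchParams({
    url: pageUrl,               // the page to render server-side
    wait: String(waitSeconds),  // give scripts time to run before snapshotting
  });
  return `${splashHost}/render.html?${params}`;
}

// Example: fetch the rendered page from a local Splash instance.
// fetch(buildRenderUrl('http://localhost:8050', 'http://example.com', 0.5))
//   .then(resp => resp.text())
//   .then(html => console.log(html.length));
```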
The server-side browser gets input from the user. Websockets are used to send events and DOM updates between the user and the server. Initially, we looked at React's virtual DOM, and while it worked, it wasn't perfect. Luckily, there is a built-in solution, available in most browsers released since 2012, called MutationObserver. This, in conjunction with the Mutation Summary library, allows us to update the page in the UI as the user interacts with it.
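The shape of that pipeline can be sketched as follows. `serializeMutations`, the websocket endpoint and the message format are our own illustration, not Portia's actual wire protocol; Mutation Summary adds higher-level diffing on top of the raw records shown here:

```javascript
// Reduce raw MutationRecords to a plain, JSON-friendly summary
// that can be sent over a websocket.
function serializeMutations(mutations) {
  return mutations.map(m => ({
    type: m.type,                            // 'childList', 'attributes' or 'characterData'
    target: m.target.nodeName,               // which node changed
    added: m.addedNodes ? m.addedNodes.length : 0,
    removed: m.removedNodes ? m.removedNodes.length : 0,
    attribute: m.attributeName || null,      // set for attribute mutations
  }));
}

// Wiring, only meaningful inside a browser page:
if (typeof MutationObserver !== 'undefined' && typeof WebSocket !== 'undefined') {
  const socket = new WebSocket('ws://localhost:8000/updates'); // placeholder endpoint
  const observer = new MutationObserver(mutations => {
    socket.send(JSON.stringify({ kind: 'dom-update', changes: serializeMutations(mutations) }));
  });
  observer.observe(document, {
    childList: true, attributes: true, characterData: true, subtree: true,
  });
}
```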
We now proxy all of the resources the page needs rather than loading them from the host. This lets us serve resources from the cache in our server-side browser or from the original host, and add SSL protection to resources when the host doesn't already provide it.
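The idea behind the proxying can be sketched as rewriting each resource URL so the browser fetches it through our own endpoint. The `/proxy?url=...` endpoint name is a placeholder, and the regex rewrite is a naive illustration; a real implementation would rewrite the parsed DOM and handle caching and SSL:

```javascript
// Route a single resource URL through a (hypothetical) proxy endpoint.
function proxyUrl(resourceUrl, proxyBase) {
  return `${proxyBase}/proxy?url=${encodeURIComponent(resourceUrl)}`;
}

// Rewrite absolute src/href attributes in an HTML string so every
// resource is loaded via the proxy instead of the original host.
function rewriteResources(html, proxyBase) {
  return html.replace(/(src|href)="(https?:\/\/[^"]+)"/g,
    (_, attr, url) => `${attr}="${proxyUrl(url, proxyBase)}"`);
}
```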
For now, we’re very happy with how it works and hope it will make it easier for users to extract the data they need.
Check out our Automatic Data Extraction solution.