We’ve just released a new open-source Scrapy middleware that makes it easy to integrate Zyte Automatic Extraction into your existing Scrapy spiders.
If you haven’t heard about Zyte Automatic Extraction (formerly AutoExtract) yet, it’s an AI-based web scraping tool that automatically extracts data from web pages without the need to write any code.
Learn more about Zyte Automatic Extraction here.
This project uses Python 3.6+ and pip. A virtual environment is strongly encouraged.
$ pip install git+https://github.com/scrapinghub/scrapy-autoextract
DOWNLOADER_MIDDLEWARES = {
    'scrapy_autoextract.AutoExtractMiddleware': 543,
}
This middleware should be the last one to be executed, so make sure to give it the highest value among your downloader middlewares.
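To make the ordering rule concrete, here is a minimal sketch that checks whether the middleware has the highest order value in a `DOWNLOADER_MIDDLEWARES` dict. The helper `is_last_middleware` and the `myproject.middlewares.CustomMiddleware` entry are hypothetical and only serve as illustration. The check looks only at the dict you pass in, not at Scrapy's built-in defaults.

```python
def is_last_middleware(middlewares, name):
    """Return True if `name` has the highest order among enabled middlewares."""
    # A value of None disables a middleware in Scrapy, so skip those entries.
    enabled = {k: v for k, v in middlewares.items() if v is not None}
    return name in enabled and enabled[name] == max(enabled.values())


DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomMiddleware': 400,  # hypothetical entry
    'scrapy_autoextract.AutoExtractMiddleware': 543,
}

print(is_last_middleware(
    DOWNLOADER_MIDDLEWARES,
    'scrapy_autoextract.AutoExtractMiddleware',
))
```

If you add another middleware later, rerunning a check like this is a quick way to notice that the AutoExtract entry needs a larger value.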
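These settings must be defined in order for Zyte Automatic Extraction to work: AUTOEXTRACT_USER (your API key) and AUTOEXTRACT_PAGE_TYPE (the kind of data to extract, for example 'article'). Both appear in the example settings further down.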
Optional
Zyte Automatic Extraction requests are opt-in, so extraction must be enabled explicitly for each request by adding the following to the request's meta:
meta['autoextract'] = {'enabled': True}
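Because the opt-in is just a key in the request's meta dict, it can be added with a small helper. The sketch below uses a hypothetical function, `with_autoextract`, which is not part of the middleware. It copies an existing meta dict and sets the flag, so other meta keys are preserved.

```python
def with_autoextract(meta=None):
    """Return a copy of `meta` with Zyte Automatic Extraction enabled."""
    merged = dict(meta or {})  # copy so the caller's dict is not mutated
    merged['autoextract'] = {'enabled': True}
    return merged


print(with_autoextract({'page': 2}))
```

In a spider, the result could then be passed as `scrapy.Request(url, meta=with_autoextract())`.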
If a request was sent to Zyte Automatic Extraction, you can access the result inside your Scrapy spider through the response's meta attribute:
def parse(self, response):
    yield response.meta['autoextract']
In the Scrapy settings file:
DOWNLOADER_MIDDLEWARES = {
    'scrapy_autoextract.AutoExtractMiddleware': 543,
}

# Disable AutoThrottle middleware
AUTOTHROTTLE_ENABLED = False

AUTOEXTRACT_USER = 'my_autoextract_apikey'
AUTOEXTRACT_PAGE_TYPE = 'article'
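A missing setting is easy to overlook, so a quick sanity check can help. The sketch below assumes that the two `AUTOEXTRACT_*` settings shown in the example above are the ones that must be present. Both that assumption and the helper name `missing_settings` are illustrative and not part of the middleware.

```python
# Assumed to be required, based on the example settings file.
REQUIRED_SETTINGS = ('AUTOEXTRACT_USER', 'AUTOEXTRACT_PAGE_TYPE')


def missing_settings(settings):
    """Return the required setting names that are absent or empty."""
    return [key for key in REQUIRED_SETTINGS if not settings.get(key)]


settings = {
    'AUTOEXTRACT_USER': 'my_autoextract_apikey',
    'AUTOEXTRACT_PAGE_TYPE': 'article',
}
print(missing_settings(settings))
```

An empty list means both values are set. Any names that come back point at settings still to be filled in.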
In the spider:
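import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com']

    def start_requests(self):
        # Opt each start URL in to Zyte Automatic Extraction.
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                meta={'autoextract': {'enabled': True}},
                callback=self.parse,
            )

    def parse(self, response):
        # The extraction result is available in the response meta.
        yield response.meta['autoextract']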
Example output:
[
  {
    "query": {
      "domain": "example.com",
      "userQuery": {
        "url": "https://www.example.com/news/2019/oct/15/lorem-dolor-sit",
        "pageType": "article"
      },
      "id": "1570771884892-800e44fc7cf49259"
    },
    "article": {
      "articleBody": "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat...",
      "description": "Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatu",
      "probability": 0.9717171744215637,
      "inLanguage": "en",
      "headline": "Lorem Ipsum Dolor Sit Amet",
      "author": "Attila Toth",
      "articleBodyHtml": "<article>\n\n<p>Lorem ipsum...",
      "images": ["https://i.example.com/img/media/12a71d2200e99f9fff125972b88ff395f5e..."],
      "mainImage": "https://i.example.com/img/media/12a71d2200e99f9fff125972b88ff395f5e..."
    }
  }
]
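Once you have a record shaped like the example output, you will usually want only a few of its fields. The following sketch assumes each record has the `query` and `article` keys shown above. The function name `summarize_article` and the `min_probability` threshold of 0.5 are hypothetical choices for illustration.

```python
def summarize_article(record, min_probability=0.5):
    """Pick a few fields from one record, or return None for low-confidence results."""
    article = record.get('article') or {}
    # Skip records the extractor itself is not confident about.
    if article.get('probability', 0.0) < min_probability:
        return None
    return {
        'url': record.get('query', {}).get('userQuery', {}).get('url'),
        'headline': article.get('headline'),
        'author': article.get('author'),
    }


# A trimmed-down record based on the example output.
record = {
    'query': {'userQuery': {'url': 'https://www.example.com/news/2019/oct/15/lorem-dolor-sit'}},
    'article': {
        'headline': 'Lorem Ipsum Dolor Sit Amet',
        'author': 'Attila Toth',
        'probability': 0.9717171744215637,
    },
}
print(summarize_article(record))
```

Using `.get()` throughout keeps the helper from raising a `KeyError` when a field is absent from a particular page's result.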
Check out the middleware on GitHub or learn more about Zyte Automatic Extraction (formerly AutoExtract)!