We are excited to share some good news for Puppeteer users who are looking for an easy-to-integrate anti-ban solution for extracting data from JavaScript-heavy websites. We are happy to announce the launch of a new library, Zyte SmartProxy Puppeteer, that's going to make your life so much easier!
At Zyte, the developer experience matters the most, and we wanted to give you a smooth experience of scraping dynamic websites with seamless integration between Puppeteer and our smart rotating proxy service, Zyte Smart Proxy Manager.
I’m going to give you a quick explanation of how to get started. It’s super easy!
The Zyte SmartProxy Puppeteer library is a client library built on top of Puppeteer, a high-level API to control headless Chrome, and is written to work seamlessly with Zyte Smart Proxy Manager. With this library, you can make the most of Puppeteer's headless browser capabilities and manage bans by unlocking a powerful proxy management tool, Zyte Smart Proxy Manager, in your web scraping projects.
In this tutorial, I will demonstrate how your Puppeteer web scraping script will have superhero capabilities to
In order to run the script used in the tutorial, please make sure that you are ready with the following:
`/usr/local/bin` is in your `$PATH` environment variable.
Installing the Zyte SmartProxy Puppeteer library is super easy. Just run the following command using npm, and it will automatically install the native Puppeteer library along with a stable, up-to-date Chromium version that is compatible with the version of Puppeteer being installed.
$ npm install zyte-smartproxy-puppeteer
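As a quick sanity check after installing, a few lines of Node.js can confirm that the package resolves from your project folder. This is only a minimal sketch: `isInstalled` is a hypothetical helper name used for illustration, not part of the library.

```javascript
// Hypothetical helper: returns true if a package can be resolved from this folder.
function isInstalled(packageName) {
  try {
    require.resolve(packageName);
    return true;
  } catch (err) {
    // require.resolve throws when the module cannot be found.
    return false;
  }
}

console.log(
  'zyte-smartproxy-puppeteer installed:',
  isInstalled('zyte-smartproxy-puppeteer')
);
```

Save it next to your `package.json` and run it with `node`; if it prints `false`, rerun the install command from the same folder.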
Awesome, now that you are all set and configured, let’s get the show started!
To demonstrate the integration between Zyte Smart Proxy Manager and our headless browser library for Puppeteer, we will write a script that makes our headless browser take a screenshot of the ‘Web Scraping Sandbox’. This sandbox was developed by Zyte for demonstration purposes. Feel free to play around with it and experiment with new web scraping techniques.
Let’s start our Zyte SmartProxy Puppeteer tutorial with this basic example.
Create a new file named `sample.js` and open it in your favorite code editor.
const puppeteer = require('zyte-smartproxy-puppeteer');
Launch the browser with the `headless` and `spm_apikey` parameters, and set `ignoreHTTPSErrors` to `true`.
Set the `headless` parameter to `false`. This means that it will open the Chromium GUI.
Set your API key in `spm_apikey`, as mentioned in the prerequisites above.
const browser = await puppeteer.launch({
  spm_apikey: '<<enter your API key here>>',
  ignoreHTTPSErrors: true,
  headless: false,
});
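Hard-coding the key works for a quick test, but it is easy to commit by accident. Here is a minimal sketch of reading it from an environment variable instead. The variable name `ZYTE_SPM_APIKEY` and the helper `buildLaunchOptions` are assumptions made for this example; the library does not define them.

```javascript
// Build the options object passed to puppeteer.launch().
// ZYTE_SPM_APIKEY is a hypothetical variable name chosen for this sketch.
function buildLaunchOptions(env = process.env) {
  const apiKey = env.ZYTE_SPM_APIKEY;
  if (!apiKey) {
    throw new Error('Set ZYTE_SPM_APIKEY to your Smart Proxy Manager API key');
  }
  return {
    spm_apikey: apiKey,
    ignoreHTTPSErrors: true,
    headless: false, // opens the Chromium GUI, as in the tutorial
  };
}

// Usage (inside an async function):
//   const browser = await puppeteer.launch(buildLaunchOptions());
```

Failing early with a clear message is a design choice here: a missing key is easier to diagnose at startup than as a rejected proxy request later.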
console.log('Before new page');
const page = await browser.newPage();
Open the page with the `goto` function. If the server responds to the request, it will open the web scraping sandbox; otherwise, the error is caught and printed to the logs.
try {
  await page.goto('https://toscrape.com/', {timeout: 180000});
} catch(err) {
  console.log(err);
}
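Requests routed through a rotating proxy can occasionally time out, so it may be worth retrying the navigation before giving up. The sketch below wraps `page.goto` in a small retry loop. `gotoWithRetry` and its `attempts` option are hypothetical names, and the default of three attempts is an arbitrary choice.

```javascript
// Hypothetical helper: retry page.goto() a few times before rethrowing.
async function gotoWithRetry(page, url, { attempts = 3, timeout = 180000 } = {}) {
  let lastError;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      // Resolves with whatever page.goto() resolves with on success.
      return await page.goto(url, { timeout });
    } catch (err) {
      lastError = err;
      console.log(`Attempt ${attempt} of ${attempts} failed: ${err.message}`);
    }
  }
  // Every attempt failed: surface the last error to the caller.
  throw lastError;
}

// Usage (inside an async function):
//   await gotoWithRetry(page, 'https://toscrape.com/');
```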
Take a screenshot with the `screenshot` function. In the `path` argument, give the file path where the screenshot should be saved. The relative path used in this script saves it to your current working directory, which is the folder containing `sample.js` if you run the script from there.
await page.screenshot({path: 'screenshot.png'});
await browser.close();
const puppeteer = require('zyte-smartproxy-puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    spm_apikey: '<<enter your API key here>>',
    ignoreHTTPSErrors: true,
    headless: false,
  });
  console.log('Before new page');
  const page = await browser.newPage();

  console.log('Opening page ...');
  try {
    await page.goto('https://toscrape.com/', {timeout: 180000});
  } catch(err) {
    console.log(err);
  }

  console.log('Taking a screenshot ...');
  await page.screenshot({path: 'screenshot.png'});
  await browser.close();
})();
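One thing the script above does not guard against is an exception between `launch` and `close`, which would leave the Chromium process running. A possible variant wraps the work in `try`/`finally`. `captureScreenshot` is a hypothetical helper, and it takes the library module as an argument so that the same function can be exercised with a stub. Unlike the tutorial script, this sketch lets a failed `goto` propagate to the caller instead of logging it and continuing.

```javascript
// Hypothetical helper: take a screenshot and always close the browser.
// `puppeteer` is the module returned by require('zyte-smartproxy-puppeteer').
async function captureScreenshot(puppeteer, launchOptions, url, path) {
  const browser = await puppeteer.launch(launchOptions);
  try {
    const page = await browser.newPage();
    await page.goto(url, { timeout: 180000 });
    await page.screenshot({ path });
    return path;
  } finally {
    // Runs whether the steps above succeeded or threw.
    await browser.close();
  }
}

// Usage:
//   const puppeteer = require('zyte-smartproxy-puppeteer');
//   captureScreenshot(
//     puppeteer,
//     { spm_apikey: '<<enter your API key here>>', ignoreHTTPSErrors: true, headless: false },
//     'https://toscrape.com/',
//     'screenshot.png'
//   ).catch(console.error);
```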
Execute the script on the command line.
$ node sample.js
If your script runs successfully, you should see the script's log messages (`Before new page`, `Opening page ...`, and `Taking a screenshot ...`) in your terminal, and `screenshot.png` in your project folder.
In addition to easy integration and management of Puppeteer's headless capabilities with Zyte Smart Proxy Manager, our library provides additional functionality such as:
Use the `block_ads` argument and set it to `true`, and the library will block ads defined by `block_list` using the `@cliqz/adblocker-puppeteer` package.
Use the `static_bypass` argument and set it to `true`, and the library will skip the proxy for static assets defined by `static_bypass_regex`, or pass `false` to use the proxy.
Important note: `block_ads` and `static_bypass` are enabled by default. Some websites may not work with `block_ads` and `static_bypass` enabled. Try disabling them if you encounter any issues. To know more about these functionalities, read here.
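If a site misbehaves with the defaults, both features can be switched off at launch. This is a minimal sketch, assuming `block_ads` and `static_bypass` are passed to `launch` alongside `spm_apikey` as described above. `withFeaturesDisabled` is a hypothetical helper name.

```javascript
// Hypothetical helper: copy launch options with ad blocking and
// static-asset bypass turned off, leaving the input object untouched.
function withFeaturesDisabled(launchOptions) {
  return { ...launchOptions, block_ads: false, static_bypass: false };
}

const options = withFeaturesDisabled({
  spm_apikey: '<<enter your API key here>>',
  ignoreHTTPSErrors: true,
});
console.log(options);

// Usage (inside an async function):
//   const browser = await puppeteer.launch(options);
```

Disabling one feature at a time is usually a quicker way to find out which of the two a given site objects to.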
Using libraries like Zyte SmartProxy Puppeteer can make it so much easier to work with dynamic websites and manage bans and proxies all together in a single piece of code. Later this month, on the 22nd of June, I will be hosting a webinar to demonstrate the true power of this new integration and show you how to make the most out of it. So be sure to join me!
This webinar will be a good opportunity for you to interact with our web scraping experts and clarify your doubts on the fly while doing hands-on integration of these libraries.
If you are new to headless browsers, Puppeteer, and Zyte Smart Proxy Manager, here are a few links to learn more about these topics. I hope you find them useful.