It is well known that many websites serve different content depending on the region they are accessed from. For example, some retailer sites show only the products available in the visitor's region (US, Europe, and so on).
Although this can be quite convenient for the site's customers, it can be a pain for developers writing a spider for the site and running it from their local machines.
There is a simple way to proxy all requests as if they came from another server. All you need is SSH access to that server; there is no HTTP proxy to install. The trick is a program called tsocks.
Here’s how to do it on Ubuntu, though this recipe should be easy to extend to other Linux distros.
First, install tsocks with:
$ sudo apt-get install tsocks
Then add this content to ~/.tsocksrc (update: recent versions store the settings in ~/.tsocks.conf, but the location may vary across distributions):
server = 127.0.0.1
server_type = 5
server_port = 9999
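The tsocks config file supports a few more directives that can be handy here; in particular, `local` declares address ranges that should be connected to directly rather than through the proxy. A fuller sketch of the same file (the 192.168.0.0 range is only an example; adjust it to your own LAN):

```
# SOCKS server: the local end of the ssh -D tunnel
server = 127.0.0.1
server_type = 5      # SOCKS v5
server_port = 9999

# Example: traffic to the local network bypasses the proxy
local = 192.168.0.0/255.255.255.0
```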
Next, SSH to the remote server you want to use:
$ ssh -D 9999 some_remote_server
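The -D flag asks SSH to open a local SOCKS proxy on the given port and forward everything through the remote server. Before launching the spider, it can be worth checking that the tunnel is actually listening. A minimal bash sketch, assuming the port 9999 used above:

```shell
# Probe 127.0.0.1:9999 via bash's /dev/tcp; reports the tunnel state either way.
if (exec 3<>/dev/tcp/127.0.0.1/9999) 2>/dev/null; then
    echo "SOCKS tunnel is up"
else
    echo "SOCKS tunnel is down - start the ssh -D session first"
fi
```

Note that /dev/tcp is a bash feature, so run this with bash rather than a plain POSIX shell.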
And finally, in another terminal (leaving the SSH session open), run Scrapy prefixed with the tsocks command, like this:
$ tsocks scrapy crawl myspider
That’s all. Your spider will run on your local machine but proxy all its traffic through the remote server, with no need to change any Scrapy settings or configuration.
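Under the hood, tsocks uses LD_PRELOAD to intercept the process's network calls, so the same prefix works for any dynamically linked program, not just Scrapy. A quick sketch (curl and the target URL are placeholders for illustration; the guard keeps it from failing on machines where tsocks isn't installed):

```shell
# tsocks wraps any command via LD_PRELOAD; Scrapy is not a special case.
if command -v tsocks >/dev/null 2>&1; then
    # e.g. fetch a page through the tunnel with curl (example URL)
    tsocks curl -s -o /dev/null https://example.com \
        && echo "request proxied" \
        || echo "request failed - is the tunnel up?"
else
    echo "tsocks is not installed"
fi
```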