Anonymous scraping with Scrapy, Tor and Privoxy

If you need to scrape web data anonymously, this guide is a quick start to getting up and running with Scrapy, Tor and Privoxy.

Operating System: Ubuntu 17.10 Artful Aardvark

Scrapy

An open source and collaborative framework for extracting the data you need from websites in a fast, simple, yet extensible way.

Install Scrapy

pip install Scrapy

Create an example project

scrapy startproject example

Create our spider

cd example/example
scrapy genspider -t crawl scrapyorg scrapy.org

Tor

Tor protects you by bouncing your communications around a distributed network of relays run by volunteers all around the world.

Install Tor

apt install tor

Start Tor

service tor start

Update your Scrapy project settings: edit settings.py and add

ROBOTSTXT_OBEY = False
USER_AGENT_LIST = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7',
    'Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:16.0) Gecko/16.0 Firefox/16.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/534.55.3 (KHTML, like Gecko) Version/5.1.3 Safari/534.53.10'
]
DOWNLOADER_MIDDLEWARES = {
     'example.middlewares.RandomUserAgentMiddleware': 400,
     'example.middlewares.ProxyMiddleware': 410,
     'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None
}
HTTP_PROXY = 'http://127.0.0.1:8118'

Then add the following middleware to middlewares.py

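import random


class RandomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        # scrapy.conf is gone from recent Scrapy releases; the running
        # spider exposes the project settings as spider.settings instead.
        user_agents = spider.settings.getlist('USER_AGENT_LIST')
        if user_agents:
            request.headers.setdefault('User-Agent', random.choice(user_agents))


class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # Send every request through the HTTP proxy (Privoxy) from settings.py.
        request.meta['proxy'] = spider.settings.get('HTTP_PROXY')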

These settings tell Scrapy to ignore robots.txt rules, to disable the built-in user-agent middleware in favour of our own, and to send requests through the proxy on port 8118 (Privoxy).
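
To make the numbers in DOWNLOADER_MIDDLEWARES concrete, here is a minimal sketch in plain Python (not Scrapy's own code) of how they are interpreted: entries set to None are disabled, and process_request is called on the remaining middleware in ascending order. The helper name enabled_in_order is hypothetical, and Scrapy's built-in default middleware are left out for brevity.

```python
# Illustration only: mimics how Scrapy reads DOWNLOADER_MIDDLEWARES.
# enabled_in_order is a hypothetical helper, not part of Scrapy.

DOWNLOADER_MIDDLEWARES = {
    'example.middlewares.RandomUserAgentMiddleware': 400,
    'example.middlewares.ProxyMiddleware': 410,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}


def enabled_in_order(middlewares):
    """Return enabled middleware paths, lowest order number first."""
    enabled = {
        path: order
        for path, order in middlewares.items()
        if order is not None  # None switches a middleware off
    }
    return sorted(enabled, key=enabled.get)


for path in enabled_in_order(DOWNLOADER_MIDDLEWARES):
    print(path)
```

With the settings from this guide, the random user agent is chosen first (400) and the proxy is attached second (410); the built-in user-agent middleware does not run at all.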

Privoxy

Privoxy is a non-caching web proxy with advanced filtering capabilities for enhancing privacy.

Install Privoxy

apt install privoxy

Update Privoxy to forward to port 9050 (Tor): edit /etc/privoxy/config and add or uncomment

forward-socks5t / 127.0.0.1:9050 .

Restart Privoxy

service privoxy restart
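
Before running the spider, it can help to confirm that both services are actually listening. The following is a minimal sketch using only the Python standard library; the helper names port_is_open and check_services are assumptions for illustration, and the ports are the ones used in this guide (9050 for Tor, 8118 for Privoxy).

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


def check_services():
    """Print whether Tor and Privoxy accept connections on localhost."""
    # Ports taken from this guide: Tor on 9050, Privoxy on 8118.
    for name, port in (('Tor', 9050), ('Privoxy', 8118)):
        state = 'listening' if port_is_open('127.0.0.1', port) else 'not reachable'
        print('{} (port {}): {}'.format(name, port, state))
```

Calling check_services() after the restart should report both services as listening; if one is not reachable, check that the service started and that its port matches the configuration above.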

Finally, run our Scrapy spider through Tor, forwarded via Privoxy

scrapy crawl scrapyorg

You should see

DEBUG: Crawled (200) <GET https://scrapy.org/>

The spider is now scraping with a random user agent (browser) from an IP address different from your server's.
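
One way to check the IP address outside Scrapy is to request an address-echo service through Privoxy. The sketch below uses only the Python standard library. It assumes the Tor Project's check service at https://check.torproject.org/api/ip is reachable and replies with JSON containing IsTor and IP fields; treat the endpoint and field names as assumptions to verify. The function names are hypothetical.

```python
import json
import urllib.request

PRIVOXY = 'http://127.0.0.1:8118'  # same value as HTTP_PROXY in settings.py
CHECK_URL = 'https://check.torproject.org/api/ip'  # assumed endpoint


def parse_check(body):
    """Extract (is_tor, ip) from the check service's JSON reply."""
    data = json.loads(body)
    return bool(data.get('IsTor')), data.get('IP')


def fetch_exit_ip(proxy=PRIVOXY, url=CHECK_URL, timeout=30):
    """Fetch url through the given HTTP proxy and return (is_tor, ip)."""
    handler = urllib.request.ProxyHandler({'http': proxy, 'https': proxy})
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=timeout) as response:
        return parse_check(response.read().decode('utf-8'))
```

With Tor and Privoxy running, fetch_exit_ip() should return an IP address that differs from the server's own.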

There are many more options for Tor, Scrapy and Privoxy, but this should be enough to get you started.

