If you need to scrape web data anonymously, this guide is a quick start to getting up and running with Scrapy, Tor and Privoxy.

Operating System: Ubuntu 17.10 Artful Aardvark


Scrapy is an open source and collaborative framework for extracting the data you need from websites in a fast, simple, yet extensible way.

Install Scrapy

pip install Scrapy

Create an example project

scrapy startproject example

Create our spider

cd example/example
scrapy genspider -t crawl scrapyorg scrapy.org


Tor protects you by bouncing your communications around a distributed network of relays run by volunteers all around the world.

Install Tor

apt install tor

Start Tor

service tor start

Update your Scrapy project settings: edit settings.py and add

ROBOTSTXT_OBEY = False

USER_AGENT_LIST = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7',
    'Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:16.0) Gecko/16.0 Firefox/16.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/534.55.3 (KHTML, like Gecko) Version/5.1.3 Safari/534.53.10',
]

HTTP_PROXY = 'http://127.0.0.1:8118'

DOWNLOADER_MIDDLEWARES = {
    'example.middlewares.RandomUserAgentMiddleware': 400,
    'example.middlewares.ProxyMiddleware': 410,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}

Then add the following middleware to middlewares.py

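import random


class RandomUserAgentMiddleware(object):
    """Set a random User-Agent from USER_AGENT_LIST on each request."""

    def process_request(self, request, spider):
        # Read settings from the spider rather than the deprecated scrapy.conf module
        user_agents = spider.settings.getlist('USER_AGENT_LIST')
        if user_agents:
            request.headers.setdefault('User-Agent', random.choice(user_agents))


class ProxyMiddleware(object):
    """Route each request through the proxy named in HTTP_PROXY (Privoxy)."""

    def process_request(self, request, spider):
        request.meta['proxy'] = spider.settings.get('HTTP_PROXY')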

The settings above tell Scrapy to ignore robots.txt rules, to disable the built-in user-agent middleware in favour of our own, and to send every request through the proxy on port 8118 (Privoxy).
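To see what the two middlewares do to a request without running a crawl, here is a minimal sketch that applies the same logic to a stand-in object. FakeRequest and apply_middlewares are hypothetical helpers, not part of Scrapy, and the sample values are assumptions that mirror the settings in this guide.

```python
import random

# Sample values mirroring this guide's settings (assumed, for illustration only).
USER_AGENT_LIST = [
    'Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:16.0) Gecko/16.0 Firefox/16.0',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) '
    'Chrome/16.0.912.36 Safari/535.7',
]
HTTP_PROXY = 'http://127.0.0.1:8118'  # Privoxy on port 8118


class FakeRequest(object):
    """Hypothetical stand-in for a Scrapy request: only headers and meta."""

    def __init__(self):
        self.headers = {}
        self.meta = {}


def apply_middlewares(request, user_agents, proxy, rng=random):
    """Mimic RandomUserAgentMiddleware followed by ProxyMiddleware."""
    if user_agents:
        # setdefault leaves a User-Agent alone if one was already set
        request.headers.setdefault('User-Agent', rng.choice(user_agents))
    request.meta['proxy'] = proxy
    return request


request = apply_middlewares(FakeRequest(), USER_AGENT_LIST, HTTP_PROXY)
print(request.headers['User-Agent'])
print(request.meta['proxy'])
```

Because the middleware uses setdefault, a spider that sets its own User-Agent header on a request keeps it; only requests without one get a random entry from the list.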


Privoxy is a non-caching web proxy with advanced filtering capabilities for enhancing privacy.

Install Privoxy

apt install privoxy

Update Privoxy to forward to port 9050 (Tor): edit /etc/privoxy/config and add or uncomment

forward-socks5t / 127.0.0.1:9050 .

Restart Privoxy

service privoxy restart
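Before running the spider it can help to confirm that both services are actually listening. The sketch below is one way to do that, assuming the default local ports used in this guide (9050 for Tor, 8118 for Privoxy); port_is_open is a hypothetical helper name.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections and timeouts alike
        return False


# Assumed default local ports: Tor's SOCKS listener and Privoxy's HTTP proxy.
for name, port in (('Tor', 9050), ('Privoxy', 8118)):
    state = 'listening' if port_is_open('127.0.0.1', port) else 'not reachable'
    print('{0} on port {1}: {2}'.format(name, port, state))
```

If either line reports "not reachable", check that the service started and that its configured port matches the one used here.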

Finally, run our Scrapy spider through Tor, forwarded via Privoxy

scrapy crawl scrapyorg

You should see

DEBUG: Crawled (200) <GET>

The spider is now scraping with a random user agent (browser) and from a different IP address than your server's.
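One way to confirm the IP address really differs is to ask a remote service which address it sees when the request goes through Privoxy. The following is a rough sketch using only the standard library. It assumes Privoxy listens on 127.0.0.1:8118 and that https://check.torproject.org/api/ip returns JSON with "IP" and "IsTor" fields; the function names are hypothetical.

```python
import json
import urllib.request

PRIVOXY_URL = 'http://127.0.0.1:8118'  # assumed Privoxy listen address
# Assumed to answer with JSON such as {"IsTor": true, "IP": "..."}
CHECK_URL = 'https://check.torproject.org/api/ip'


def build_proxy_opener(proxy_url):
    """Build a urllib opener that sends HTTP and HTTPS traffic via proxy_url."""
    handler = urllib.request.ProxyHandler({'http': proxy_url, 'https': proxy_url})
    return urllib.request.build_opener(handler)


def parse_check_response(body):
    """Extract (ip, is_tor) from the JSON body of the check service."""
    data = json.loads(body)
    return data.get('IP'), bool(data.get('IsTor'))


def exit_ip_via_proxy(proxy_url=PRIVOXY_URL, check_url=CHECK_URL, timeout=30):
    """Fetch the check URL through the proxy and return (ip, is_tor)."""
    opener = build_proxy_opener(proxy_url)
    with opener.open(check_url, timeout=timeout) as response:
        return parse_check_response(response.read().decode('utf-8'))
```

Calling exit_ip_via_proxy() on the server should return an address other than the server's own, with the second value True if the request left through Tor.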

There are many different options for Tor, Scrapy and Privoxy, but this should be enough to get you on your way.

