Thursday, February 11, 2016

Tor IP changing and web scraping

Anonymous web scraping

Many of us who scrape web pages, be it for fun, data, love or something else, are concerned about anonymity. Well, not anonymity per se; we just don't want our IP blacklisted.

Many of us choose Tor and its network to achieve that goal. In this post I am going to share my experience and observations using Tor + Privoxy for Python-driven web scraping.


First things first, prepare your system

There are a lot of articles about how to install and set up Tor and Privoxy locally. My personal favorites are:

Install Stem (Python controller library for Tor) to manage Tor from Python.

Alternatively, you can refer to my own guide: A step-by-step guide how to use Python with Tor and Privoxy.


The tutorials mentioned above require you to set a 'HashedControlPassword'. It is possible, though not safe, to use Tor without a password. Please refer to A step-by-step guide how to use Tor without Authentication to see how to do it.


Changing IP address
Using Stem is simple. Use the following piece of code to change the IP:

from stem import Signal
from stem.control import Controller


def set_new_ip():
    """Change IP using TOR"""
    with Controller.from_port(port=9051) as controller:
        controller.authenticate(password='tor_password')
        controller.signal(Signal.NEWNYM)


Requesting data
And this is how you can send requests through Tor. Again, it's simple: just define the proxy settings so that all your traffic is routed via Privoxy and Tor. The following request returns the Tor IP address you are currently using. Note that we are using http://icanhazip.com here, which is a great service for determining your current IP.

NOTE Your host, port and password may be different.

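import requests

# Privoxy (configured to forward to Tor) listening on its default address
local_proxy = 'http://127.0.0.1:8118'
http_proxy = {
    'http': local_proxy,
    'https': local_proxy
}

response = requests.get(
    url='http://icanhazip.com/',
    proxies=http_proxy
)
# The service answers with the IP as plain text followed by a newline
current_ip = response.text.strip()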

It's also possible to leave out Privoxy entirely by routing your requests directly via Tor's SOCKS proxy. Be aware, though, that "Applications that do DNS resolves themselves may leak information".

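import requests

# 'socks5://' resolves hostnames locally; 'socks5h://' would let the
# proxy (Tor) resolve them instead
local_proxy = 'socks5://localhost:9050'
socks_proxy = {
    'http': local_proxy,
    'https': local_proxy
}

response = requests.get(
    url='http://icanhazip.com/',
    proxies=socks_proxy
)
# The service answers with the IP as plain text followed by a newline
current_ip = response.text.strip()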
NOTE You have to add support for the SOCKS protocol to Requests.
pip install -U requests[socks]
Refer to the following StackOverflow question for more details: How to make python Requests work via socks proxy
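
Regarding the DNS warning quoted above: Requests also accepts the socks5h scheme, which asks the SOCKS proxy (here Tor) to resolve hostnames instead of resolving them on your machine. Below is a minimal sketch of how that could look. The helper names tor_socks_proxies and current_tor_ip are made up for this example, and the host and port are assumed to be Tor's defaults.

```python
def tor_socks_proxies(host='localhost', port=9050, remote_dns=True):
    """Build a Requests-style proxies dict for Tor's SOCKS port.

    With remote_dns=True the 'socks5h' scheme is used, so hostnames are
    resolved by the proxy (Tor) rather than by the local machine.
    """
    scheme = 'socks5h' if remote_dns else 'socks5'
    proxy_url = '{}://{}:{}'.format(scheme, host, port)
    return {'http': proxy_url, 'https': proxy_url}


def current_tor_ip(proxies, timeout=30):
    """Return the exit IP as reported by icanhazip.com.

    Needs Requests with SOCKS support (pip install -U requests[socks]).
    """
    import requests  # imported lazily so the helper above has no dependencies

    response = requests.get(
        url='http://icanhazip.com/',
        proxies=proxies,
        timeout=timeout
    )
    return response.text.strip()
```

With Tor running locally, current_tor_ip(tor_socks_proxies()) should return the address of the exit node currently in use.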

Actual web scraping

The code samples above are the foundation for my web scraping pipeline, ScrapeMeAgain. As IP address switching is a frequent operation, TorIpChanger provides a couple of handy helper functions for managing IP reuse. The reason for this is simple: Tor will quite often change the IP address to one which you have already used. In some cases it will even return the IP address you are currently using as the new one! At least with the solution I am using.

This behavior is also mentioned in the Stem FAQ:

An important thing to note is that a new circuit does not necessarily mean a new IP address. Paths are randomly selected based on heuristics like speed and stability. There are only so many large exits in the Tor network, so it's not uncommon to reuse an exit you have had previously.

Because of this I came up with two simple web scraping rules:

  • Change the IP address after 50 requests
  • Allow a given IP address to be reused only after 10 different IP addresses have been used
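
The two rules could be sketched roughly as follows. This is not the actual TorIpChanger code, just an illustration under my own assumptions: IpReusePolicy and its method names are hypothetical, request_new_circuit stands for something like set_new_ip() above, and fetch_ip stands for a call that returns the current exit IP as a string.

```python
from collections import deque


class IpReusePolicy:
    """Track recently used IPs and decide when to rotate (hypothetical helper)."""

    def __init__(self, requests_per_ip=50, reuse_threshold=10):
        self.requests_per_ip = requests_per_ip
        # Remember only the last `reuse_threshold` IPs; an IP becomes
        # reusable once that many newer IPs have pushed it out.
        self.recent_ips = deque(maxlen=reuse_threshold)
        self.request_count = 0

    def record_request(self):
        """Call once per scraping request made with the current IP."""
        self.request_count += 1

    def should_rotate(self):
        """True once the current IP has served its quota of requests."""
        return self.request_count >= self.requests_per_ip

    def rotate(self, request_new_circuit, fetch_ip, max_attempts=20):
        """Ask for new circuits until the exit IP is not a recently used one."""
        for _ in range(max_attempts):
            request_new_circuit()
            ip = fetch_ip()
            if ip not in self.recent_ips:
                self.recent_ips.append(ip)
                self.request_count = 0
                return ip
        raise RuntimeError('no usable IP after {} attempts'.format(max_attempts))
```

In a scraping loop you would call record_request() after every request and rotate() whenever should_rotate() returns True. A real implementation would probably also pause between attempts, because Tor does not build a new circuit instantly.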

Log analysis and geolocation

Logging what is going on is absolutely essential for many reasons, debugging being one of the most important. In fact, inspecting logs helped me discover Tor's IP address reuse behavior. Here are some numbers from one of my web scraping sessions (with the above rules applied):

  • Total requests: 213 565
  • Total IP changes: 4 239
  • Unique IP addresses used: 459
  • Only 5 nodes weren't reused
  • Each IP address was reused 10 times on average
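
As a quick sanity check, the averages can be derived from the totals listed above. The snippet uses only those three numbers (and assumes Python 3 division); the variable names are mine.

```python
total_requests = 213565
ip_changes = 4239
unique_ips = 459

# Average number of requests made between two IP changes
requests_per_change = total_requests / ip_changes

# Average number of times each unique IP address was handed out
uses_per_ip = ip_changes / unique_ips

print(round(requests_per_change, 1))  # 50.4
print(round(uses_per_ip, 1))          # 9.2
```

Roughly 50 requests per IP change matches the first rule, and each unique IP being used about 9 times is in line with the "10 times on average" figure.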

As someone geo-infected (I studied Geoinformatics), it was only natural for me to put the logged IPs on a map. So I geolocated these IPs with db-ip.com, and the results are very interesting.

Top 5 most frequently used locations

  1. Roubaix, France (used 107 times)
  2. Toulouse, France (used 88 times)
  3. Steinsel, Luxembourg (used 81 times)
  4. Hunenberg, Switzerland (used 69 times)
  5. Amsterdam, Netherlands (used 65 times)

Top 5 most frequently used countries

  1. France (used 721 times)
  2. Germany (used 640 times)
  3. Netherlands (used 592 times)
  4. United States (used 575 times)
  5. Sweden (used 261 times)

First 5 used Tor nodes

  1. Ukrainka, Ukraine
  2. Cambridge, United States
  3. Amsterdam, Netherlands
  4. Nijmegen, Netherlands
  5. Budapest, Hungary

Tor nodes were located in 38 different countries. 333 of the 459 unique IPs were geolocated in Europe and 93 in the United States. Almost all Tor nodes (all but 4) were in the Northern Hemisphere.
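
Rankings like the ones above can be produced with collections.Counter once every logged IP has a location attached. The sketch below uses a made-up sample, not my real log: the record layout (ip, city, country) is an assumption, and the addresses come from the reserved documentation ranges.

```python
from collections import Counter

# Made-up sample: one (ip, city, country) tuple per IP change, in log order
change_log = [
    ('192.0.2.10', 'Roubaix', 'France'),
    ('198.51.100.7', 'Amsterdam', 'Netherlands'),
    ('192.0.2.10', 'Roubaix', 'France'),
    ('203.0.113.5', 'Toulouse', 'France'),
]

# How often each city and each country was used
by_city = Counter((city, country) for _, city, country in change_log)
by_country = Counter(country for _, _, country in change_log)

# Number of distinct IP addresses seen
unique_ips = {ip for ip, _, _ in change_log}

print(by_city.most_common(1))     # [(('Roubaix', 'France'), 2)]
print(by_country.most_common(1))  # [('France', 3)]
print(len(unique_ips))            # 3
```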

EDIT (27.01.2017)
A follow-up to this post is a summary of how I used Tor in 2016: A year with Tor.