Crawling the TOR network – Challenge Accepted!

Posted on January 13, 2020 by Ran Geva

The following short story portrays the surprising technological and logical challenges we faced while developing our dark web crawling technology.

Back in 2017, when I first had the idea of adding content from the TOR network to our data repository, I thought it would be quite straightforward. The idea was to leverage our existing crawling technology, which downloads and automatically structures content (extracting the text, author name, comments, dates etc.) from the open web, and simply use a proxy to connect to the TOR network. I thought we would be able to shed light on the criminal activity going on in the darkest corners of the web. Did it work? Well, kinda… actually, not really.

We were able to crawl the TOR network and download some data, and we even got a lot of traction. However, when push came to shove, our clients didn’t find a lot of value in the content we found. The question was: are we doing something wrong? Or maybe the TOR network is nothing but a fad? Well, we were doing something wrong, and down the rabbit hole we went.

Challenge #1 – Finding the hidden gems

Nobody does SEO on the dark web. Illicit services are hard to find and are usually shared on a need-to-know basis. The solution wasn’t that complicated, since we already had a seed list of sites. We created a discovery service that follows external links found in anchors and picks up domains mentioned in the text. It was unbelievably efficient.
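
To illustrate the idea, here is a minimal sketch of such a discovery step using only the Python standard library. It is an illustration under assumptions, not the actual service: the names `discover_onion_domains` and `_LinkAndTextCollector` are hypothetical, and the pattern only recognizes 16-character (v2) and 56-character (v3) onion addresses.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse

# Onion addresses are base32 (a-z, 2-7): 56 characters for v3, 16 for the older v2.
ONION_RE = re.compile(r"\b(?:[a-z2-7]{56}|[a-z2-7]{16})\.onion\b", re.IGNORECASE)


class _LinkAndTextCollector(HTMLParser):
    """Collects href values from <a> tags and all visible text chunks."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

    def handle_data(self, data):
        self.text_parts.append(data)


def discover_onion_domains(html, current_domain):
    """Return onion domains found in anchors or plain text, excluding the page's own domain."""
    parser = _LinkAndTextCollector()
    parser.feed(html)

    # Candidates: the host part of every link, plus the page text as one chunk.
    candidates = [urlparse(href).hostname or "" for href in parser.hrefs]
    candidates.append(" ".join(parser.text_parts))

    found = set()
    for chunk in candidates:
        for match in ONION_RE.finditer(chunk):
            found.add(match.group(0).lower())

    found.discard(current_domain.lower())
    return found
```

Each newly discovered domain would then be added to the seed list and crawled in turn, so the list of known sites grows with every pass.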

The next problem we faced was that many domains were down, went down, or quickly changed mirrors. The solution was to group mirrors together by the site’s name. It is an ongoing process of constant discovery, as sites don’t live long on the dark web.
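
Grouping mirrors by site name could be sketched roughly as follows. This assumes the site name is taken from something like the page title, which the article does not specify; `normalize_site_name`, `group_mirrors` and `live_mirrors` are hypothetical helper names.

```python
import re
from collections import defaultdict


def normalize_site_name(title):
    """Lowercase the name and reduce punctuation to single spaces so variants compare equal."""
    cleaned = re.sub(r"[^a-z0-9]+", " ", title.lower())
    return " ".join(cleaned.split())


def group_mirrors(observations):
    """Group (domain, site_title, is_up) observations by normalized site name.

    Returns {site_name: {domain: is_up}}; a later observation of a domain overwrites an earlier one.
    """
    groups = defaultdict(dict)
    for domain, title, is_up in observations:
        groups[normalize_site_name(title)][domain] = is_up
    return dict(groups)


def live_mirrors(groups, site_name):
    """Return the domains of a site that were up when last seen, sorted for stable output."""
    mirrors = groups.get(normalize_site_name(site_name), {})
    return sorted(domain for domain, is_up in mirrors.items() if is_up)
```

With this structure, a crawler that finds one mirror down can fall back to any other live domain registered under the same site name.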

The next issue was that all the “good stuff” (and by good I mean really bad) was on sites that require a login.

Challenge #2 – You shall not pass!

While examining the number of pages we found per domain, we noticed that many domains had only one page available. Visiting those domains revealed the reason: that single page was usually a login page. Signing up usually uncovered a treasure trove of activity. Jackpot!
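
The observation above suggests a simple heuristic, sketched here under assumptions: flag every domain where the crawl found exactly one page and that page contains a password field. The function names are hypothetical, and a real login page may of course look different (for example, a CAPTCHA gate before the form).

```python
from collections import Counter
from html.parser import HTMLParser


class _PasswordFieldFinder(HTMLParser):
    """Sets a flag when the page contains an <input type="password"> element."""

    def __init__(self):
        super().__init__()
        self.has_password_field = False

    def handle_starttag(self, tag, attrs):
        if tag == "input" and (dict(attrs).get("type") or "").lower() == "password":
            self.has_password_field = True


def looks_like_login_page(html):
    finder = _PasswordFieldFinder()
    finder.feed(html)
    return finder.has_password_field


def login_gated_candidates(pages):
    """pages: list of (domain, url, html). Returns single-page domains whose page looks like a login."""
    pages_per_domain = Counter(domain for domain, _url, _html in pages)
    return sorted(
        domain
        for domain, _url, html in pages
        if pages_per_domain[domain] == 1 and looks_like_login_page(html)
    )
```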

Alas, the technology we use on the open web doesn’t include automatic form authentication and CAPTCHA bypassing. I’m grateful to my amazingly talented team: after a few weeks of iterations and experiments, we were able to clear that hurdle as well.

It’s important to note that JavaScript is usually disabled on the TOR network, as it can be used to fingerprint your browser. On the one hand, that made crawling a bit easier. On the other hand, CAPTCHA solving can be unique to each site, since each site develops its own solution and doesn’t necessarily use out-of-the-box services.

Another issue is that not all dark services let you create an account on your own; some are invite-only. In these cases the process is long and far from trivial, and it involves lengthy engagement on multiple illicit platforms.

Challenge #3 – Don’t look at me, I’m just like you

Some sites that either require an invitation or are hard to get into also use tracking methods to find out whether they are being scanned. You don’t want to get banned after investing a lot to get in!

They can’t detect us by our IP address, since what protects them from identification also protects our crawlers. However, since we now use avatars to crawl their websites, we need to act like a human.

Unfortunately, that meant we had to develop a new, less efficient kind of crawler, one that uses a headless browser, loads all the resources (CSS, images etc.) and doesn’t visit the website 24/7. Stupid humans! To make the most of what we had, we optimized it to visit only the most important sections of the site.
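
The pacing side of this can be sketched as a small session planner. This is only an illustration of the scheduling idea, not the headless-browser crawler itself, and every parameter here (`page_budget`, `active_hours`, the pause range, the priority scores) is an assumed placeholder rather than a value from our system.

```python
import random
from datetime import datetime, timedelta


def plan_session(sections, start, page_budget,
                 active_hours=(9, 23), min_pause=5.0, max_pause=45.0, rng=None):
    """Plan one human-looking browsing session.

    sections: list of (path, priority); higher priority is visited first.
    Returns a list of (visit_time, path), at most page_budget entries long.
    """
    rng = rng or random.Random()
    # Spend the limited page budget on the most important sections only.
    ordered = sorted(sections, key=lambda s: s[1], reverse=True)[:page_budget]

    plan = []
    when = start
    for path, _priority in ordered:
        # Humans pause for irregular intervals between page views.
        when += timedelta(seconds=rng.uniform(min_pause, max_pause))
        # Humans are not online 24/7: stop once we leave the active window.
        if not (active_hours[0] <= when.hour < active_hours[1]):
            break
        plan.append((when, path))
    return plan
```

For example, `plan_session([("/market", 3), ("/intro", 1), ("/leaks", 5)], datetime(2020, 1, 13, 10, 0), page_budget=2)` schedules `/leaks` and then `/market`, each after a random pause, and skips `/intro` entirely.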

Challenge #4 – Needle in a haystack

OK, we are making progress. We now crawl the TOR network, find new sites, and infiltrate forums and marketplaces. We index all the content we find in our Elasticsearch cloud, so it should now be easy to find the good (bad) stuff, right? Well, you guessed it: it’s not!

Apparently, finding relevant illicit activity requires expertise. Unlike on Google, searching for a brand or a name won’t return the most relevant results. You need to use special jargon that is specific to each use case and differs for each subject you search for.

To help our clients quickly find value in the data we collect, we used machine learning to classify it into multiple categories: drugs, terror, PII (personally identifiable information), hacking, weapons, financial and sexual content. This makes it easier to quickly locate relevant information and, from there, learn how to refine the queries that will pinpoint the relevant data.
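
As a toy illustration of text categorization, here is a tiny multinomial naive Bayes classifier in pure Python. It is a stand-in chosen for brevity: the article does not say which model or features are actually used, and the class name `TinyNaiveBayes` and any training sentences you feed it are made up for the example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyNaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> token -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def fit(self, labelled_texts):
        for text, label in labelled_texts:
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus the smoothed log likelihood of every token.
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Trained on a handful of labelled posts per category, `predict` returns the category whose vocabulary best matches a new post. A production system would need far more training data and a stronger model, but the workflow of labelling, training and tagging each indexed document is the same.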

We automatically detect and extract specific entities such as credit card numbers, social security numbers, phone numbers, emails and passwords, cryptocurrency addresses and more. This feature makes it easy to locate specific leaked information. It’s important to note that we sanitize validated sensitive information such as credit card numbers and passwords to prevent abuse. 
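
A minimal sketch of entity extraction with sanitization might look like this. It covers only emails and card numbers, validates card candidates with the Luhn checksum, and masks all but the last four digits. The masking policy, the regular expressions and the function names are assumptions for illustration; the article does not describe how the real sanitization works.

```python
import re

# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def luhn_valid(digits):
    """Luhn checksum: double every second digit from the right, sum, check mod 10."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def mask_card(digits):
    """Keep only the last four digits (an assumed policy)."""
    return "*" * (len(digits) - 4) + digits[-4:]


def extract_and_sanitize(text):
    """Return (sanitized_text, entities). Only Luhn-valid card numbers are masked."""
    entities = {"emails": EMAIL_RE.findall(text), "cards": []}

    def _replace(match):
        digits = re.sub(r"\D", "", match.group(0))
        if not luhn_valid(digits):
            return match.group(0)  # not a plausible card number, leave untouched
        masked = mask_card(digits)
        entities["cards"].append(masked)
        return masked

    sanitized = CARD_RE.sub(_replace, text)
    return sanitized, entities
```

Validating before masking keeps order numbers and other long digit strings searchable, while anything that passes the checksum is never stored in full.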

We also leveraged image recognition to overcome the language barrier, whether it’s multilingual sites or new, unknown jargon that describes new merchandise. By identifying the objects shown in the images on these websites, one can search for drugs or weapons without words. Pictures speak louder than words.

Challenge #5 – Is it kosher?

The short answer is yes, but there are rules to abide by. I don’t want to go too deep into the legal details in this article. However, the main principles are:

  • Not just anyone can gain access to the data. You need to run a KYC (Know Your Customer) process and make sure your clients search for data about themselves and do not abuse the data to harm others.
  • We do not download or store images as they might contain child pornography.
  • If we detect sensitive validated breached information (DB breaches), we don’t save the actual data; instead, we sanitize and clean it to prevent abuse.
  • We monitor the use of our system and have strict agreements with our clients about how they can use it.

Challenge #6 – Oh no, there are more

So, apparently there are dark networks other than TOR, and illicit activity isn’t confined to them either: ZeroNet, I2P, the open web and many chat applications, to name a few. Hackers are running around trying to find the right platform to abuse, and we need to run after them and match each platform with the right technology.

The bottom line

It’s a cat-and-mouse game. There are far more challenges than I described above. However, we are proud to be able to successfully crawl the TOR network to help governments, law enforcement agencies and many commercial companies prevent fraud, money laundering and drug abuse, and fight the bad guys. It is both a challenging and rewarding battle, so the bottom line is that it’s worth it!