This is a Scrapy project to crawl .onion websites from the Tor network.
It requires Tor (running in Tor2web mode) and the Polipo HTTP proxy; the crawled data is saved to Apache Solr.
- Install Tor in Tor2web mode (see the torrc sketch after this list)
- Install Polipo
- Install Solr
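For the Tor step, note that Tor2web mode is only available in a Tor binary built with tor2web support; a minimal torrc sketch, assuming the default SOCKS port (option names from the Tor manual of that era):

    # /etc/tor/torrc -- sketch, requires a tor2web-enabled build
    Tor2webMode 1
    SocksPort 9050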
Make sure your Solr index is empty (delete everything):
http://localhost:8080/solr/update?stream.body=%3Cdelete%3E%3Cquery%3E*:*%3C/query%3E%3C/delete%3E&commit=true
Configure the Solr schema:
$ cp /etc/solr/conf/schema.xml schema.xml.backup
$ sudo cp ahmia/solr/schema.xml /etc/solr/conf/schema.xml
$ sudo cp ahmia/solr/stopwords_en.txt /etc/solr/conf/
$ sudo service tomcat6 restart
Set up Polipo:
$ sudo cp ahmia/polipo_conf /etc/polipo/config
$ sudo service polipo restart
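The shipped ahmia/polipo_conf is the authoritative configuration; as a rough sketch, a Polipo config that chains to Tor's SOCKS listener typically looks like this (the ports are the defaults, 8123 for Polipo and 9050 for Tor, and are assumptions here):

    # /etc/polipo/config -- sketch, not the shipped file
    proxyAddress = "127.0.0.1"
    proxyPort = 8123
    socksParentProxy = "localhost:9050"
    socksProxyType = socks5
    diskCacheRoot = ""    # disable the on-disk cache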
Edit the crawler's depth limit, for example DEPTH_LIMIT = 1:
$ nano ahmia/onionbot/dirbot/settings.py
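DEPTH_LIMIT is a standard Scrapy setting; inside settings.py the relevant lines look roughly like this (MAX_PER_DOMAIN is inferred from the command-line examples below and is an assumption, not a verified excerpt):

    # ahmia/onionbot/dirbot/settings.py -- sketch, not a verified excerpt
    DEPTH_LIMIT = 1        # standard Scrapy setting: maximum link depth to follow
    MAX_PER_DOMAIN = 100   # project-specific cap on pages per domain (assumed)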
Run the crawler and write the scraped items to a JSON file:
$ scrapy crawl OnionSpider -o items.json -t json
or something similar, overriding settings on the command line:
$ scrapy crawl OnionSpider -s MAX_PER_DOMAIN=100 -s DEPTH_LIMIT=4
Test your Solr:
http://127.0.0.1:33433/solr/select/?q=*%3A*&version=2.2&start=0&rows=10&indent=on
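The same check can be scripted; a minimal sketch using only the Python standard library, assuming Solr answers on port 33433 as above (wt=json asks Solr to return JSON):

    import json
    import urllib.request

    # Match-all query, first 10 rows, JSON response.
    url = ('http://127.0.0.1:33433/solr/select/'
           '?q=*%3A*&start=0&rows=10&wt=json')
    with urllib.request.urlopen(url) as response:
        data = json.load(response)
    print(data['response']['numFound'], 'documents indexed')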
Delete a domain, for example aaaaaaaaaaaaaaaa:
curl "http://localhost:33433/solr/update?commit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>domain:*aaaaaaaaaaaaaaaa*</query></delete>"
Delete old data, i.e. documents older than 14 days:
curl "http://localhost:33433/solr/update?commit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>date_inserted:[* TO NOW-14DAYS]</query></delete>"
The items scraped by this project are websites. The item is defined in
dirbot.items.CrawledWebsiteItem
and carries the fields
domain, url, tor2web_url, title, text, date_inserted
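Given those field names, the item class presumably follows the standard Scrapy pattern; a sketch of the definition (only the field names come from above, everything else is the usual boilerplate):

    from scrapy.item import Item, Field

    class CrawledWebsiteItem(Item):
        """One crawled .onion page."""
        domain = Field()         # the .onion domain
        url = Field()            # full URL of the crawled page
        tor2web_url = Field()    # the URL rewritten for a Tor2web gateway
        title = Field()          # contents of the page's <title> tag
        text = Field()           # extracted page text
        date_inserted = Field()  # timestamp used by the NOW-14DAYS cleanup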
This project contains a single spider, OnionSpider, which you can verify by running:
$ scrapy list
OnionSpider
The middlewares route requests for .onion domains through the HTTP proxy; note that only .onion domains can be crawled this way. Furthermore, non-text responses are filtered out.
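The project's own middlewares are the reference implementation; the sketch below only shows the general shape of such a Scrapy downloader middleware (the class name, the Polipo address, and the exact filtering rule are assumptions):

    from urllib.parse import urlparse
    from scrapy.exceptions import IgnoreRequest

    class OnionProxyMiddleware(object):
        """Route .onion requests through the local HTTP proxy and
        drop everything that is not a text response."""

        PROXY = 'http://127.0.0.1:8123'  # Polipo's default port (assumed)

        def process_request(self, request, spider):
            host = urlparse(request.url).hostname or ''
            if not host.endswith('.onion'):
                raise IgnoreRequest('only .onion domains are crawled')
            # Returning None after setting the proxy lets Scrapy continue.
            request.meta['proxy'] = self.PROXY

        def process_response(self, request, response, spider):
            content_type = response.headers.get('Content-Type', b'')
            if not content_type.startswith(b'text/'):
                raise IgnoreRequest('non-text response filtered out')
            return response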