A collection of resources for building low-latency, large-scale web crawlers on Storm, available under the Apache License.
Available from Maven Central with:
<dependency>
    <groupId>com.digitalpebble</groupId>
    <artifactId>storm-crawler</artifactId>
    <version>0.3</version>
</dependency>
To get started with storm-crawler, it's recommended that you run the CrawlTopology in local mode.
NOTE: These instructions assume that you have Maven installed.
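As a quick sanity check before starting, the sketch below reports whether the command-line tools used in these instructions are on your PATH. The helper name check_tool is hypothetical and not part of storm-crawler. The storm client is only needed if you submit the uberjar with storm jar, so a "missing: storm" line does not prevent running the topology through Maven.

```shell
# Report whether a command is available on the PATH.
# check_tool is a hypothetical helper name used only for this sketch.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# git and mvn are needed for the clone and build steps;
# storm is only needed for the "storm jar" alternative.
for tool in git mvn storm; do
    check_tool "$tool" || true
done
```

Any tool reported as missing should be installed before running the step that uses it.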
First, clone the project from GitHub:
git clone https://github.com/DigitalPebble/storm-crawler
Then, run:
mvn clean compile exec:java -Dstorm.topology=com.digitalpebble.storm.crawler.CrawlTopology -Dexec.args="-conf crawler-conf.yaml -local"
Alternatively, generate an uberjar:
mvn clean package
and then submit the topology with storm jar:
storm jar target/storm-crawler-0.4-SNAPSHOT-jar-with-dependencies.jar com.digitalpebble.storm.crawler.CrawlTopology -conf crawler-conf.yaml -local
Mailing list: http://groups.google.com/group/digitalpebble