storm-crawler

A collection of resources for building low-latency, large scale web crawlers on Storm available under Apache License.

How to use

Available from Maven Central with :

<dependency>
    <groupId>com.digitalpebble</groupId>
    <artifactId>storm-crawler</artifactId>
    <version>0.3</version>
</dependency>

To get started with storm-crawler, it's recommended that you run the CrawlTopology in local mode.

NOTE: These instructions assume that you have Maven installed.

First, clone the project from github:

git clone https://github.com/DigitalPebble/storm-crawler

Then, run:

mvn clean compile exec:java -Dstorm.topology=com.digitalpebble.storm.crawler.CrawlTopology -Dexec.args="-conf crawler-conf.yaml -local"

Alternatively, generate an uberjar:

mvn clean package

and then submit the topology with storm jar:

storm jar target/storm-crawler-0.4-SNAPSHOT-jar-with-dependencies.jar com.digitalpebble.storm.crawler.CrawlTopology -conf crawler-conf.yaml -local

Name		Name	Last commit message	Last commit date
Latest commit History 154 Commits
src		src
.gitignore		.gitignore
LICENSE		LICENSE
NOTICE		NOTICE
README.md		README.md
crawler-conf.yaml		crawler-conf.yaml
eclipse-formatting-profile.xml		eclipse-formatting-profile.xml
pom.xml		pom.xml