Prerequisites:
- OpenJDK 8
Run the Gradle wrapper from the project root to bootstrap the build:
./gradlew
Edit conf/hbase/hbase-site.xml and make sure you specify a valid HBase data directory path and a valid ZooKeeper data directory path:
```xml
<property>
  <name>hbase.rootdir</name>
  <value>file:///home/USER_NAME/hbase-data</value>
</property>
<property>
  <name>hbase.zookeeper.property.dataDir</name>
  <value>/home/USER_NAME/zookeeper-data</value>
</property>
```
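If those directories don't exist yet, you can create them up front; the paths below simply mirror the example values above:

```sh
# Create the HBase and ZooKeeper data directories (example paths)
mkdir -p /home/USER_NAME/hbase-data /home/USER_NAME/zookeeper-data
```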
These settings determine where HBase and ZooKeeper data are stored. You can then use the built-in Gradle task to set up HBase:
./gradlew setupHbase
This will download hbase-0.98.8-hadoop2 into the build directory.
Now start the HBase service by going into the HBase bin directory, $projectDir/build/hbase-0.98.8-hadoop2/bin/, and running:
./start-hbase.sh
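One quick, optional sanity check is to look for the HMaster process with the JDK's jps tool (the process id below is illustrative):

```sh
jps
# expected output should include a line like:
# 12345 HMaster
```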
Download and extract apache-nutch-2.3.1 into the build directory:
./gradlew setupNutch
Then edit conf/nutch-discovery/nutch-site.xml with your Watson Discovery credentials:
```xml
<property>
  <name>discovery.endpoint</name>
  <value></value>
</property>
<property>
  <name>discovery.username</name>
  <value></value>
</property>
<property>
  <name>discovery.password</name>
  <value></value>
</property>
<property>
  <name>discovery.configuration.id</name>
  <value></value>
</property>
<property>
  <name>discovery.environment.id</name>
  <value></value>
</property>
<property>
  <name>discovery.api.version</name>
  <value></value>
</property>
<property>
  <name>discovery.collection.id</name>
  <value></value>
</property>
```
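The username, password, and IDs come from your Discovery service credentials. As a reference point, the two less obvious fields might be filled in as below; both values are assumptions based on the classic Watson Discovery v1 API, so confirm them against your own service instance:

```xml
<!-- Illustrative values only; take the real ones from your Discovery instance -->
<property>
  <name>discovery.endpoint</name>
  <value>https://gateway.watsonplatform.net/discovery/api</value>
</property>
<property>
  <name>discovery.api.version</name>
  <value>2017-11-07</value>
</property>
```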
To build the plugin, run
./gradlew buildPlugin
This will take about 4-5 minutes to complete. That's it. Everything is now set up to crawl websites.
- Create the text file seed/urls.txt containing a list of seed URLs:
$ mkdir $projectDir/seed
$ echo "https://en.wikipedia.org/wiki/Apache_Nutch" >> $projectDir/seed/urls.txt
- Inject them into Nutch by supplying a file URL (note: a URL, not a plain filesystem path):
$projectDir/build/apache-nutch-2.3.1/runtime/local/bin/nutch inject file:///path/to/seed/
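For example, assuming the repository was cloned to /home/USER_NAME/nutch-indexer-discovery (a hypothetical path), the call would look like:

```sh
# Hypothetical clone location; the file:// scheme is what matters
$projectDir/build/apache-nutch-2.3.1/runtime/local/bin/nutch inject file:///home/USER_NAME/nutch-indexer-discovery/seed/
```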
- Generate a new set of URLs to fetch. This is based on both the injected URLs and outdated URLs already in the Nutch crawl database.
$projectDir/build/apache-nutch-2.3.1/runtime/local/bin/nutch generate -topN 5
The above command creates a job batch of 5 URLs; you can increase or decrease this number via -topN.
- Fetch the URLs. We are not clustering, so we can simply fetch all batches:
$projectDir/build/apache-nutch-2.3.1/runtime/local/bin/nutch fetch -all
- Now we parse all fetched pages:
$projectDir/build/apache-nutch-2.3.1/runtime/local/bin/nutch parse -all
- Update Nutch's internal database:
$projectDir/build/apache-nutch-2.3.1/runtime/local/bin/nutch updatedb -all
On the first run, this will only crawl the injected URLs. The procedure above should be repeated regularly to keep the index up to date (see the loop sketch below).
- Finally, run the indexing job, which pushes the parsed pages to Watson Discovery through this plugin:
$projectDir/build/apache-nutch-2.3.1/runtime/local/bin/nutch index -all
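To automate the repeat runs, a minimal sketch of a crawl loop could look like the following; the script itself, the round count, and the NUTCH variable are assumptions for illustration, not part of this project:

```sh
#!/usr/bin/env bash
# crawl-loop.sh - hypothetical helper that repeats the crawl cycle.
# $projectDir stands for the project root, as elsewhere in this README;
# adjust the path, -topN, and the number of rounds for your setup.
NUTCH="$projectDir/build/apache-nutch-2.3.1/runtime/local/bin/nutch"

for round in 1 2 3; do
  "$NUTCH" generate -topN 5   # select the next batch of URLs
  "$NUTCH" fetch -all         # download them
  "$NUTCH" parse -all         # extract text and outlinks
  "$NUTCH" updatedb -all      # fold the results back into the crawl db
  "$NUTCH" index -all         # push parsed pages to Watson Discovery
done
```

You can then verify that documents actually reached your collection by querying it directly, for instance with curl against the Discovery v1 query endpoint (the URL shape and credentials below are assumptions based on the classic Watson Discovery API):

```sh
curl -u "your-discovery-username:your-discovery-password" \
  "https://gateway.watsonplatform.net/discovery/api/v1/environments/your-environment-id/collections/your-collection-id/query?version=2017-11-07&query=nutch"
```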
Look at the Nutch source in src/plugin/indexer-* for examples of other indexer plugins.