Unix Utility Scripts
Heritrix comes bundled with Unix utility scripts.
manifest_bundle.pl
This script bundles all resources referenced in the crawl manifest file. A bundle is an uncompressed or compressed tar ball. The directory structure of the tar ball is:
- Top-level directory (crawl name)
- Three default subdirectories
- Any other arbitrary subdirectories
Script Usage
manifest_bundle.pl crawl_name manifest_file -f output_tar_file -z [ -flag directory ]
-f     output tar file. If omitted, output goes to stdout.
-z     compress the tar file with gzip.
-flag  any upper-case letter. The defaults C, L, and R map to the configurations, logs, and reports directories.
manifest_bundle.pl example
manifest_bundle.pl testcrawl crawl-manifest.txt -f /0/testcrawl/manifest-bundle.tar.gz -z -F filters
For the example above, the tar ball will contain the following directory
structure:
|- testcrawl
    |- configurations
    |- logs
    |- reports
    |- filters
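The layout above can be sketched with Python's tarfile module. This is only an illustration of the expected bundle structure, not output from running manifest_bundle.pl itself; the directory names mirror the example.

```python
import io
import tarfile

# Hypothetical bundle layout mirroring the example above.
members = [
    "testcrawl/configurations",
    "testcrawl/logs",
    "testcrawl/reports",
    "testcrawl/filters",
]

# Build a gzip-compressed tar ball in memory, as the -z flag would.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    for name in members:
        info = tarfile.TarInfo(name=name)
        info.type = tarfile.DIRTYPE
        tar.addfile(info)

# Listing the bundle shows the expected directory structure.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r:gz") as tar:
    names = tar.getnames()
print(names)
```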
hoppath.pl
This Perl script, found in $HERITRIX_HOME/bin, recreates the hop path to the specified URI. The hop path is the path of links (URIs) that were followed to get to the specified URI.
Script Usage
hoppath.pl crawl.log URI_PREFIX
crawl.log Full-path to Heritrix crawl.log instance.
URI_PREFIX URI we're querying about. Must begin 'http(s)://' or 'dns:'.
Wrap this parameter in quotes to avoid shell interpretation
of any '&' present in URI_PREFIX.
hoppath.pl Example
hoppath.pl crawl.log 'http://www.house.gov/'
hoppath.pl Result
2004-02-25-02-36-06 - http://www.house.gov/house/MemberWWW_by_State.html
2004-02-25-02-36-06 L http://wwws.house.gov/search97cgi/s97_cgi
2004-02-25-03-30-38 L http://www.house.gov/
The L in the example refers to the type of link followed.
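The reconstruction hoppath.pl performs can be sketched in Python. This is a minimal sketch, assuming the standard crawl.log field order (URI in the fourth column, discovery path in the fifth, referring "via" URI in the sixth); the log lines below are synthetic, not real Heritrix output.

```python
# Minimal sketch of hop-path reconstruction over a crawl.log,
# assuming fields: timestamp status size URI discovery-path via ...
def hop_path(log_lines, target_uri):
    # Map each logged URI to (timestamp, hop-type letter, via URI).
    entries = {}
    for line in log_lines:
        fields = line.split()
        if len(fields) < 6:
            continue
        ts, uri, discovery, via = fields[0], fields[3], fields[4], fields[5]
        entries[uri] = (ts, discovery[-1] if discovery != "-" else "-", via)

    # Walk the "via" chain back from the target toward a seed.
    path, uri = [], target_uri
    while uri in entries:
        ts, hop, via = entries.pop(uri)  # pop() guards against cycles
        path.append((ts, hop, uri))
        uri = via if via != "-" else None
    return path  # target first, seed last

# Synthetic log lines for illustration only:
log = [
    "2004-02-25-03-30-38 200 1024 http://www.house.gov/ - -",
    "2004-02-25-02-36-06 200 2048 http://wwws.house.gov/search97cgi/s97_cgi L http://www.house.gov/",
    "2004-02-25-02-36-06 200 4096 http://www.house.gov/house/MemberWWW_by_State.html L http://wwws.house.gov/search97cgi/s97_cgi",
]
for ts, hop, uri in hop_path(log, "http://www.house.gov/house/MemberWWW_by_State.html"):
    print(ts, hop, uri)
```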
The org.archive.crawler.util.RecoveryLogMapper Java class is similar to the hoppath.pl script. It was contributed by Mike Schwartz. The RecoveryLogMapper parses a Heritrix recovery log file and builds maps that allow a caller to look up any seed URI and get back a list of all URIs successfully crawled from that seed. The RecoveryLogMapper can also find the seed URI from which any crawled URI was captured.
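The two lookups described above can be sketched in Python. This sketch uses an invented line format ("ADD uri via" / "SUCCESS uri") purely for illustration; it is not the actual Heritrix recovery-log syntax, and build_maps is a hypothetical helper, not the RecoveryLogMapper API.

```python
from collections import defaultdict

# Sketch of the two maps RecoveryLogMapper provides, over an
# illustrative (invented) line format, not real recovery-log syntax.
def build_maps(lines):
    via = {}            # crawled URI -> URI it was discovered from
    succeeded = set()   # URIs marked successfully crawled
    for line in lines:
        parts = line.split()
        if parts[0] == "ADD":
            via[parts[1]] = parts[2] if len(parts) > 2 else None
        elif parts[0] == "SUCCESS":
            succeeded.add(parts[1])

    def seed_of(uri):
        # Follow discovery references back until a URI with no "via".
        while via.get(uri):
            uri = via[uri]
        return uri

    seed_to_uris = defaultdict(list)   # seed -> successfully crawled URIs
    uri_to_seed = {}                   # crawled URI -> its seed
    for uri in succeeded:
        seed = seed_of(uri)
        seed_to_uris[seed].append(uri)
        uri_to_seed[uri] = seed
    return seed_to_uris, uri_to_seed

# Tiny synthetic example:
lines = [
    "ADD http://seed.example/",
    "SUCCESS http://seed.example/",
    "ADD http://seed.example/a http://seed.example/",
    "SUCCESS http://seed.example/a",
]
seed_to_uris, uri_to_seed = build_maps(lines)
```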