Skip to content

Heritrix3 crawl log analysis for bulk fulltext crawls

Notifications You must be signed in to change notification settings

bnewbold/arabesque

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

22 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

                      _                            
                     | |                           
     __,   ,_    __, | |   _   ,   __,          _  
    /  |  /  |  /  | |/ \_|/  / \_/  |  |   |  |/  
    \_/|_/   |_/\_/|_/\_/ |__/ \/ \_/|_/ \_/|_/|__/
                                     |\            
                                     |/            

A simple python3 script to summarize Heritrix3 web crawl logs for a particular style of crawl: fetching large numbers of files associated with a persistent identifier. For example, crawling tens of millions of Open Access PDFs (via direct link or landing page URL) associated with a DOI.

Output is a (large) sqlite3 database file. Combine with sqlite-notebook to generate HTML reports:

https://github.com/bnewbold/sqlite-notebook

The simplest usage is to specify a seed-url/identifier mapping, a crawl log, and an output database file name:

./arabesque.py everything examples/crawl.log examples/seed_doi.tsv output.sqlite3
./arabesque.py postprocess examples/grobid_status_codes.tsv output.sqlite3

Then generate an HTML report:

sqlite-notebook.py examples/report_template.md output.sqlite3 > report.html

The core feature of this script to is resolve HTTP redirect chains. In the "backward" mode, all terminal responses (HTTP 200) that are in-scope (by mimetype) are resolved back to their original seed URL. There may be multiple in-scope terminal responses per seed (eg, via embeds or other URL extraction beans). In the "forward" mode, redirects are resolved to a single terminal response (if there is one), which may be 4xx, 5xx, or other failure response code.

The result is a single summary table with the following SQL schema:

CREATE TABLE IF NOT EXISTS crawl_result
    (initial_url text NOT NULL,
     identifier text,
     initial_domain text,
     breadcrumbs text,
     final_url text,
     final_domain text text,
     final_timestamp text,
     final_status_code text,
     final_sha1 text,
     final_mimetype text,
     final_was_dedupe bool,
     hit bool);

There aren't many tests, but what there is can be run with:

pytest-3 arabesque.py

About

Heritrix3 crawl log analysis for bulk fulltext crawls

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages