StabiHacks

Various utilities to deal with metadata and content provided by the Berlin State Library/Staatsbibliothek zu Berlin

The scripts work together as illustrated below:

[Figure: general workflow between SBBget, OAI-Analyzer, and Fulltext statistics]

SBBget

  • a Python script that downloads digitized media, the associated metadata, and the fulltext from the Berlin State Library's digitized collections.
  • it also extracts images that have been detected by the OCR and stores them in the desired file format, e.g., JPEG
  • the extracted illustrations can also be stored as .tar files to facilitate distribution
  • its logic is based on the more or less unique PPN identifier used at the Berlin State Library.
  • some PPN lists are shipped for demonstration purposes. more can be obtained from the Berlin State Library or from the creator of the script.
  • the script will create various folders below its current working directory, e.g.,
    • downloads (fulltexts, original digitizations etc.) are stored at: sbbget_downloads/download_temp/
    • extracted images are stored at: sbbget_downloads/extracted_images/
    • METS/MODS files are stored at: sbbget_downloads/download_temp//__metsmods/
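
The folder layout and the .tar packaging described above can be sketched as follows. This is a minimal illustration and not SBBget's actual code: only the folder names come from the list above, while the function names `prepare_folders` and `pack_images` and the archive name are hypothetical.

```python
import tarfile
from pathlib import Path
from typing import List

# Folder names as listed above; all other names in this sketch are hypothetical.
BASE_DIR = Path("sbbget_downloads")
DOWNLOAD_DIR = BASE_DIR / "download_temp"
IMAGE_DIR = BASE_DIR / "extracted_images"


def prepare_folders(root: Path = Path(".")) -> List[Path]:
    """Create the download and image folders below root and return them."""
    folders = [root / DOWNLOAD_DIR, root / IMAGE_DIR]
    for folder in folders:
        folder.mkdir(parents=True, exist_ok=True)
    return folders


def pack_images(root: Path, archive_name: str = "extracted_images.tar") -> Path:
    """Bundle the image folder into one uncompressed .tar file for distribution."""
    # The archive is written next to the image folder so it never contains itself.
    archive_path = root / BASE_DIR / archive_name
    with tarfile.open(archive_path, "w") as tar:
        tar.add(root / IMAGE_DIR, arcname=IMAGE_DIR.name)
    return archive_path
```

Calling `prepare_folders()` without arguments creates the folders below the current working directory, which matches the behaviour described in the list.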

Sample Data

  • the script comes with some sample collections that are described here

OAI-Analyzer

  • a Python script that downloads METS/MODS files via OAI-PMH and analyzes them
  • the results of the analyses are saved locally in various formats, e.g., Excel, CSV, or SQLite, for further processing
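
As a rough sketch of what saving analysis results as CSV and SQLite can look like, the following uses only the Python standard library. The column names (`ppn`, `title`, `year`), the table name, and the function names are assumptions made for illustration; the actual output schema of OAI-Analyzer is not described here.

```python
import csv
import sqlite3

# Hypothetical columns; the real OAI-Analyzer output may differ.
FIELDS = ["ppn", "title", "year"]


def save_csv(rows, path):
    """Write a list of dicts to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


def save_sqlite(rows, path):
    """Store the same dicts in an SQLite table, replacing rows with a known PPN."""
    connection = sqlite3.connect(str(path))
    try:
        with connection:  # commits on success, rolls back on error
            connection.execute(
                "CREATE TABLE IF NOT EXISTS records "
                "(ppn TEXT PRIMARY KEY, title TEXT, year INTEGER)"
            )
            connection.executemany(
                "INSERT OR REPLACE INTO records (ppn, title, year) "
                "VALUES (:ppn, :title, :year)",
                rows,
            )
    finally:
        connection.close()
```

Keeping the PPN as primary key means that re-running an analysis overwrites earlier rows instead of duplicating them, which is one possible design choice and not necessarily the one the script makes.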

Fulltext Analysis

  • a Python script that retrieves all fulltexts from an SBBget-created download directory and converts all files to raw text files
  • the script works best with Python 3.6 (at least under macOS)
  • additionally, the script runs named entity recognition (NER) on all created raw text files and saves the results; the NER is based on flair
  • alternatively, the script can operate on the result file created by OAI-Analyzer and download ALTO files directly; in this mode it serves as a Stabi fulltext corpus builder
  • the script is based on NLTK, which needs additional installation steps, i.e.:
    • install NLTK in your Python environment
    • when running the script, Python will ask you to install additional NLTK packages; the easiest way is to open a Python interpreter and run the following to launch NLTK's graphical installer:
    import nltk
    nltk.download()
    
    • further information can be found in an online book that also gives an introduction to natural language processing
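
The conversion of ALTO fulltexts to raw text can be sketched as follows. This is a simplified illustration, not the script's implementation: it assumes that each `TextLine` element holds its words as `String` children with a `CONTENT` attribute, ignores hyphenation markers, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip an XML namespace such as '{http://...}String' down to 'String'."""
    return tag.rsplit("}", 1)[-1]


def alto_to_text(alto_xml: str) -> str:
    """Return one line of plain text per ALTO TextLine element."""
    root = ET.fromstring(alto_xml)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        # Collect the recognized words of this line in document order.
        words = [
            child.attrib["CONTENT"]
            for child in element
            if local_name(child.tag) == "String" and "CONTENT" in child.attrib
        ]
        if words:
            lines.append(" ".join(words))
    return "\n".join(lines)


SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine><String CONTENT="Berlin"/><SP/><String CONTENT="1850"/></TextLine>
    <TextLine><String CONTENT="Erster"/><SP/><String CONTENT="Band"/></TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

if __name__ == "__main__":
    print(alto_to_text(SAMPLE))
```

Matching on the local tag name keeps the sketch independent of the ALTO schema version used in a given file.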

Pica Plus

  • a Python script that parses files in the Pica+ format as provided by the GBV

  • the script lets you choose interesting fields (as stored in the fieldsOfInterest list) and will output the contained data

  • records will be separated by a NEW_RECORD string on the command line or by an empty line in the text format

  • output can be saved in text format, separated by the language of the record

  • standard fields are:

    • title
    • author (+ optional GND ID)
    • country of publication (only the first entry in a specific extension of the DIN ISO 3166 format)
    • publisher and place of publication
  • documentation of the Pica Plus format is only available in German here:
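
The field selection described above can be sketched as follows. The input serialization is an assumption made for this example (one field per line, a tag followed by `$`-prefixed subfields, records separated by empty lines), as are the tags `021A` for the title and `028A` for the author; real Pica+ exports may use different delimiters, so treat this as an illustration and not as the script's parser.

```python
# Assumed mapping of Pica+ tags to readable names (illustrative only).
FIELDS_OF_INTEREST = {"021A": "title", "028A": "author"}
NEW_RECORD = "NEW_RECORD"


def parse_subfields(payload, delimiter="$"):
    """Split '$aFoo$bBar' into {'a': 'Foo', 'b': 'Bar'}, keeping the first value per code."""
    subfields = {}
    for chunk in payload.split(delimiter)[1:]:
        if chunk:
            subfields.setdefault(chunk[0], chunk[1:].strip())
    return subfields


def parse_records(text):
    """Return one dict per record, containing only the fields of interest."""
    records, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # An empty line closes the current record, if it collected anything.
            if current:
                records.append(current)
                current = {}
            continue
        tag, _, payload = line.partition(" ")
        if tag in FIELDS_OF_INTEREST:
            current[FIELDS_OF_INTEREST[tag]] = parse_subfields(payload)
    if current:
        records.append(current)
    return records


SAMPLE = """021A $aEin Beispieltitel
028A $dErika$aMustermann
033A $pBerlin$nVerlag

021A $aZweiter Titel
"""

if __name__ == "__main__":
    for record in parse_records(SAMPLE):
        print(NEW_RECORD)
        for name, subfields in record.items():
            print(name, subfields)
```

Records that contain none of the selected fields are dropped in this sketch, which may or may not match the behaviour of the actual script.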