Various utilities to deal with metadata and content provided by the Berlin State Library/Staatsbibliothek zu Berlin
The scripts work together as described below:
- SBBget: a Python script capable of downloading digitized media, the associated metadata, and fulltexts from the Berlin State Library's digitized collections (see the sketch after this list)
- it also extracts images that have been detected by the OCR and stores them in the desired file format, e.g., JPEG
- the extracted illustrations can also be stored as .tar files to facilitate distribution
- its logic is based on the more or less unique PPN identifier used at the Berlin State Library.
- some PPN lists are shipped for demonstration purposes; more can be obtained from the Berlin State Library or from the creator of the script.
- the script will create various folders below its current working directory, e.g.,
- downloads (fulltexts, original digitizations etc.) are stored at: sbbget_downloads/download_temp/
- extracted images are stored at: sbbget_downloads/extracted_images/
- METS/MODS files are stored at: sbbget_downloads/download_temp/<PPN>/__metsmods/
- the script comes with some sample collections that are described here
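
As a rough illustration of the PPN-based download flow (not the actual script), the sketch below fetches a METS/MODS record for a given PPN via OAI-PMH, stores it in the folder layout described above, and bundles extracted images into a .tar file. The OAI endpoint, metadata prefix, and identifier scheme are assumptions and should be checked against the library's current documentation.

```python
# Minimal sketch of the PPN-based download flow; requires requests.
# Endpoint, metadata prefix, and identifier scheme are assumptions.
import os
import tarfile

import requests

OAI_ENDPOINT = "https://oai.sbb.berlin/"  # hypothetical endpoint, verify before use


def download_metsmods(ppn: str) -> str:
    """Fetch the METS/MODS record for a PPN and store it in the
    sbbget_downloads folder layout described above."""
    target_dir = os.path.join("sbbget_downloads", "download_temp", ppn, "__metsmods")
    os.makedirs(target_dir, exist_ok=True)
    params = {
        "verb": "GetRecord",
        "metadataPrefix": "mets",  # assumed metadata prefix
        "identifier": f"oai:digital.staatsbibliothek-berlin.de:{ppn}",  # assumed scheme
    }
    response = requests.get(OAI_ENDPOINT, params=params, timeout=60)
    response.raise_for_status()
    mets_path = os.path.join(target_dir, ppn + ".xml")
    with open(mets_path, "w", encoding="utf-8") as f:
        f.write(response.text)
    return mets_path


def pack_extracted_images(ppn: str) -> None:
    """Bundle the extracted illustrations of one PPN into a .tar file
    to facilitate distribution."""
    image_dir = os.path.join("sbbget_downloads", "extracted_images", ppn)
    with tarfile.open(ppn + "_images.tar", "w") as tar:
        tar.add(image_dir, arcname=ppn)
```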
- OAI-Analyzer: a Python script that downloads METS/MODS files via OAI-PMH and analyzes them (see the sketch after this list)
- the results of the analyses are saved locally for further processing in various formats, e.g., Excel, CSV, or SQLite
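
A rough sketch of this harvest-and-export step, assuming the Sickle OAI-PMH client and pandas; the endpoint URL and the "mets" metadata prefix are placeholders, and the "analysis" is reduced to trivial per-record statistics.

```python
# Sketch of OAI-PMH harvesting plus CSV/Excel/SQLite export.
# Requires: sickle, pandas, openpyxl (for Excel output).
import sqlite3

import pandas as pd
from sickle import Sickle

client = Sickle("https://oai.sbb.berlin/")  # hypothetical endpoint
rows = []
for record in client.ListRecords(metadataPrefix="mets"):  # assumed prefix
    rows.append({
        "identifier": record.header.identifier,
        "datestamp": record.header.datestamp,
        "xml_size": len(record.raw),  # toy analysis: size of the METS record
    })
    if len(rows) >= 100:  # stop early for demonstration purposes
        break

df = pd.DataFrame(rows)
df.to_csv("analysis.csv", index=False)
df.to_excel("analysis.xlsx", index=False)
with sqlite3.connect("analysis.sqlite3") as conn:
    df.to_sql("records", conn, if_exists="replace", index=False)
```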
- a Python script that retrieves all fulltexts from an SBBget-created download directory and converts all files to raw text files
- the script works best with Python 3.6 (at least under macOS)
- additionally, the script runs named-entity recognition (NER) on all created raw text files and saves the results; the NER is based on flair (see the sketch after this list)
- alternatively, the script can operate on the result file created by OAI-Analyzer and download ALTO files directly; in this mode it serves as a Stabi fulltext corpus builder
- the script is based on NLTK, which requires additional installation steps:
- install NLTK in your Python environment
- when running the script, Python will ask you to install additional NLTK packages; the easiest way is to open a Python interpreter and launch NLTK's graphical installer:

```python
import nltk
nltk.download()
```
- further information can be found in an online book that also gives an introduction to natural language processing
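
As a rough illustration of the conversion and NER steps (not the script itself), the following sketch extracts raw text from an ALTO file and runs flair's pretrained German NER model over it; all file paths are hypothetical.

```python
# Sketch: ALTO-to-raw-text conversion followed by a flair NER pass.
# "de-ner" is flair's German NER model; use "ner" for English material.
import xml.etree.ElementTree as ET

from flair.data import Sentence
from flair.models import SequenceTagger


def alto_to_text(path: str) -> str:
    """Concatenate the CONTENT attributes of all ALTO String elements,
    ignoring the ALTO namespace version."""
    words = []
    for _, elem in ET.iterparse(path):
        if elem.tag.endswith("String"):
            words.append(elem.get("CONTENT", ""))
    return " ".join(words)


# hypothetical input path inside an SBBget download directory
text = alto_to_text("sbbget_downloads/download_temp/PPN123/fulltext/00000001.xml")
with open("PPN123_00000001.txt", "w", encoding="utf-8") as f:
    f.write(text)

tagger = SequenceTagger.load("de-ner")  # downloads the model on first use
sentence = Sentence(text)
tagger.predict(sentence)
for entity in sentence.get_spans("ner"):
    print(entity)
```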
- a Python script that parses files in the Pica+ format as provided by the GBV
- the script lets you choose interesting fields (as stored in the fieldsOfInterest list) and will output the contained data (see the sketch after this list)
- records will be separated by a NEW_RECORD string on the command line or by an empty line in the text format
- output can be saved in text format, separated by the language of the record
- standard fields are:
- title
- author (+ optional GND ID)
- country of publication (only the first entry in a specific extension of the DIN ISO 3166 format)
- publisher and place of publication
- documentation of the Pica+ format is only available in German here
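
The following is a minimal sketch of this parsing approach, not the actual script: it assumes the PICA plain serialization (one field per line, subfields introduced by `$`, records separated by blank lines); the tags in fieldsOfInterest and the input file name are illustrative.

```python
# Sketch of a PICA plain parser. Tags below are illustrative examples of
# fields of interest; adjust them to the catalog's actual field numbers.
fieldsOfInterest = ["021A", "028A", "019@", "033A"]


def parse_subfields(line):
    """Split 'TAG $acontent$bmore' into the tag and a {code: value} dict."""
    tag, _, rest = line.partition(" ")
    subfields = {}
    for chunk in rest.split("$")[1:]:
        if chunk:
            subfields[chunk[0]] = chunk[1:]
    return tag, subfields


def parse_records(path):
    """Yield one {tag: subfields} dict per record; records are assumed
    to be separated by empty lines."""
    record = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # blank line ends the current record
                if record:
                    yield record
                    record = {}
                continue
            tag, subfields = parse_subfields(line)
            if tag in fieldsOfInterest:
                record[tag] = subfields
    if record:
        yield record


for record in parse_records("dump.pica"):  # hypothetical input file
    print("NEW_RECORD")
    for tag, subfields in record.items():
        print(tag, subfields)
```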