Various utilities to deal with metadata and content provided by the Berlin State Library/Staatsbibliothek zu Berlin
The scripts work together as described below:
- SBBget: a Python script capable of downloading digitized media, the associated metadata, and fulltexts from the Berlin State Library's digitized collections (see the sketch after this list)
- it also extracts images that have been detected by the OCR and stores them in the desired file format, e.g., JPEG
- the extracted illustrations can also be stored as .tar files to facilitate distribution
- its logic is based on the more or less unique PPN identifier used at the Berlin State Library.
- some PPN lists are shipped for demonstration purposes; more can be obtained from the Berlin State Library or from the creator of the script.
- the script will create various folders below its current working directory, e.g.,
- downloads (fulltexts, original digitizations etc.) are stored at: sbbget_downloads/download_temp/
- extracted images are stored at: sbbget_downloads/extracted_images/
- METS/MODS files are stored at: sbbget_downloads/download_temp/<PPN>/__metsmods/
- the script comes with some sample collections that are described here
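
As a rough illustration of the PPN-based download flow (not the actual script), the sketch below fetches a METS/MODS record for a given PPN via OAI-PMH, stores it in the folder layout described above, and bundles extracted images into a .tar file. The OAI endpoint, metadata prefix, and identifier scheme are assumptions and should be checked against the library's current documentation.

```python
# Minimal sketch of the PPN-based download flow; requires requests.
# Endpoint, metadata prefix, and identifier scheme are assumptions.
import os
import tarfile

import requests

OAI_ENDPOINT = "https://oai.sbb.berlin/"  # hypothetical endpoint, verify before use


def download_metsmods(ppn: str) -> str:
    """Fetch the METS/MODS record for a PPN and store it in the
    sbbget_downloads folder layout described above."""
    target_dir = os.path.join("sbbget_downloads", "download_temp", ppn, "__metsmods")
    os.makedirs(target_dir, exist_ok=True)
    params = {
        "verb": "GetRecord",
        "metadataPrefix": "mets",  # assumed metadata prefix
        "identifier": f"oai:digital.staatsbibliothek-berlin.de:{ppn}",  # assumed scheme
    }
    response = requests.get(OAI_ENDPOINT, params=params, timeout=60)
    response.raise_for_status()
    mets_path = os.path.join(target_dir, ppn + ".xml")
    with open(mets_path, "w", encoding="utf-8") as f:
        f.write(response.text)
    return mets_path


def pack_extracted_images(ppn: str) -> None:
    """Bundle the extracted illustrations of one PPN into a .tar file
    to facilitate distribution."""
    image_dir = os.path.join("sbbget_downloads", "extracted_images", ppn)
    with tarfile.open(ppn + "_images.tar", "w") as tar:
        tar.add(image_dir, arcname=ppn)
```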
- OAI-Analyzer: a Python script that downloads METS/MODS files via OAI-PMH and analyzes them (see the sketch after this list)
- the results of the analyses are saved locally for further processing in various formats, e.g., Excel, CSV, or SQLite
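
A rough sketch of this harvest-and-export step, assuming the Sickle OAI-PMH client and pandas; the endpoint URL and the "mets" metadata prefix are placeholders, and the "analysis" is reduced to trivial per-record statistics.

```python
# Sketch of OAI-PMH harvesting plus CSV/Excel/SQLite export.
# Requires: sickle, pandas, openpyxl (for Excel output).
import sqlite3

import pandas as pd
from sickle import Sickle

client = Sickle("https://oai.sbb.berlin/")  # hypothetical endpoint
rows = []
for record in client.ListRecords(metadataPrefix="mets"):  # assumed prefix
    rows.append({
        "identifier": record.header.identifier,
        "datestamp": record.header.datestamp,
        "xml_size": len(record.raw),  # toy analysis: size of the METS record
    })
    if len(rows) >= 100:  # stop early for demonstration purposes
        break

df = pd.DataFrame(rows)
df.to_csv("analysis.csv", index=False)
df.to_excel("analysis.xlsx", index=False)
with sqlite3.connect("analysis.sqlite3") as conn:
    df.to_sql("records", conn, if_exists="replace", index=False)
```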
- a Python script that retrieves all fulltexts from an SBBget-created download directory and converts all files to raw text files
- the script works best with Python 3.6 (at least under macOS)
- additionally, the script runs named-entity recognition (NER) on all created raw text files and saves the results; the NER is based on flair (see the sketch after this list)
- alternatively, the script can operate on the result file created by OAI-Analyzer and download ALTO files directly; in this mode it serves as a Stabi fulltext corpus builder
- the script is based on NLTK, which requires additional installation steps:
- install NLTK in your Python environment
- when running the script, Python will ask you to install additional NLTK packages; the easiest way is to open a Python interpreter and launch NLTK's graphical installer:

```python
import nltk
nltk.download()
```
- further information can be found in an online book that also gives an introduction to natural language processing
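
As a rough illustration of the conversion and NER steps (not the script itself), the following sketch extracts raw text from an ALTO file and runs flair's pretrained German NER model over it; all file paths are hypothetical.

```python
# Sketch: ALTO-to-raw-text conversion followed by a flair NER pass.
# "de-ner" is flair's German NER model; use "ner" for English material.
import xml.etree.ElementTree as ET

from flair.data import Sentence
from flair.models import SequenceTagger


def alto_to_text(path: str) -> str:
    """Concatenate the CONTENT attributes of all ALTO String elements,
    ignoring the ALTO namespace version."""
    words = []
    for _, elem in ET.iterparse(path):
        if elem.tag.endswith("String"):
            words.append(elem.get("CONTENT", ""))
    return " ".join(words)


# hypothetical input path inside an SBBget download directory
text = alto_to_text("sbbget_downloads/download_temp/PPN123/fulltext/00000001.xml")
with open("PPN123_00000001.txt", "w", encoding="utf-8") as f:
    f.write(text)

tagger = SequenceTagger.load("de-ner")  # downloads the model on first use
sentence = Sentence(text)
tagger.predict(sentence)
for entity in sentence.get_spans("ner"):
    print(entity)
```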
- a Python script that parses files in the Pica+ format as provided by the GBV
- the script lets you choose interesting fields (as stored in the fieldsOfInterest list) and will output the contained data (see the sketch after this list)
- records will be separated by a NEW_RECORD string on the command line or by an empty line in the text format
- output can be saved in text format, separated by the language of the record
- standard fields are:
- title
- author (+ optional GND ID)
- country of publication (only the first entry in a specific extension of the DIN ISO 3166 format)
- publisher and place of publication
- documentation of the Pica+ format is only available in German here
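
The following is a minimal sketch of this parsing approach, not the actual script: it assumes the PICA plain serialization (one field per line, subfields introduced by `$`, records separated by blank lines); the tags in fieldsOfInterest and the input file name are illustrative.

```python
# Sketch of a PICA plain parser. Tags below are illustrative examples of
# fields of interest; adjust them to the catalog's actual field numbers.
fieldsOfInterest = ["021A", "028A", "019@", "033A"]


def parse_subfields(line):
    """Split 'TAG $acontent$bmore' into the tag and a {code: value} dict."""
    tag, _, rest = line.partition(" ")
    subfields = {}
    for chunk in rest.split("$")[1:]:
        if chunk:
            subfields[chunk[0]] = chunk[1:]
    return tag, subfields


def parse_records(path):
    """Yield one {tag: subfields} dict per record; records are assumed
    to be separated by empty lines."""
    record = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # blank line ends the current record
                if record:
                    yield record
                    record = {}
                continue
            tag, subfields = parse_subfields(line)
            if tag in fieldsOfInterest:
                record[tag] = subfields
    if record:
        yield record


for record in parse_records("dump.pica"):  # hypothetical input file
    print("NEW_RECORD")
    for tag, subfields in record.items():
        print(tag, subfields)
```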