Data Preprocessing Pipeline

This folder holds scripts that preprocess the textbooks and knowledge base used as training/evaluation data for term and relation extraction. Run the scripts in the order described here so that each script's dependencies are met.
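
For orientation, here is a minimal sketch of the documented run order as a Python driver. It is illustrative only: the repo does not ship this runner, the manual steps (setting book_title, copying hand-labelled term lists) still apply, and tag_type must be edited inside tag_sentences.py between the tagging runs.

```python
# Hypothetical driver showing the documented run order; not part of the repo.
import subprocess

PIPELINE = [
    "parse_life_bio_sentences.py",              # sentence extraction
    "openstax_book_import/parse_sentences.py",  # run once per OpenStax book
    "preprocess_textbooks.py",                  # terms + filtered sentences
    "preprocess_kb_bio101_terms.py",            # Bio101 KB terms
    # manual step: copy hand-labelled term lists to data/preprocessed/terms
    "collect_terms.py",                         # merge all term sources
    "tag_sentences.py",                         # set tag_type inside the script
    "split_term_extraction_data.py",            # or the relation variants below
]

for script in PIPELINE:
    subprocess.run(["python", script], check=True)  # stop on first failure
```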

Extract Sentences from Raw Book Formats

These scripts turn the raw HTML/CNXML versions of the books into separate, cleaned sentences:

  • parse_life_bio_sentences.py: Extracts sentences from the HTML version of the Life Biology textbook.
  • openstax_book_import: This folder holds scripts that extract sentences from CNXML versions of multiple OpenStax textbooks. The parse_sentences.py script must be run once per book, with each OpenStax title specified in the book_title variable (see the sketch below).
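
Since parse_sentences.py reads its target from a book_title variable, a small wrapper can illustrate the one-run-per-book pattern. This is a hypothetical sketch: the titles are placeholders, and runpy only injects book_title if the script reads it from its globals rather than reassigning it at the top of the file.

```python
# Hypothetical wrapper; in practice book_title is edited inside the script.
import runpy

BOOK_TITLES = ["Biology 2e", "Anatomy and Physiology"]  # placeholder titles

for title in BOOK_TITLES:
    # init_globals is only seen if parse_sentences.py does not
    # reassign book_title itself.
    runpy.run_path("openstax_book_import/parse_sentences.py",
                   init_globals={"book_title": title})
```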

Collect Term List and Preprocessed Sentences

These scripts build a comprehensive term list and a set of cleaned textbook sentences, both preprocessed with spaCy:

  • preprocess_textbooks.py: Extracts key terms from the parsed textbooks into separate text files, filters the textbook sentences down to the relevant ones, and saves them out with spaCy preprocessing.
  • preprocess_kb_bio101_terms.py: Extracts terms from the Bio101 knowledge base into a separate text file.
  • At this point, any manually curated, hand-labelled term lists must be copied to the data/preprocessed/terms folder before running the next script.
  • collect_terms.py: Assembles the terms from all sources into a single dataframe and preprocesses them with spaCy (see the sketch below). The previous two scripts must have been run first, and any hand-labelled term lists must already be in the data/preprocessed/terms folder.
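
As a rough illustration of what spaCy preprocessing of a term list can look like, the sketch below lemmatizes each term so surface variants match later during tagging. The column names and the lemmatization choice are assumptions, not the exact output of collect_terms.py.

```python
import pandas as pd
import spacy

nlp = spacy.load("en_core_web_sm")

terms = ["mitochondria", "cell walls", "Golgi apparatus"]  # example terms

rows = []
for term in terms:
    doc = nlp(term)
    rows.append({
        "term": term,
        "lemma": " ".join(tok.lemma_.lower() for tok in doc),
        "source": "example",  # e.g. textbook, KB, hand-labelled list
    })

term_df = pd.DataFrame(rows).drop_duplicates(subset="lemma")
print(term_df)
```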

Create Term Extraction Data

These scripts result in a train/dev/test split for term extraction modelling:

  • tag_sentences.py: Tags every term in each sentence for a set of textbooks, producing a set of sentences, each annotated with its tagged terms. Set tag_type = "term_extraction" to use the textbooks/terms designated for term extraction.
  • split_term_extraction_data.py: Takes the tagged sentences and assembles them into train, dev, and test sets, ensuring there is no overlap between the train set and the dev/test sets (see the sketch below).
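
The sketch below shows one way such an overlap-free split can work: deduplicate the sentences first, then partition, so no sentence can land in both train and evaluation. This is assumed logic for illustration, not the script's exact implementation.

```python
import random

def split_sentences(sentences, dev_frac=0.1, test_frac=0.1, seed=42):
    # Deduplicate so an identical sentence can never appear in two splits.
    unique = sorted(set(sentences))
    random.Random(seed).shuffle(unique)
    n_dev = int(len(unique) * dev_frac)
    n_test = int(len(unique) * test_frac)
    dev = unique[:n_dev]
    test = unique[n_dev:n_dev + n_test]
    train = unique[n_dev + n_test:]
    # Sanity check: no train/eval overlap.
    assert not (set(train) & set(dev)) and not (set(train) & set(test))
    return train, dev, test
```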

Create Relation Extraction Data

These scripts result in a train/dev/test split for relation extraction modelling:

  • tag_sentences.py: Tags every term in each sentence for a set of textbooks, producing a set of sentences, each annotated with its tagged terms. Set tag_type = "relation_extraction" to use the textbooks/terms designated for relation extraction.
  • generate_term_pairs.py: Takes the tagged sentences and enumerates all term pairs in each sentence, creating a new dataframe with one row per sentence/term-pair combination (see the sketch below).
  • split_relation_extraction_data.py: Takes the sentence/term pairs and assembles them into train, dev, and test sets, ensuring there is no overlap between the train set and the dev/test sets.
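
The pair enumeration step amounts to expanding each sentence's tagged terms into every unordered combination. The sketch below shows the idea with an assumed row layout; the actual column names in the script may differ.

```python
from itertools import combinations

import pandas as pd

tagged = pd.DataFrame({
    "sentence": ["The nucleus stores DNA in the cell."],
    "terms": [["nucleus", "DNA", "cell"]],  # terms tagged in that sentence
})

rows = [
    {"sentence": row.sentence, "term1": a, "term2": b}
    for row in tagged.itertuples()
    for a, b in combinations(row.terms, 2)
]
pairs_df = pd.DataFrame(rows)
print(pairs_df)  # 3 tagged terms -> 3 rows for this sentence
```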

Helper Scripts

  • data_processing_utils.py: Contains the data processing utility functions used by the scripts above. The most important is the term tagging function, which includes a range of checks to ensure terms are properly tagged within sentences.
  • test_data_processing_utils.py: Tests for the utility functions.
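
To give a feel for what the term tagging function does, here is a deliberately simplified sketch that marks term spans in a tokenized sentence with BIO tags, preferring longer terms when spans overlap. The real function in data_processing_utils.py handles many more cases (lemmatized matches, plurals, punctuation), and the BIO scheme here is an assumption.

```python
def tag_terms(tokens, terms):
    # Simplified illustration only; not the repo's implementation.
    tags = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    # Match longer terms first so "cell wall" wins over "cell".
    for term in sorted(terms, key=lambda t: -len(t.split())):
        term_toks = term.lower().split()
        n = len(term_toks)
        for i in range(len(tokens) - n + 1):
            span_free = all(tag == "O" for tag in tags[i:i + n])
            if lowered[i:i + n] == term_toks and span_free:
                tags[i:i + n] = ["B-TERM"] + ["I-TERM"] * (n - 1)
    return tags

tokens = "The cell wall protects the cell .".split()
print(list(zip(tokens, tag_terms(tokens, ["cell wall", "cell"]))))
# [('The','O'), ('cell','B-TERM'), ('wall','I-TERM'), ('protects','O'),
#  ('the','O'), ('cell','B-TERM'), ('.','O')]
```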