
Commit

increase consistency
bethke committed May 9, 2018
1 parent 94fbd5c commit a73fc58
Showing 6 changed files with 29 additions and 125 deletions.
3 changes: 1 addition & 2 deletions core_models/chunker/README.md
@@ -14,7 +14,6 @@ In this example the sentence can be divided into 4 phrases, `The quick brown fox

We used the CoNLL2000 dataset in our example for training a phrase chunker. More info about this dataset can be found [here](https://www.clips.uantwerpen.be/conll2000/chunking/).

## Usage
### Training
Train a model with default parameters (only tokens, default network settings):
`python train.py`
@@ -24,7 +23,7 @@ Saving the model after training is done automatically:
* `<chunker>_settings.dat` - Model topology and input settings

### Inference
To run inference on a trained model, you need the pre-trained `chunker.prm` and `chunker_settings.dat` model files.

Quick example:
```
4 changes: 1 addition & 3 deletions core_models/kvmemn2n/README.md
@@ -3,11 +3,10 @@ This directory contains an implementation of an end-to-end key-value memory netw
The idea behind this method is to be able to answer a wide range of questions based on a large set of textual information, as opposed to a restricted or sparse knowledge base.

# Dataset
Please download the tar file from http://www.thespermwhale.com/jaseweston/babi/movieqa.tar.gz and expand the folder into your desired data directory or `--data_dir`.
Please download the tar file from http://www.thespermwhale.com/jaseweston/babi/movieqa.tar.gz and expand the folder into your desired data directory or `--data_dir`. The dataset can be downloaded from the command line if it is not found, and all of the preprocessing happens at the beginning of training.
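If you prefer to script the manual download and extraction step above, a minimal sketch using only the Python standard library is shown below; the `data` target directory is an assumption, so point it at whatever you pass as `--data_dir`.

```python
# Sketch: download and extract the MovieQA archive into a data directory.
# The "data" directory name is an assumption; match it to your --data_dir.
import os
import tarfile
import urllib.request

URL = "http://www.thespermwhale.com/jaseweston/babi/movieqa.tar.gz"
data_dir = "data"
archive = os.path.join(data_dir, "movieqa.tar.gz")

os.makedirs(data_dir, exist_ok=True)
if not os.path.exists(archive):
    urllib.request.urlretrieve(URL, archive)  # download the tar file
with tarfile.open(archive, "r:gz") as tar:
    tar.extractall(data_dir)                  # expands the movieqa folder
```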

# Training
The base command to train is `python train_kvmemn2n.py`.
To get all the options, run `python train_kvmemn2n.py -h`.
The following are example commands for training with the knowledge base and with raw text, respectively:
```
python train_kvmemn2n.py --epochs 2000 --batch_size 32 --emb_size 100 --use_v_luts --model_file path_to_model_dir/kb_model
@@ -17,7 +16,6 @@ python train_kvmemn2n.py --mem_mode text --epochs 2000 --batch_size 32 --emb_siz
```

# Interactive Mode

You can enter an interactive mode using the argument `--interactive`. The interactive mode can be called to launch at the end of training, or directly after `--inference`. To run inference on the KB model from above, we would call:

```
16 changes: 13 additions & 3 deletions core_models/most_common_word_sense/README.md
@@ -2,6 +2,14 @@
The most common word sense algorithm's goal is to extract the most common sense of a target word. The input to the algorithm is the target word, and the output is the set of senses of the target word, where each sense is scored according to how commonly it is used in the language.

## Prepare training and validation test sets
The training module takes as input a gold-standard CSV file, which is a list of target words where each word is associated with a CLASS_LABEL - a correct (true example) or an incorrect (false example) sense. A sense consists of the definition and the inherited hypernyms of the target word in a specific sense.
The user needs to prepare this gold-standard CSV file in advance. The file should include the following 4 columns (a minimal sketch of such a file follows the list below):
| TARGET_WORD | DEFINITION | SEMANTIC_BRANCH | CLASS_LABEL
where:
1. TARGET_WORD (string): the word that you want to get the most common sense of, e.g. chair
2. DEFINITION (string): the definition of the word (usually a single sentence) extracted from an external resource such as WordNet or Wikidata, e.g. an artifact that is designed for sitting
3. SEMANTIC_BRANCH (string): [comma separated] the inherited hypernyms extracted from an external resource such as WordNet or Wikidata, e.g. [furniture, artifact]
4. CLASS_LABEL (string): a binary value 0/1 that represents whether the sense (definition and semantic branch) is the most common sense of the target word, e.g. 1
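As a concrete illustration, the snippet below writes a two-row gold-standard file with the four columns described above; the file name and the second example row are hypothetical.

```python
# Sketch: build a tiny gold-standard CSV with the 4 required columns.
# "gold_standard.csv" and the second example row are hypothetical.
import csv

rows = [
    # TARGET_WORD, DEFINITION, SEMANTIC_BRANCH, CLASS_LABEL
    ("chair", "an artifact that is designed for sitting", "furniture, artifact", "1"),
    ("chair", "the position of professor", "position, status", "0"),
]

with open("gold_standard.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["TARGET_WORD", "DEFINITION", "SEMANTIC_BRANCH", "CLASS_LABEL"])
    writer.writerows(rows)
```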

`python prepare_data [--gold_standard_file GOLD_STANDARD_FILE]
[--word_embedding_model_file WORD_EMBEDDING_MODEL_FILE]
@@ -13,19 +21,21 @@ Train the MLP classifier and evaluate it.

`python train.py [--data_set_file DATA_SET_FILE] [--model_prm MODEL_PRM]`

### Example:
Quick example:

`python train.py --data_set_file data/data_set.pkl
--model_prm data/wsd_classification_model.prm`

## Inference
When running inference, note that the results are printed to the terminal in different colors; a white terminal background is therefore best for viewing the results.

`python inference.py [--max_num_of_senses_to_search N]
[--input_inference_examples_file INPUT_INFERENCE_EXAMPLE_FILE]
[--model_prm MODEL_PRM] [--word_embedding_model_file WORD_EMBEDDING_MODEL_FILE]`

### Example:
Quick example:

`python inference.py --max_num_of_senses_to_search 3
--input_inference_examples_file data/input_inference_examples.csv
--word_embedding_model_file pretrained_models/GoogleNews-vectors-negative300.bin
--model_prm data/wsd_classification_model.prm`

88 changes: 6 additions & 82 deletions core_models/np2vec/README.md
@@ -1,7 +1,7 @@
# NP2vec - Word Embedding Model Training for Noun Phrases

Noun Phrases (NP) play a particular role in NLP algorithms.
This code trains a word embedding model for NP's using the [word2vec](https://code.google.com/archive/p/word2vec/) or [fasttext](https://github.com/facebookresearch/fastText) algorithm.
It assumes that the NP's are already extracted and marked in the input corpus.
All the terms in the corpus are used as context in order to train the word embedding model; however, at the end of training, only the word embeddings of the NP's are stored, except in the case of
fasttext training with word_ngrams=1; in this case, we store all the word embeddings, including non-NP's, in order to be able to estimate word embeddings of out-of-vocabulary NP's (NP's that don't appear in
@@ -12,10 +12,11 @@ Note that this code can be also used to train a word embedding's model on any ma
NP's have to be marked in the corpus by a marking character between the words of the NP and as a suffix of the NP.
For example, if the marking character is '\_', the NP "Natural Language Processing" will be marked as "Natural_Language_Processing_".
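The marking convention can be reproduced with a few lines of plain Python; the sketch below assumes the NP token spans have already been extracted by whatever chunker or NP extractor you use.

```python
# Sketch: mark pre-extracted NP's in a tokenized sentence with mark_char='_'.
# NP spans are assumed to come from an external NP extractor.
def mark_nps(tokens, np_spans, mark_char="_"):
    """Join each NP's tokens with mark_char and append it as a suffix."""
    spans = dict(np_spans)  # {start_index: end_index (exclusive)}
    marked, i = [], 0
    while i < len(tokens):
        if i in spans:
            end = spans[i]
            marked.append(mark_char.join(tokens[i:end]) + mark_char)
            i = end
        else:
            marked.append(tokens[i])
            i += 1
    return marked

tokens = ["I", "love", "Natural", "Language", "Processing", "."]
print(mark_nps(tokens, [(2, 5)]))
# ['I', 'love', 'Natural_Language_Processing_', '.']
```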

## Training Usage
## Training:
To train, use the following command:

```
usage: main_train.py [-h] [--corpus CORPUS] [--corpus_format {json,txt}]
python main_train.py [-h] [--corpus CORPUS] [--corpus_format {json,txt}]
[--mark_char MARK_CHAR]
[--word_embedding_type {word2vec,fasttext}]
[--np2vec_model_file NP2VEC_MODEL_FILE] [--binary]
@@ -25,88 +26,11 @@ usage: main_train.py [-h] [--corpus CORPUS] [--corpus_format {json,txt}]
[--workers WORKERS] [--hs {0,1}] [--negative NEGATIVE]
[--cbow_mean {0,1}] [--iter ITER] [--min_n MIN_N]
[--max_n MAX_N] [--word_ngrams {0,1}]
optional arguments:
-h, --help show this help message and exit
--corpus CORPUS path to the file with the input marked corpus
--corpus_format {json,txt}
format of the input marked corpus; txt and json
formats are supported. For json format, the file
should contain an iterable of sentences. Each sentence
is a list of terms (unicode strings) that will be used
for training.
--mark_char MARK_CHAR
special character that marks NP's suffix.
--word_embedding_type {word2vec,fasttext}
word embedding model type; word2vec and fasttext are
supported.
--np2vec_model_file NP2VEC_MODEL_FILE
path to the file where the trained np2vec model has to
be stored.
--binary boolean indicating whether the model is stored in
binary format; if word_embedding_type is fasttext and
word_ngrams is 1, binary should be set to True.
--sg {0,1} model training hyperparameter, skip-gram. Defines the
training algorithm. If 1, CBOW is used, otherwise,
skip-gram is employed.
--size SIZE model training hyperparameter, size of the feature
vectors.
--window WINDOW model training hyperparameter, maximum distance
between the current and predicted word within a
sentence.
--alpha ALPHA model training hyperparameter. The initial learning
rate.
--min_alpha MIN_ALPHA
model training hyperparameter. Learning rate will
linearly drop to `min_alpha` as training progresses.
--min_count MIN_COUNT
model training hyperparameter, ignore all words with
total frequency lower than this.
--sample SAMPLE model training hyperparameter, threshold for
configuring which higher-frequency words are randomly
downsampled, useful range is (0, 1e-5)
--workers WORKERS model training hyperparameter, number of worker
threads.
--hs {0,1} model training hyperparameter, hierarchical softmax.
If set to 1, hierarchical softmax will be used for
model training. If set to 0, and `negative` is non-
zero, negative sampling will be used.
--negative NEGATIVE model training hyperparameter, negative sampling. If >
0, negative sampling will be used, the int for
negative specifies how many "noise words" should be
drawn (usually between 5-20). If set to 0, no negative
sampling is used.
--cbow_mean {0,1} model training hyperparameter. If 0, use the sum of
the context word vectors. If 1, use the mean, only
applies when cbow is used.
--iter ITER model training hyperparameter, number of iterations.
--min_n MIN_N fasttext training hyperparameter. Min length of char
ngrams to be used for training word representations.
--max_n MAX_N fasttext training hyperparameter. Max length of char
ngrams to be used for training word representations.
Set `max_n` to be lesser than `min_n` to avoid char
ngrams being used.
--word_ngrams {0,1} fasttext training hyperparameter. If 1, enriches
word vectors with subword (ngrams) information. If 0,
this is equivalent to word2vec training.
```
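The hyperparameters above mirror the usual word2vec/fasttext parameters, so a minimal training loop could look like the sketch below. It assumes a gensim (4.x) backend, which the README does not state, and uses a toy in-memory corpus in place of `--corpus`; all parameter values are illustrative.

```python
# Sketch: train word2vec on a marked corpus, assuming a gensim 4.x backend.
# The toy corpus and all parameter values are illustrative only.
from gensim.models import Word2Vec

corpus = [
    ["I", "love", "Natural_Language_Processing_", "."],
    ["Natural_Language_Processing_", "is", "fun", "."],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=100,   # --size
    window=10,         # --window
    alpha=0.025,       # --alpha
    min_alpha=0.0001,  # --min_alpha
    min_count=1,       # --min_count (tiny corpus, so keep everything)
    sample=1e-5,       # --sample
    workers=4,         # --workers
    sg=1,              # --sg (in gensim, 1 selects skip-gram)
    hs=0,              # --hs
    negative=25,       # --negative
    cbow_mean=1,       # --cbow_mean (only used when sg=0)
    epochs=15,         # --iter
)

# Embeddings of marked NP's (those carrying the mark_char suffix); per the
# README, only these are kept in the plain word2vec case.
np_vectors = {w: model.wv[w] for w in model.wv.index_to_key if w.endswith("_")}

# Store the vectors in word2vec format (--np2vec_model_file, --binary).
model.wv.save_word2vec_format("np2vec_model.txt", binary=False)
```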

## Inference Usage

## Inference:
Inference on a model can then be completed using:
```
usage: main_inference.py [-h] [--np2vec_model_file NP2VEC_MODEL_FILE]
[--binary] [--word_ngrams {0,1}]
optional arguments:
-h, --help show this help message and exit
--np2vec_model_file NP2VEC_MODEL_FILE
path to the file with the np2vec model to load.
--binary boolean indicating whether the model to load has been
stored in binary format.
--word_ngrams {0,1} If 0, the model to load stores word information. If 1,
the model to load stores subword (ngrams) information;
note that subword information is relevant only to
fasttext models.
```
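For inference, a model stored in word2vec format can be loaded back and queried. The sketch below again assumes a gensim-style model file (binary or text, matching the `--binary` flag), which is an assumption rather than something the README specifies; the file path is a placeholder.

```python
# Sketch: load a stored np2vec model and query an NP embedding.
# Assumes gensim and a word2vec-format file; the path is a placeholder.
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("np2vec_model.txt", binary=False)  # binary=True if --binary was used

np_term = "Natural_Language_Processing_"   # NP's keep the marking suffix
if np_term in wv:
    print(wv[np_term][:5])                 # first 5 dimensions of the NP vector
    print(wv.most_similar(np_term, topn=3))
```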

38 changes: 8 additions & 30 deletions core_models/np_semantic_segmentation/README.md
@@ -15,7 +15,9 @@ This model trains MLP classifier and inference from such classifier in order to
for the given NP.

## Dataset
You can download Tratz 2011 et al. dataset [1,2] from the following link:
The expected dataset is a CSV file with 2 columns: the first column contains the Noun-Phrase string (a Noun-Phrase containing 2 words), and the second column contains the correct label (if the 2-word Noun-Phrase is a collocation, the label is 1, else 0).
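For example, a minimal input file in this format could be created as follows; the NP strings and labels here are invented illustrations, not part of any provided dataset.

```python
# Sketch: a tiny 2-column dataset file (NP string, collocation label).
# The rows below are invented examples for illustration only.
import csv

rows = [
    ("hot dog", 1),    # the pair acts as a single collocation
    ("fresh dog", 0),  # an ordinary compositional phrase
]

with open("prepared_data.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)
```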

If you wish to use an existing dataset for training the model, you can download Tratz 2011 et al. dataset [1,2] from the following link:
[Tratz 2011 Dataset](https://vered1986.github.io/papers/Tratz2011_Dataset.tar.gz)

After downloading and unzipping the dataset, run `preprocess_tratz2011.py` in order to construct the labeled data and save it in a CSV file (as expected for the model).
@@ -28,7 +30,7 @@ Parameters can be obtained by running:


### Pre-processing the data:
A feature vector is extracted from each Noun-Phrase string:
A feature vector is extracted from each Noun-Phrase string using the command `python data.py` (an illustrative sketch of these features follows the list below):

* Word2Vec word embedding (a 300-dimensional vector for each word in the Noun-Phrase).
* The pre-trained Google News Word2vec model can be downloaded [here](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing)
@@ -37,47 +39,23 @@ A feature vector is extracted from each Noun-Phrase string:
* A binary feature indicating whether the Noun-Phrase has an existing entity in Wikidata.
* A binary feature indicating whether the Noun-Phrase has an existing entity in WordNet.
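A rough sketch of how such features could be assembled is shown below. It uses gensim for the embeddings and NLTK's WordNet interface for the lexical check, which are assumptions about tooling rather than the repository's actual `data.py`, and it omits the Wikidata flag (that would be an analogous yes/no query against the Wikidata search API).

```python
# Sketch: assemble a feature vector for a 2-word NP under assumed tooling
# (gensim for Word2Vec, NLTK for WordNet); not the repository's data.py.
# Requires: nltk.download('wordnet') and the GoogleNews vectors file.
import numpy as np
from gensim.models import KeyedVectors
from nltk.corpus import wordnet as wn

w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def np_features(noun_phrase):
    w1, w2 = noun_phrase.split()
    v1, v2 = w2v[w1], w2v[w2]  # 300-d embedding per word
    in_wordnet = 1.0 if wn.synsets(noun_phrase.replace(" ", "_")) else 0.0
    # A Wikidata flag would be computed analogously via its search API.
    return np.concatenate([v1, v2, [in_wordnet]])

print(np_features("hot dog").shape)  # (601,)
```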

#### pre-processing the dataset:
Parameters can be obtained by running:

python data.py -h
--data DATA path to the CSV file where the raw dataset is saved
--output OUTPUT path to the CSV file where the prepared dataset will be saved
--w2v_path W2V_PATH path to the word embedding's model (default: None)
--http_proxy HTTP_PROXY system's http proxy (default: None)
--https_proxy HTTPS_PROXY system's https proxy (default: None)

Quick example:

python data.py --data input_data_path.csv --output output_prepared_path.csv --w2v_path <path_to_w2v>/GoogleNews-vectors-negative300.bin.gz

## Training:
Train the MLP classifier and evaluate it.
Parameters can be obtained by running:

python train.py -h
--data DATA Path to the CSV file where the prepared dataset is saved
--model_path MODEL_PATH Path to save the model

## Training
The command `python train.py` will train the MLP classifier and evaluate it.
After training is done, the model is saved automatically:

`<model_name>.prm` - the trained model

Quick example:

python train.py --data prepared_data_path.csv --model np_semantic_segmentation_path.prm

## Inference:
## Inference
In order to run inference you need a pre-trained `<model_name>.prm` file and a data CSV file
that was generated by `prepare_data.py`.
The result of `inference.py` is a CSV file, each row contains the model's inference in respect to the input data.
The result of `python inference.py` is a CSV file; each row contains the model's inference with respect to the input data.

python inference.py -h
--data DATA prepared data CSV file path (default: None)
--model MODEL path to the trained model file (default: None)
--print_stats PRINT_STATS print evaluation stats for the model predictions - if your data has tagging (default: False)
--output OUTPUT path to location for inference output file (default: None)
Quick example:

python inference.py --model np_semantic_segmentation_path.prm --data prepared_data_path.csv --output inference_data.csv --print_stats True

@@ -12,11 +12,6 @@ https://rajpurkar.github.io/SQuAD-explorer/
Train the model using the following command
`python train.py -bgpu --gpu_id 0`

The command line options available are (an illustrative argparse sketch follows the list):
- `--gpu_id` select the id of the GPU to train the model on. Default is 0.
- `--max_para_req` enter the max paragraph length used to truncate the dataset. Default is 100. Currently the code has been tested for a maximum paragraph length of 100.
- `--batch_size_squad` enter the batch size (please note that 30 is the max batch size that will fit on a GPU with 12 GB of memory). Default is 16.
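For reference, the options listed above map onto a straightforward argparse declaration. The sketch below is illustrative only and is not the actual parser in `train.py` (for instance, it omits the `-bgpu` flag shown in the training command).

```python
# Sketch: an illustrative argparse declaration for the documented options.
# This is not the repository's actual train.py parser.
import argparse

parser = argparse.ArgumentParser(description="Train the SQuAD reading comprehension model")
parser.add_argument("--gpu_id", type=int, default=0,
                    help="id of the GPU to train the model on")
parser.add_argument("--max_para_req", type=int, default=100,
                    help="max paragraph length used to truncate the dataset")
parser.add_argument("--batch_size_squad", type=int, default=16,
                    help="batch size (30 is the max that fits on a 12 GB GPU)")
args = parser.parse_args()
```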

## Results
After training starts, you will see outputs as shown below:
```
