
Commit

increase consistency
bethke committed May 9, 2018
1 parent 94fbd5c commit a73fc58
Showing 6 changed files with 29 additions and 125 deletions.
3 changes: 1 addition & 2 deletions core_models/chunker/README.md
@@ -14,7 +14,6 @@ In this example the sentence can be divided into 4 phrases, `The quick brown fox

We used the CoNLL2000 dataset in our example for training a phrase chunker. More info about this dataset can be found [here](https://www.clips.uantwerpen.be/conll2000/chunking/).

## Usage
### Training
Train a model with default parameters (only tokens, default network settings):
`python train.py`
@@ -24,7 +23,7 @@ Saving the model after training is done automatically:
* `<chunker>_settings.dat` - Model topology and input settings

### Inference
To run inference on a trained model, you need the pre-trained `chunker.prm` and `chunker_settings.dat` model files.

Quick example:
```
4 changes: 1 addition & 3 deletions core_models/kvmemn2n/README.md
@@ -3,11 +3,10 @@ This directory contains an implementation of an end-to-end key-value memory netw
The idea behind this method is to be able to answer a wide range of questions based on a large set of textual information, as opposed to a restricted or sparse knowledge base.

# Dataset
Please download the tar file from http://www.thespermwhale.com/jaseweston/babi/movieqa.tar.gz and expand the folder into your desired data directory or `--data_dir`.
Please download the tar file from http://www.thespermwhale.com/jaseweston/babi/movieqa.tar.gz and expand the folder into your desired data directory or `--data_dir`. The dataset can be downloaded from the command line if it is not found, and all of the preprocessing happens at the beginning of training.
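If you prefer to script the manual download and extraction step above, a minimal sketch using only the Python standard library is shown below; the `data` target directory is an assumption, so point it at whatever you pass as `--data_dir`.

```python
# Sketch: download and extract the MovieQA archive into a data directory.
# The "data" directory name is an assumption; match it to your --data_dir.
import os
import tarfile
import urllib.request

URL = "http://www.thespermwhale.com/jaseweston/babi/movieqa.tar.gz"
data_dir = "data"
archive = os.path.join(data_dir, "movieqa.tar.gz")

os.makedirs(data_dir, exist_ok=True)
if not os.path.exists(archive):
    urllib.request.urlretrieve(URL, archive)  # download the tar file
with tarfile.open(archive, "r:gz") as tar:
    tar.extractall(data_dir)                  # expands the movieqa folder
```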

# Training
The base command to train is `python train_kvmemn2n.py`.
To get all the options, run `python train_kvmemn2n.py -h`.
The following are example commands for training with the knowledge base and with raw text, respectively:
```
python train_kvmemn2n.py --epochs 2000 --batch_size 32 --emb_size 100 --use_v_luts --model_file path_to_model_dir/kb_model
@@ -17,7 +16,6 @@ python train_kvmemn2n.py --mem_mode text --epochs 2000 --batch_size 32 --emb_siz
```

# Interactive Mode

You can enter an interactive mode using the argument `--interactive`. The interactive mode can be called to launch at the end of training, or directly after `--inference`. To run inference on the KB model from above, we would call:

```
16 changes: 13 additions & 3 deletions core_models/most_common_word_sense/README.md
@@ -2,6 +2,14 @@
The most common word sense algorithm's goal is to extract the most common sense of a target word. The input to the algorithm is the target word, and the output is the set of senses of the target word, where each sense is scored according to how commonly it is used in the language.

## Prepare training and validation test sets
The training module takes as input a gold-standard CSV file, which is a list of target words where each word is associated with a CLASS_LABEL - a correct (true example) or an incorrect (false example) sense. A sense consists of the definition and the inherited hypernyms of the target word in a specific sense.
The user needs to prepare this gold-standard CSV file in advance. The file should include the following 4 columns (a minimal sketch of such a file follows the list below):
| TARGET_WORD | DEFINITION | SEMANTIC_BRANCH | CLASS_LABEL
where:
1. TARGET_WORD (string): the word that you want to get the most common sense of, e.g. chair
2. DEFINITION (string): the definition of the word (usually a single sentence) extracted from an external resource such as WordNet or Wikidata, e.g. an artifact that is designed for sitting
3. SEMANTIC_BRANCH (string): [comma separated] the inherited hypernyms extracted from an external resource such as WordNet or Wikidata, e.g. [furniture, artifact]
4. CLASS_LABEL (string): a binary value 0/1 that represents whether the sense (definition and semantic branch) is the most common sense of the target word, e.g. 1
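As a concrete illustration, the snippet below writes a two-row gold-standard file with the four columns described above; the file name and the second example row are hypothetical.

```python
# Sketch: build a tiny gold-standard CSV with the 4 required columns.
# "gold_standard.csv" and the second example row are hypothetical.
import csv

rows = [
    # TARGET_WORD, DEFINITION, SEMANTIC_BRANCH, CLASS_LABEL
    ("chair", "an artifact that is designed for sitting", "furniture, artifact", "1"),
    ("chair", "the position of professor", "position, status", "0"),
]

with open("gold_standard.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["TARGET_WORD", "DEFINITION", "SEMANTIC_BRANCH", "CLASS_LABEL"])
    writer.writerows(rows)
```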

`python prepare_data [--gold_standard_file GOLD_STANDARD_FILE]
[--word_embedding_model_file WORD_EMBEDDING_MODEL_FILE]
@@ -13,19 +21,21 @@ Train the MLP classifier and evaluate it.

`python train.py [--data_set_file DATA_SET_FILE] [--model_prm MODEL_PRM]`

### Example:
Quick example:

`python train.py --data_set_file data/data_set.pkl
--model_prm data/wsd_classification_model.prm`

## Inference
When running inference, note that the results are printed to the terminal in different colors; a white terminal background is therefore best for viewing the results.

`python inference.py [--max_num_of_senses_to_search N]
[--input_inference_examples_file INPUT_INFERENCE_EXAMPLE_FILE]
[--model_prm MODEL_PRM] [--word_embedding_model_file WORD_EMBEDDING_MODEL_FILE]`

### Example:
Quick example:

`python inference.py --max_num_of_senses_to_search 3
--input_inference_examples_file data/input_inference_examples.csv
--word_embedding_model_file pretrained_models/GoogleNews-vectors-negative300.bin
--model_prm data/wsd_classification_model.prm`

88 changes: 6 additions & 82 deletions core_models/np2vec/README.md
@@ -1,7 +1,7 @@
# NP2vec - Word Embedding Model Training for Noun Phrases

Noun Phrases (NP) play a particular role in NLP algorithms.
This code trains a word embedding model for NP's using the [word2vec](https://code.google.com/archive/p/word2vec/) or [fasttext](https://github.com/facebookresearch/fastText) algorithm.
It assumes that the NP's are already extracted and marked in the input corpus.
All the terms in the corpus are used as context in order to train the word embedding model; however, at the end of training, only the word embeddings of the NP's are stored, except in the case of
fasttext training with word_ngrams=1; in this case, we store all the word embeddings, including non-NP's, in order to be able to estimate word embeddings of out-of-vocabulary NP's (NP's that don't appear in
@@ -12,10 +12,11 @@ Note that this code can be also used to train a word embedding's model on any ma
NP's have to be marked in the corpus by a marking character between the words of the NP and as a suffix of the NP.
For example, if the marking character is '\_', the NP "Natural Language Processing" will be marked as "Natural_Language_Processing_".
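The marking convention can be reproduced with a few lines of plain Python; the sketch below assumes the NP token spans have already been extracted by whatever chunker or NP extractor you use.

```python
# Sketch: mark pre-extracted NP's in a tokenized sentence with mark_char='_'.
# NP spans are assumed to come from an external NP extractor.
def mark_nps(tokens, np_spans, mark_char="_"):
    """Join each NP's tokens with mark_char and append it as a suffix."""
    spans = dict(np_spans)  # {start_index: end_index (exclusive)}
    marked, i = [], 0
    while i < len(tokens):
        if i in spans:
            end = spans[i]
            marked.append(mark_char.join(tokens[i:end]) + mark_char)
            i = end
        else:
            marked.append(tokens[i])
            i += 1
    return marked

tokens = ["I", "love", "Natural", "Language", "Processing", "."]
print(mark_nps(tokens, [(2, 5)]))
# ['I', 'love', 'Natural_Language_Processing_', '.']
```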

## Training Usage
## Training:
To train, use the following command:

```
usage: main_train.py [-h] [--corpus CORPUS] [--corpus_format {json,txt}]
python main_train.py [-h] [--corpus CORPUS] [--corpus_format {json,txt}]
[--mark_char MARK_CHAR]
[--word_embedding_type {word2vec,fasttext}]
[--np2vec_model_file NP2VEC_MODEL_FILE] [--binary]
@@ -25,88 +26,11 @@ usage: main_train.py [-h] [--corpus CORPUS] [--corpus_format {json,txt}]
[--workers WORKERS] [--hs {0,1}] [--negative NEGATIVE]
[--cbow_mean {0,1}] [--iter ITER] [--min_n MIN_N]
[--max_n MAX_N] [--word_ngrams {0,1}]
optional arguments:
-h, --help show this help message and exit
--corpus CORPUS path to the file with the input marked corpus
--corpus_format {json,txt}
format of the input marked corpus; txt and json
formats are supported. For json format, the file
should contain an iterable of sentences. Each sentence
is a list of terms (unicode strings) that will be used
for training.
--mark_char MARK_CHAR
special character that marks NP's suffix.
--word_embedding_type {word2vec,fasttext}
word embedding model type; word2vec and fasttext are
supported.
--np2vec_model_file NP2VEC_MODEL_FILE
path to the file where the trained np2vec model has to
be stored.
--binary boolean indicating whether the model is stored in
binary format; if word_embedding_type is fasttext and
word_ngrams is 1, binary should be set to True.
--sg {0,1} model training hyperparameter, skip-gram. Defines the
training algorithm. If 1, CBOW is used, otherwise,
skip-gram is employed.
--size SIZE model training hyperparameter, size of the feature
vectors.
--window WINDOW model training hyperparameter, maximum distance
between the current and predicted word within a
sentence.
--alpha ALPHA model training hyperparameter. The initial learning
rate.
--min_alpha MIN_ALPHA
model training hyperparameter. Learning rate will
linearly drop to `min_alpha` as training progresses.
--min_count MIN_COUNT
model training hyperparameter, ignore all words with
total frequency lower than this.
--sample SAMPLE model training hyperparameter, threshold for
configuring which higher-frequency words are randomly
downsampled, useful range is (0, 1e-5)
--workers WORKERS model training hyperparameter, number of worker
threads.
--hs {0,1} model training hyperparameter, hierarchical softmax.
If set to 1, hierarchical softmax will be used for
model training. If set to 0, and `negative` is non-
zero, negative sampling will be used.
--negative NEGATIVE model training hyperparameter, negative sampling. If >
0, negative sampling will be used, the int for
negative specifies how many "noise words" should be
drawn (usually between 5-20). If set to 0, no negative
sampling is used.
--cbow_mean {0,1} model training hyperparameter. If 0, use the sum of
the context word vectors. If 1, use the mean, only
applies when cbow is used.
--iter ITER model training hyperparameter, number of iterations.
--min_n MIN_N fasttext training hyperparameter. Min length of char
ngrams to be used for training word representations.
--max_n MAX_N fasttext training hyperparameter. Max length of char
ngrams to be used for training word representations.
Set `max_n` to be lesser than `min_n` to avoid char
ngrams being used.
--word_ngrams {0,1} fasttext training hyperparameter. If 1, enriches
word vectors with subword (ngrams) information. If 0,
this is equivalent to word2vec training.
```
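The hyperparameters above mirror the usual word2vec/fasttext parameters, so a minimal training loop could look like the sketch below. It assumes a gensim (4.x) backend, which the README does not state, and uses a toy in-memory corpus in place of `--corpus`; all parameter values are illustrative.

```python
# Sketch: train word2vec on a marked corpus, assuming a gensim 4.x backend.
# The toy corpus and all parameter values are illustrative only.
from gensim.models import Word2Vec

corpus = [
    ["I", "love", "Natural_Language_Processing_", "."],
    ["Natural_Language_Processing_", "is", "fun", "."],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=100,   # --size
    window=10,         # --window
    alpha=0.025,       # --alpha
    min_alpha=0.0001,  # --min_alpha
    min_count=1,       # --min_count (tiny corpus, so keep everything)
    sample=1e-5,       # --sample
    workers=4,         # --workers
    sg=1,              # --sg (in gensim, 1 selects skip-gram)
    hs=0,              # --hs
    negative=25,       # --negative
    cbow_mean=1,       # --cbow_mean (only used when sg=0)
    epochs=15,         # --iter
)

# Embeddings of marked NP's (those carrying the mark_char suffix); per the
# README, only these are kept in the plain word2vec case.
np_vectors = {w: model.wv[w] for w in model.wv.index_to_key if w.endswith("_")}

# Store the vectors in word2vec format (--np2vec_model_file, --binary).
model.wv.save_word2vec_format("np2vec_model.txt", binary=False)
```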

## Inference Usage

## Inference:
Inference on a model can then be completed using:
```
usage: main_inference.py [-h] [--np2vec_model_file NP2VEC_MODEL_FILE]
[--binary] [--word_ngrams {0,1}]
optional arguments:
-h, --help show this help message and exit
--np2vec_model_file NP2VEC_MODEL_FILE
path to the file with the np2vec model to load.
--binary boolean indicating whether the model to load has been
stored in binary format.
--word_ngrams {0,1} If 0, the model to load stores word information. If 1,
the model to load stores subword (ngrams) information;
note that subword information is relevant only to
fasttext models.
```
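For inference, a model stored in word2vec format can be loaded back and queried. The sketch below again assumes a gensim-style model file (binary or text, matching the `--binary` flag), which is an assumption rather than something the README specifies; the file path is a placeholder.

```python
# Sketch: load a stored np2vec model and query an NP embedding.
# Assumes gensim and a word2vec-format file; the path is a placeholder.
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("np2vec_model.txt", binary=False)  # binary=True if --binary was used

np_term = "Natural_Language_Processing_"   # NP's keep the marking suffix
if np_term in wv:
    print(wv[np_term][:5])                 # first 5 dimensions of the NP vector
    print(wv.most_similar(np_term, topn=3))
```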

38 changes: 8 additions & 30 deletions core_models/np_semantic_segmentation/README.md
@@ -15,7 +15,9 @@ This model trains MLP classifier and inference from such classifier in order to
for the given NP.

## Dataset
You can download Tratz 2011 et al. dataset [1,2] from the following link:
The expected dataset is a CSV file with 2 columns: the first column contains the Noun-Phrase string (a Noun-Phrase containing 2 words), and the second column contains the correct label (if the 2-word Noun-Phrase is a collocation, the label is 1, else 0).
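For example, a minimal input file in this format could be created as follows; the NP strings and labels here are invented illustrations, not part of any provided dataset.

```python
# Sketch: a tiny 2-column dataset file (NP string, collocation label).
# The rows below are invented examples for illustration only.
import csv

rows = [
    ("hot dog", 1),    # the pair acts as a single collocation
    ("fresh dog", 0),  # an ordinary compositional phrase
]

with open("prepared_data.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)
```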

If you wish to use an existing dataset for training the model, you can download Tratz 2011 et al. dataset [1,2] from the following link:
[Tratz 2011 Dataset](https://vered1986.github.io/papers/Tratz2011_Dataset.tar.gz)

After downloading and unzipping the dataset, run `preprocess_tratz2011.py` in order to construct the labeled data and save it in a CSV file (as expected for the model).
@@ -28,7 +30,7 @@ Parameters can be obtained by running:


### Pre-processing the data:
A feature vector is extracted from each Noun-Phrase string:
A feature vector is extracted from each Noun-Phrase string using the command `python data.py` (an illustrative sketch of these features follows the list below):

* Word2Vec word embedding (a 300-dimensional vector for each word in the Noun-Phrase).
* The pre-trained Google News Word2vec model can be downloaded [here](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing)
@@ -37,47 +39,23 @@ A feature vector is extracted from each Noun-Phrase string:
* A binary feature indicating whether the Noun-Phrase has an existing entity in Wikidata.
* A binary feature indicating whether the Noun-Phrase has an existing entity in WordNet.
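A rough sketch of how such features could be assembled is shown below. It uses gensim for the embeddings and NLTK's WordNet interface for the lexical check, which are assumptions about tooling rather than the repository's actual `data.py`, and it omits the Wikidata flag (that would be an analogous yes/no query against the Wikidata search API).

```python
# Sketch: assemble a feature vector for a 2-word NP under assumed tooling
# (gensim for Word2Vec, NLTK for WordNet); not the repository's data.py.
# Requires: nltk.download('wordnet') and the GoogleNews vectors file.
import numpy as np
from gensim.models import KeyedVectors
from nltk.corpus import wordnet as wn

w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def np_features(noun_phrase):
    w1, w2 = noun_phrase.split()
    v1, v2 = w2v[w1], w2v[w2]  # 300-d embedding per word
    in_wordnet = 1.0 if wn.synsets(noun_phrase.replace(" ", "_")) else 0.0
    # A Wikidata flag would be computed analogously via its search API.
    return np.concatenate([v1, v2, [in_wordnet]])

print(np_features("hot dog").shape)  # (601,)
```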

#### pre-processing the dataset:
Parameters can be obtained by running:

python data.py -h
--data DATA path to the CSV file where the raw dataset is saved
--output OUTPUT path to the CSV file where the prepared dataset will be saved
--w2v_path W2V_PATH path to the word embedding's model (default: None)
--http_proxy HTTP_PROXY system's http proxy (default: None)
--https_proxy HTTPS_PROXY system's https proxy (default: None)

Quick example:

python data.py --data input_data_path.csv --output output_prepared_path.csv --w2v_path <path_to_w2v>/GoogleNews-vectors-negative300.bin.gz

## Training:
Train the MLP classifier and evaluate it.
Parameters can be obtained by running:

python train.py -h
--data DATA Path to the CSV file where the prepared dataset is saved
--model_path MODEL_PATH Path to save the model

## Training
The command `python train.py` will train the MLP classifier and evaluate it.
After training is done, the model is saved automatically:

`<model_name>.prm` - the trained model

Quick example:

python train.py --data prepared_data_path.csv --model np_semantic_segmentation_path.prm

## Inference:
## Inference
In order to run inference you need a pre-trained `<model_name>.prm` file and a data CSV file
that was generated by `prepare_data.py`.
The result of `inference.py` is a CSV file, each row contains the model's inference in respect to the input data.
The result of `python inference.py` is a CSV file; each row contains the model's inference with respect to the input data.

python inference.py -h
--data DATA prepared data CSV file path (default: None)
--model MODEL path to the trained model file (default: None)
--print_stats PRINT_STATS print evaluation stats for the model predictions - if your data has tagging (default: False)
--output OUTPUT path to location for inference output file (default: None)
Quick example:

python inference.py --model np_semantic_segmentation_path.prm --data prepared_data_path.csv --output inference_data.csv --print_stats True

@@ -12,11 +12,6 @@ https://rajpurkar.github.io/SQuAD-explorer/
Train the model using the following command
`python train.py -bgpu --gpu_id 0`

The command line options available are (an illustrative argparse sketch follows the list):
- `--gpu_id` select the id of the GPU to train the model on. Default is 0.
- `--max_para_req` enter the max paragraph length used to truncate the dataset. Default is 100. Currently the code has been tested for a maximum paragraph length of 100.
- `--batch_size_squad` enter the batch size (please note that 30 is the max batch size that will fit on a GPU with 12 GB of memory). Default is 16.
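For reference, the options listed above map onto a straightforward argparse declaration. The sketch below is illustrative only and is not the actual parser in `train.py` (for instance, it omits the `-bgpu` flag shown in the training command).

```python
# Sketch: an illustrative argparse declaration for the documented options.
# This is not the repository's actual train.py parser.
import argparse

parser = argparse.ArgumentParser(description="Train the SQuAD reading comprehension model")
parser.add_argument("--gpu_id", type=int, default=0,
                    help="id of the GPU to train the model on")
parser.add_argument("--max_para_req", type=int, default=100,
                    help="max paragraph length used to truncate the dataset")
parser.add_argument("--batch_size_squad", type=int, default=16,
                    help="batch size (30 is the max that fits on a 12 GB GPU)")
args = parser.parse_args()
```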

## Results
After training starts, you will see outputs as shown below:
```
