Fix multiNLI files
kohpangwei committed Mar 26, 2020
1 parent d5dd90b commit ae0a384
Showing 2 changed files with 10 additions and 6 deletions.
10 changes: 7 additions & 3 deletions README.md
@@ -68,7 +68,7 @@ Our code expects the following files/folders in the `[root_dir]/cub` directory:

- `data/waterbird_complete95_forest2water2/`

-You can download a tarball of this dataset [here](https://nlp.stanford.edu/data/waterbird_complete95_forest2water2.tar.gz).
+You can download a tarball of this dataset [here](https://nlp.stanford.edu/data/dro/waterbird_complete95_forest2water2.tar.gz).

A sample command to run group DRO on Waterbirds is:
`python run_expt.py -s confounder -d CUB -t waterbird_complete95 -c forest2water2 --lr 0.001 --batch_size 128 --weight_decay 0.0001 --model resnet50 --n_epochs 300 --reweight_groups --robust --alpha 0.01 --gamma 0.1 --generalization_adjustment 0`
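
For reference (not part of this commit): a minimal Python sketch of downloading and unpacking the Waterbirds tarball linked above. The `root_dir` path is an assumption, and depending on the archive layout you may need to move the extracted folder so that `data/waterbird_complete95_forest2water2/` sits under `[root_dir]/cub` as described.

```python
import os
import tarfile
import urllib.request

# Assumed paths (not from the commit): adjust root_dir to your setup.
root_dir = '/path/to/root_dir'
cub_dir = os.path.join(root_dir, 'cub')
os.makedirs(cub_dir, exist_ok=True)

# Download the fixed tarball URL from the diff above.
url = 'https://nlp.stanford.edu/data/dro/waterbird_complete95_forest2water2.tar.gz'
archive_path = os.path.join(cub_dir, 'waterbird_complete95_forest2water2.tar.gz')
urllib.request.urlretrieve(url, archive_path)

# Unpack next to the archive; afterwards, verify that
# [root_dir]/cub/data/waterbird_complete95_forest2water2/ exists as the README expects,
# and move the extracted folder if the archive is laid out differently.
with tarfile.open(archive_path, 'r:gz') as tar:
    tar.extractall(cub_dir)
```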
@@ -90,9 +90,13 @@ If you'd like to generate variants of this dataset, we have included the script
Our code expects the following files/folders in the `[root_dir]/multinli` directory:

- `data/metadata_random.csv`
-- `glue_data/MNLI/`
+- `glue_data/MNLI/cached_dev_bert-base-uncased_128_mnli`
+- `glue_data/MNLI/cached_dev_bert-base-uncased_128_mnli-mm`
+- `glue_data/MNLI/cached_train_bert-base-uncased_128_mnli`

-We have included the metadata file in `dataset_metadata/multinli` in this repository. The GLUE data can be downloaded with [this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e); please note that you only need to download MNLI and not the other datasets. The metadata file records whether each example belongs to the train/val/test dataset as well as whether it contains a negation word.
+We have included the metadata file in `dataset_metadata/multinli` in this repository. The metadata file records whether each example belongs to the train/val/test dataset as well as whether it contains a negation word.

+The `glue_data/MNLI` files are generated by the [huggingface Transformers library](https://github.com/huggingface/transformers) and can be downloaded [here](https://nlp.stanford.edu/data/dro/multinli_bert_features.tar.gz).

A sample command to run group DRO on MultiNLI is:
`python run_expt.py -s confounder -d MultiNLI -t gold_label_random -c sentence2_has_negation --lr 2e-05 --batch_size 32 --weight_decay 0 --model bert --n_epochs 20`
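
For reference (not part of this commit): a minimal sketch that checks whether the `[root_dir]/multinli` layout listed above is in place. The `root_dir` value is a placeholder.

```python
import os

root_dir = '/path/to/root_dir'  # placeholder, not from the commit
multinli_dir = os.path.join(root_dir, 'multinli')

# Relative paths taken from the README list above.
expected = [
    'data/metadata_random.csv',
    'glue_data/MNLI/cached_dev_bert-base-uncased_128_mnli',
    'glue_data/MNLI/cached_dev_bert-base-uncased_128_mnli-mm',
    'glue_data/MNLI/cached_train_bert-base-uncased_128_mnli',
]

# Report which of the expected files are present.
for rel_path in expected:
    full_path = os.path.join(multinli_dir, rel_path)
    status = 'ok' if os.path.exists(full_path) else 'MISSING'
    print(f'{status:7} {full_path}')
```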
6 changes: 3 additions & 3 deletions data/multinli_dataset.py
@@ -81,9 +81,9 @@ def __init__(self, root_dir,
# Load features
self.features_array = []
for feature_file in [
-    'cached_train_bert-base-uncased_128_mnli', # Train
-    'cached_dev_bert-base-uncased_128_mnli', # Val
-    'cached_dev_bert-base-uncased_128_mnli-mm' # Test
+    'cached_train_bert-base-uncased_128_mnli',
+    'cached_dev_bert-base-uncased_128_mnli',
+    'cached_dev_bert-base-uncased_128_mnli-mm'
]:

features = torch.load(
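
For reference (not part of this diff): a minimal sketch of loading the three cached feature files that the loop above iterates over. The directory path is an assumption, and unpickling these files may require the huggingface `transformers` package to be importable, since they were generated by it.

```python
import os
import torch

# Assumed location of the extracted multinli_bert_features tarball (not from the commit).
glue_dir = '/path/to/root_dir/multinli/glue_data/MNLI'

# Mirrors the loop in multinli_dataset.py: each file is a torch-saved list of
# per-example features for one split (train / val / test).
for split, feature_file in [
        ('train', 'cached_train_bert-base-uncased_128_mnli'),
        ('val', 'cached_dev_bert-base-uncased_128_mnli'),
        ('test', 'cached_dev_bert-base-uncased_128_mnli-mm')]:
    features = torch.load(os.path.join(glue_dir, feature_file))
    print(f'{split}: {len(features)} cached examples')
```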
