Added script and instructions for downloading corenlp. Added official_eval.py to default download.sh for convenience.
ajfisch committed Jul 28, 2017
1 parent 91fab68 commit 9c851d1
Showing 5 changed files with 64 additions and 1 deletion.
3 changes: 2 additions & 1 deletion .gitignore
@@ -3,4 +3,5 @@
*~
data/
*.tar.gz
*.egg-info
*.egg-info
scripts/reader/official_eval.py
15 changes: 15 additions & 0 deletions README.md
@@ -114,6 +114,21 @@ drqa.tokenizer.set_default('corenlp_classpath', '/your/corenlp/classpath/*')

Ex: `export CLASSPATH=$CLASSPATH:/path/to/corenlp/download/*`.

If you do not already have a CoreNLP [download](https://stanfordnlp.github.io/CoreNLP/index.html#download), you can run:

```bash
./install_corenlp.sh
```

_The script prompts for a download location; press Enter to accept the default (`data/corenlp`)._

Verify that it runs:
```python
from drqa.tokenizers import CoreNLPTokenizer
tok = CoreNLPTokenizer()
tok.tokenize('hello world').words()  # Should complete immediately
```
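
If the check above hangs or raises, one thing worth ruling out is a `CLASSPATH` that does not actually reach the CoreNLP jars. The following is a minimal sketch, not part of DrQA, that expands each `CLASSPATH` entry and lists the `stanford-corenlp*.jar` files it finds. The helper name `find_corenlp_jars` is hypothetical, and the `stanford-corenlp` filename prefix is an assumption based on the archive name used by the install script.

```python
import glob
import os


def find_corenlp_jars(classpath):
    """Return the stanford-corenlp jar files reachable from a CLASSPATH string.

    Hypothetical helper for illustration; it is not part of DrQA.
    """
    jars = []
    for entry in classpath.split(os.pathsep):
        if not entry:
            continue
        if entry.endswith('*'):
            # Java treats a trailing '*' as "every jar in this directory".
            pattern = os.path.join(os.path.dirname(entry), '*.jar')
            candidates = glob.glob(pattern)
        elif os.path.isfile(entry):
            # A jar listed explicitly on the classpath.
            candidates = [entry]
        else:
            # A bare directory entry does not pull in the jars inside it.
            candidates = []
        jars.extend(
            path for path in candidates
            if os.path.basename(path).startswith('stanford-corenlp')
        )
    return sorted(jars)


if __name__ == '__main__':
    found = find_corenlp_jars(os.environ.get('CLASSPATH', ''))
    print('\n'.join(found) if found else 'No CoreNLP jars found on CLASSPATH')
```

An empty result suggests the `export CLASSPATH=...` line has not been run in the current shell, or that it points at a directory without the trailing `/*`.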

For convenience, the Document Reader, Retriever, and Pipeline modules will try to load default models if no model argument is given. See below for downloading these models.

### Trained Models and Data
3 changes: 3 additions & 0 deletions download.sh
@@ -37,6 +37,9 @@ python scripts/convert/squad.py "$DATASET_PATH/SQuAD-v1.1-train.json" "$DATASET_
wget -O "$DATASET_PATH/SQuAD-v1.1-dev.json" "https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json"
python scripts/convert/squad.py "$DATASET_PATH/SQuAD-v1.1-dev.json" "$DATASET_PATH/SQuAD-v1.1-dev.txt"

# Download official eval for SQuAD
curl "https://worksheets.codalab.org/rest/bundles/0xbcd57bee090b421c982906709c8c27e1/contents/blob/" > "./scripts/reader/official_eval.py"

# Get WebQuestions train
wget -O "$DATASET_PATH/WebQuestions-train.json.bz2" "http://nlp.stanford.edu/static/software/sempre/release-emnlp2013/lib/data/webquestions/dataset_11/webquestions.examples.train.json.bz2"
bunzip2 -f "$DATASET_PATH/WebQuestions-train.json.bz2"
38 changes: 38 additions & 0 deletions install_corenlp.sh
@@ -0,0 +1,38 @@
#!/bin/bash
# Copyright 2017-present, Facebook, Inc.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.

set -e

# Prompt for a download path; default to data/corenlp
read -p "Specify download path or enter to use default (data/corenlp): " path
DOWNLOAD_PATH="${path:-data/corenlp}"
echo "Will download to: $DOWNLOAD_PATH"

# Download zip, unzip
pushd "/tmp"
wget -O "stanford-corenlp-full-2017-06-09.zip" "http://nlp.stanford.edu/software/stanford-corenlp-full-2017-06-09.zip"
unzip "stanford-corenlp-full-2017-06-09.zip"
rm "stanford-corenlp-full-2017-06-09.zip"
popd

# Put jars in DOWNLOAD_PATH
mkdir -p "$DOWNLOAD_PATH"
mv "/tmp/stanford-corenlp-full-2017-06-09/"*".jar" "$DOWNLOAD_PATH/"

# Append to bashrc, instructions
while read -p "Add to ~/.bashrc CLASSPATH (recommended)? [yes/no]: " choice; do
    case "$choice" in
        yes )
            echo "export CLASSPATH=\$CLASSPATH:$DOWNLOAD_PATH/*" >> ~/.bashrc
            break ;;
        no )
            break ;;
        * ) echo "Please answer yes or no." ;;
    esac
done

printf "\n*** NOW RUN: ***\n\nexport CLASSPATH=\$CLASSPATH:$DOWNLOAD_PATH/*\n\n****************\n"
6 changes: 6 additions & 0 deletions scripts/reader/README.md
@@ -132,6 +132,12 @@ Optional arguments:

Note: The CoreNLP NER annotator is not fully deterministic: its output depends on the order in which examples are processed. Predictions may fluctuate very slightly between runs if `num-workers` > 1 and the model was trained with `use-ner` on.

Evaluation uses the `official_eval.py` script from the SQuAD creators, available [here](https://worksheets.codalab.org/rest/bundles/0xbcd57bee090b421c982906709c8c27e1/contents/blob/). Running `./download.sh` also saves it to `scripts/reader/official_eval.py`.

```bash
python scripts/reader/official_eval.py /path/to/format/B/dataset.json /path/to/predictions/with/--official/flag/set.json
```
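
For readers unfamiliar with the two numbers the script reports (exact match and F1), here is a simplified sketch of how such scores can be computed for a single prediction against a single reference answer. It is an illustration written for this note, not the official script: the function names are chosen here, and details such as handling several reference answers per question and averaging over the dataset are omitted.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize_answer(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = ''.join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r'\b(a|an|the)\b', ' ', text)
    return ' '.join(text.split())


def exact_match(prediction, ground_truth):
    """True when the two answers are identical after normalization."""
    return normalize_answer(prediction) == normalize_answer(ground_truth)


def f1(prediction, ground_truth):
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(ground_truth).split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)
```

Exact match rewards only fully correct spans, while F1 gives partial credit when the predicted span overlaps the reference, which is why the two numbers are usually reported together.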

## Interactive

The Document Reader can also be used interactively (like the [full pipeline](../../README.md#quick-start-demo)).