
Refactoring COVID documentation and replication scripts (castorini#1417)
+ Reorganized my Dropbox, changed links
+ Updated indexing documentation and scripts to fix underlying changes at AI2
lintool authored Nov 20, 2020
1 parent 0762352 commit 6a69f2c
Showing 7 changed files with 130 additions and 157 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -86,10 +86,10 @@ For the most part, these runs are based on [_default_ parameter settings](https:
The experiments described below are not associated with rigorous end-to-end regression testing and thus provide a lower standard of replicability.
For the most part, manual copying and pasting of commands into a shell is required to replicate our results:

+ [Indexing AI2's COVID-19 Open Research Dataset](docs/experiments-cord19.md)
+ [Baselines for the TREC-COVID Challenge](docs/experiments-covid.md)
+ [Baselines for the TREC-COVID Challenge using doc2query](docs/experiments-covid-doc2query.md)
+ [Ingesting AI2's COVID-19 Open Research Dataset into Solr and Elasticsearch](docs/experiments-cord19-extras.md)
+ [Working with the 20 Newsgroups Dataset](docs/experiments-20newsgroups.md)
+ [Replicating "Neural Hype" Experiments](docs/experiments-forum2018.md)
+ [Guide to BM25 baselines for the MS MARCO Passage Ranking Task](docs/experiments-msmarco-passage.md)
37 changes: 0 additions & 37 deletions bin/download_and_index_cord19.sh

This file was deleted.

122 changes: 62 additions & 60 deletions docs/experiments-cord19.md

Large diffs are not rendered by default.

49 changes: 29 additions & 20 deletions docs/experiments-covid.md
@@ -3,6 +3,14 @@
This document describes various baselines for the [TREC-COVID Challenge](https://ir.nist.gov/covidSubmit/), which uses the [COVID-19 Open Research Dataset (CORD-19)](https://pages.semanticscholar.org/coronavirus-research) from the [Allen Institute for AI](https://allenai.org/).
Here, we focus on running retrieval experiments; for basic instructions on building Anserini indexes, see [this page](experiments-cord19.md).

## Quick Links

+ [Round 5](#round-5)
+ [Round 4](#round-4)
+ [Round 3](#round-3)
+ [Round 2](#round-2) ([Replication Commands](#round-2-replication-commands))
+ [Round 1](#round-1) ([Replication Commands](#round-1-replication-commands))

## Round 5

These are runs that can be easily replicated with Anserini, from pre-built indexes available [here](experiments-cord19.md#pre-built-indexes-all-versions) (version from 2020/07/16, the official corpus used in round 5).
@@ -391,12 +399,17 @@ Exact commands for replicating these runs are found [further down on this page](

## Round 2: Replication Commands

Here are the replication commands for the individual runs.

First, download the pre-built indexes using our script:

```
python src/main/python/trec-covid/download_indexes.py --date 2020-05-01
```
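As a rough sketch of what a downloader like this typically does (the filename handling and loop below are illustrative assumptions, not the script's actual code), assuming Dropbox-style share links ending in `?dl=1`:

```python
import os
import urllib.parse
import urllib.request


def local_name(url):
    """Derive a local tarball name from a share URL, dropping query
    strings such as '?dl=1'."""
    return os.path.basename(urllib.parse.urlparse(url).path)


def download_indexes(urls, dest_dir='indexes'):
    """Fetch each index tarball unless it is already present locally."""
    os.makedirs(dest_dir, exist_ok=True)
    for url in urls:
        target = os.path.join(dest_dir, local_name(url))
        if os.path.exists(target):
            print(f'{target} already exists, skipping download.')
            continue
        print(f'Fetching {url}...')
        urllib.request.urlretrieve(url, target)
```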

Abstract runs:

```
target/appassembler/bin/SearchCollection -index indexes/lucene-index-cord19-abstract-2020-05-01 \
-topicreader Covid -topics src/main/resources/topics-and-qrels/topics.covid-round2.xml -topicfield query+question \
-output runs/anserini.covid-r2.abstract.qq.bm25.txt -runtag anserini.covid-r2.abstract.qq.bm25.txt \
@@ -412,10 +425,11 @@ tools/eval/trec_eval.9.0.4/trec_eval -c -M1000 -m all_trec src/main/resources/to
python tools/eval/measure_judged.py --qrels src/main/resources/topics-and-qrels/qrels.covid-round1.txt --cutoffs 10 --run runs/anserini.covid-r2.abstract.qq.bm25.txt
python tools/eval/measure_judged.py --qrels src/main/resources/topics-and-qrels/qrels.covid-round1.txt --cutoffs 10 --run runs/anserini.covid-r2.abstract.qdel.bm25.txt
```
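`measure_judged.py` reports, at each cutoff, the fraction of retrieved documents that have relevance judgments. A minimal sketch of that computation (the in-memory data shapes here are assumptions; the actual script parses TREC-format run and qrels files):

```python
def judged_at_k(qrels, run, k=10):
    """Average, over topics in the run, of the fraction of top-k
    retrieved docids that appear in the qrels.

    qrels: {topic: set of judged docids}
    run:   {topic: list of docids, ranked best-first}
    """
    fractions = []
    for topic, ranking in run.items():
        judged = qrels.get(topic, set())
        top_k = ranking[:k]
        fractions.append(sum(1 for d in top_k if d in judged) / len(top_k))
    return sum(fractions) / len(fractions)
```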

Full-text runs:

```
target/appassembler/bin/SearchCollection -index indexes/lucene-index-cord19-full-text-2020-05-01 \
-topicreader Covid -topics src/main/resources/topics-and-qrels/topics.covid-round2.xml -topicfield query+question \
-output runs/anserini.covid-r2.full-text.qq.bm25.txt -runtag anserini.covid-r2.full-text.qq.bm25.txt \
@@ -431,10 +445,11 @@ tools/eval/trec_eval.9.0.4/trec_eval -c -M1000 -m all_trec src/main/resources/to
python tools/eval/measure_judged.py --qrels src/main/resources/topics-and-qrels/qrels.covid-round1.txt --cutoffs 10 --run runs/anserini.covid-r2.full-text.qq.bm25.txt
python tools/eval/measure_judged.py --qrels src/main/resources/topics-and-qrels/qrels.covid-round1.txt --cutoffs 10 --run runs/anserini.covid-r2.full-text.qdel.bm25.txt
```

Paragraph runs:

```
target/appassembler/bin/SearchCollection -index indexes/lucene-index-cord19-paragraph-2020-05-01 \
-topicreader Covid -topics src/main/resources/topics-and-qrels/topics.covid-round2.xml -topicfield query+question \
-output runs/anserini.covid-r2.paragraph.qq.bm25.txt -runtag anserini.covid-r2.paragraph.qq.bm25.txt \
@@ -494,13 +509,15 @@ python tools/eval/measure_judged.py --qrels src/main/resources/topics-and-qrels/
```

## Round 1: Replication Commands

First, download the pre-built indexes using our script:

```
python src/main/python/trec-covid/download_indexes.py --date 2020-04-10
```

Here are the commands to generate the runs on the abstract index:

```bash
target/appassembler/bin/SearchCollection -index indexes/lucene-index-covid-2020-04-10 \
-topicreader Covid -topics src/main/resources/topics-and-qrels/topics.covid-round1.xml -topicfield query -removedups \
-bm25 -output runs/run.covid-r1.abstract.query.bm25.txt
@@ -547,10 +564,6 @@ python tools/eval/measure_judged.py --qrels src/main/resources/topics-and-qrels/
```
Here are the commands to generate the runs on the full-text index:

```bash
target/appassembler/bin/SearchCollection -index indexes/lucene-index-covid-full-text-2020-04-10 \
-topicreader Covid -topics src/main/resources/topics-and-qrels/topics.covid-round1.xml -topicfield query -removedups \
-bm25 -output runs/run.covid-r1.full-text.query.bm25.txt
@@ -597,10 +610,6 @@ python tools/eval/measure_judged.py --qrels src/main/resources/topics-and-qrels/
```
Here are the commands to generate the runs on the paragraph index:

```bash
target/appassembler/bin/SearchCollection -index indexes/lucene-index-covid-paragraph-2020-04-10 \
-topicreader Covid -topics src/main/resources/topics-and-qrels/topics.covid-round1.xml -topicfield query \
-selectMaxPassage -bm25 -output runs/run.covid-r1.paragraph.query.bm25.txt
```
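The `-selectMaxPassage` flag used with the paragraph index collapses paragraph-level hits back to documents. A plausible sketch of the idea, assuming paragraph ids of the form `docid.00001` (the exact id scheme and tie-breaking are assumptions):

```python
def select_max_passage(ranked_passages, k=1000):
    """Collapse ranked (passage_id, score) pairs to documents, keeping
    each document's best-scoring passage, then re-rank by that score."""
    best = {}
    for passage_id, score in ranked_passages:
        docid = passage_id.split('.')[0]
        if docid not in best or score > best[docid]:
            best[docid] = score
    return sorted(best.items(), key=lambda x: -x[1])[:k]
```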
2 changes: 1 addition & 1 deletion src/main/java/io/anserini/index/IndexReaderUtils.java
@@ -228,7 +228,7 @@ public static Map<String, Long> getTermCountsWithAnalyzer(IndexReader reader, St
*
* @param reader index reader
* @param term term
* @return the document frequency of a term
*/
public static long getDF(IndexReader reader, String term) {
try {
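`getDF` returns a term's document frequency: the number of documents in the index that contain the term at least once. A toy illustration of the concept (not Anserini's Lucene-backed implementation):

```python
def document_frequency(corpus, term):
    """Count the documents containing the term at least once.

    corpus: list of documents, each a list of tokens.
    """
    return sum(1 for doc in corpus if term in doc)
```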
24 changes: 15 additions & 9 deletions src/main/python/trec-covid/download_indexes.py
@@ -25,15 +25,21 @@


all_indexes = {
    '2020-04-10': ['https://www.dropbox.com/s/iebape2yfgkzkt1/lucene-index-covid-2020-04-10.tar.gz?dl=1',
                   'https://www.dropbox.com/s/gtq2c3xq81mjowk/lucene-index-covid-full-text-2020-04-10.tar.gz?dl=1',
                   'https://www.dropbox.com/s/ivk87journyajw3/lucene-index-covid-paragraph-2020-04-10.tar.gz?dl=1'],
    '2020-05-01': ['https://www.dropbox.com/s/jdsc6wu0vbumpup/lucene-index-cord19-abstract-2020-05-01.tar.gz?dl=1',
                   'https://www.dropbox.com/s/ouvp7zyqsp9y9gh/lucene-index-cord19-full-text-2020-05-01.tar.gz?dl=1',
                   'https://www.dropbox.com/s/e1118vjuf58ojt4/lucene-index-cord19-paragraph-2020-05-01.tar.gz?dl=1'],
    '2020-05-19': ['https://www.dropbox.com/s/7bbz6pm4rduqvx3/lucene-index-cord19-abstract-2020-05-19.tar.gz?dl=1',
                   'https://www.dropbox.com/s/bxhldgks1rxz4ly/lucene-index-cord19-full-text-2020-05-19.tar.gz?dl=1',
                   'https://www.dropbox.com/s/2ewjchln0ihm6hh/lucene-index-cord19-paragraph-2020-05-19.tar.gz?dl=1'],
    '2020-06-19': ['https://www.dropbox.com/s/x8wbuy0atgnajfd/lucene-index-cord19-abstract-2020-06-19.tar.gz?dl=1',
                   'https://www.dropbox.com/s/tf469r70r8aigu2/lucene-index-cord19-full-text-2020-06-19.tar.gz?dl=1',
                   'https://www.dropbox.com/s/fr3v69vhryevwp9/lucene-index-cord19-paragraph-2020-06-19.tar.gz?dl=1'],
    '2020-07-16': ['https://www.dropbox.com/s/9hfowxi7zenuaay/lucene-index-cord19-abstract-2020-07-16.tar.gz?dl=1',
                   'https://www.dropbox.com/s/dyd9sggrqo44d0n/lucene-index-cord19-full-text-2020-07-16.tar.gz?dl=1',
                   'https://www.dropbox.com/s/jdfbrnohtkrvds5/lucene-index-cord19-paragraph-2020-07-16.tar.gz?dl=1']
}


49 changes: 21 additions & 28 deletions src/main/python/trec-covid/index_cord19.py
@@ -53,38 +53,31 @@ def download_url(url, save_dir):

def download_collection(date):
    print(f'Downloading CORD-19 release of {date}...')
    collection_dir = f'collections/'
    tarball_url = f'https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases/cord-19_{date}.tar.gz'
    tarball_local = os.path.join(collection_dir, f'cord-19_{date}.tar.gz')

    if not os.path.exists(tarball_local):
        print(f'Fetching {tarball_url}...')
        download_url(tarball_url, collection_dir)
    else:
        print(f'{tarball_local} already exists, skipping download.')

    print(f'Extracting {tarball_local} into {collection_dir}')
    tarball = tarfile.open(tarball_local)
    tarball.extractall(collection_dir)
    tarball.close()

    docparses = os.path.join(collection_dir, date, 'document_parses.tar.gz')
    collection_base = os.path.join(collection_dir, date)

    print(f'Extracting {docparses} into {collection_base}...')
    tarball = tarfile.open(docparses)
    tarball.extractall(collection_base)
    tarball.close()

    print(f'Renaming {collection_base}')
    os.rename(collection_base, os.path.join(collection_dir, f'cord19-{date}'))


def build_indexes(date):
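The download flow above touches three paths per release: the tarball, the dated directory it unpacks into, and the renamed collection directory the indexer reads. A small helper summarizing that layout (names mirror the script but are illustrative only):

```python
import os


def release_paths(collection_dir, date):
    """Return (tarball, extracted dir, final renamed dir) for a release."""
    tarball = os.path.join(collection_dir, f'cord-19_{date}.tar.gz')
    extracted = os.path.join(collection_dir, date)
    final = os.path.join(collection_dir, f'cord19-{date}')
    return tarball, extracted, final
```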
