
Refactoring COVID documentation and replication scripts (castorini#1417)
+ Reorganized my Dropbox, changed links
+ Updated indexing documentation and scripts to fix underlying changes at AI2
lintool authored Nov 20, 2020
1 parent 0762352 commit 6a69f2c
Showing 7 changed files with 130 additions and 157 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -86,10 +86,10 @@ For the most part, these runs are based on [_default_ parameter settings](https:
The experiments described below are not associated with rigorous end-to-end regression testing and thus provide a lower standard of replicability.
For the most part, manual copying and pasting of commands into a shell is required to replicate our results:

+ [Indexing AI2's COVID-19 Open Research Dataset](docs/experiments-cord19.md)
+ [Baselines for the TREC-COVID Challenge](docs/experiments-covid.md)
+ [Baselines for the TREC-COVID Challenge using doc2query](docs/experiments-covid-doc2query.md)
+ [Ingesting AI2's COVID-19 Open Research Dataset into Solr and Elasticsearch](docs/experiments-cord19-extras.md)
+ [Working with the 20 Newsgroups Dataset](docs/experiments-20newsgroups.md)
+ [Replicating "Neural Hype" Experiments](docs/experiments-forum2018.md)
+ [Guide to BM25 baselines for the MS MARCO Passage Ranking Task](docs/experiments-msmarco-passage.md)
37 changes: 0 additions & 37 deletions bin/download_and_index_cord19.sh

This file was deleted.

122 changes: 62 additions & 60 deletions docs/experiments-cord19.md

Large diffs are not rendered by default.

49 changes: 29 additions & 20 deletions docs/experiments-covid.md
@@ -3,6 +3,14 @@
This document describes various baselines for the [TREC-COVID Challenge](https://ir.nist.gov/covidSubmit/), which uses the [COVID-19 Open Research Dataset (CORD-19)](https://pages.semanticscholar.org/coronavirus-research) from the [Allen Institute for AI](https://allenai.org/).
Here, we focus on running retrieval experiments; for basic instructions on building Anserini indexes, see [this page](experiments-cord19.md).

## Quick Links

+ [Round 5](#round-5)
+ [Round 4](#round-4)
+ [Round 3](#round-3)
+ [Round 2](#round-2) ([Replication Commands](#round-2-replication-commands))
+ [Round 1](#round-1) ([Replication Commands](#round-1-replication-commands))

## Round 5

These are runs that can be easily replicated with Anserini, from pre-built indexes available [here](experiments-cord19.md#pre-built-indexes-all-versions) (version from 2020/07/16, the official corpus used in round 5).
@@ -391,12 +399,17 @@ Exact commands for replicating these runs are found [further down on this page](

## Round 2: Replication Commands

Here are the replication commands for the individual runs.

First, download the pre-built indexes using our script:

```
python src/main/python/trec-covid/download_indexes.py --date 2020-05-01
```
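As a rough sketch of what a downloader like this typically does (the filename handling and loop below are illustrative assumptions, not the script's actual code), assuming Dropbox-style share links ending in `?dl=1`:

```python
import os
import urllib.parse
import urllib.request


def local_name(url):
    """Derive a local tarball name from a share URL, dropping query
    strings such as '?dl=1'."""
    return os.path.basename(urllib.parse.urlparse(url).path)


def download_indexes(urls, dest_dir='indexes'):
    """Fetch each index tarball unless it is already present locally."""
    os.makedirs(dest_dir, exist_ok=True)
    for url in urls:
        target = os.path.join(dest_dir, local_name(url))
        if os.path.exists(target):
            print(f'{target} already exists, skipping download.')
            continue
        print(f'Fetching {url}...')
        urllib.request.urlretrieve(url, target)
```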

Abstract runs:

```
target/appassembler/bin/SearchCollection -index indexes/lucene-index-cord19-abstract-2020-05-01 \
-topicreader Covid -topics src/main/resources/topics-and-qrels/topics.covid-round2.xml -topicfield query+question \
-output runs/anserini.covid-r2.abstract.qq.bm25.txt -runtag anserini.covid-r2.abstract.qq.bm25.txt \
@@ -412,10 +425,11 @@ tools/eval/trec_eval.9.0.4/trec_eval -c -M1000 -m all_trec src/main/resources/to
python tools/eval/measure_judged.py --qrels src/main/resources/topics-and-qrels/qrels.covid-round1.txt --cutoffs 10 --run runs/anserini.covid-r2.abstract.qq.bm25.txt
python tools/eval/measure_judged.py --qrels src/main/resources/topics-and-qrels/qrels.covid-round1.txt --cutoffs 10 --run runs/anserini.covid-r2.abstract.qdel.bm25.txt
```
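`measure_judged.py` reports, at each cutoff, the fraction of retrieved documents that have relevance judgments. A minimal sketch of that computation (the in-memory data shapes here are assumptions; the actual script parses TREC-format run and qrels files):

```python
def judged_at_k(qrels, run, k=10):
    """Average, over topics in the run, of the fraction of top-k
    retrieved docids that appear in the qrels.

    qrels: {topic: set of judged docids}
    run:   {topic: list of docids, ranked best-first}
    """
    fractions = []
    for topic, ranking in run.items():
        judged = qrels.get(topic, set())
        top_k = ranking[:k]
        fractions.append(sum(1 for d in top_k if d in judged) / len(top_k))
    return sum(fractions) / len(fractions)
```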

Full-text runs:

```
target/appassembler/bin/SearchCollection -index indexes/lucene-index-cord19-full-text-2020-05-01 \
-topicreader Covid -topics src/main/resources/topics-and-qrels/topics.covid-round2.xml -topicfield query+question \
-output runs/anserini.covid-r2.full-text.qq.bm25.txt -runtag anserini.covid-r2.full-text.qq.bm25.txt \
@@ -431,10 +445,11 @@ tools/eval/trec_eval.9.0.4/trec_eval -c -M1000 -m all_trec src/main/resources/to
python tools/eval/measure_judged.py --qrels src/main/resources/topics-and-qrels/qrels.covid-round1.txt --cutoffs 10 --run runs/anserini.covid-r2.full-text.qq.bm25.txt
python tools/eval/measure_judged.py --qrels src/main/resources/topics-and-qrels/qrels.covid-round1.txt --cutoffs 10 --run runs/anserini.covid-r2.full-text.qdel.bm25.txt
```

Paragraph runs:

```
target/appassembler/bin/SearchCollection -index indexes/lucene-index-cord19-paragraph-2020-05-01 \
-topicreader Covid -topics src/main/resources/topics-and-qrels/topics.covid-round2.xml -topicfield query+question \
-output runs/anserini.covid-r2.paragraph.qq.bm25.txt -runtag anserini.covid-r2.paragraph.qq.bm25.txt \
@@ -494,13 +509,15 @@ python tools/eval/measure_judged.py --qrels src/main/resources/topics-and-qrels/
```

## Round 1: Replication Commands

First, download the pre-built indexes using our script:

```
python src/main/python/trec-covid/download_indexes.py --date 2020-04-10
```

Here are the commands to generate the runs on the abstract index:

```bash
target/appassembler/bin/SearchCollection -index indexes/lucene-index-covid-2020-04-10 \
-topicreader Covid -topics src/main/resources/topics-and-qrels/topics.covid-round1.xml -topicfield query -removedups \
-bm25 -output runs/run.covid-r1.abstract.query.bm25.txt
@@ -547,10 +564,6 @@ python tools/eval/measure_judged.py --qrels src/main/resources/topics-and-qrels/
```
Here are the commands to generate the runs on the full-text index:

```bash
target/appassembler/bin/SearchCollection -index indexes/lucene-index-covid-full-text-2020-04-10 \
-topicreader Covid -topics src/main/resources/topics-and-qrels/topics.covid-round1.xml -topicfield query -removedups \
-bm25 -output runs/run.covid-r1.full-text.query.bm25.txt
@@ -597,10 +610,6 @@ python tools/eval/measure_judged.py --qrels src/main/resources/topics-and-qrels/
```
Here are the commands to generate the runs on the paragraph index:

```bash
target/appassembler/bin/SearchCollection -index indexes/lucene-index-covid-paragraph-2020-04-10 \
-topicreader Covid -topics src/main/resources/topics-and-qrels/topics.covid-round1.xml -topicfield query \
-selectMaxPassage -bm25 -output runs/run.covid-r1.paragraph.query.bm25.txt
```
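The `-selectMaxPassage` flag used with the paragraph index collapses paragraph-level hits back to documents. A plausible sketch of the idea, assuming paragraph ids of the form `docid.00001` (the exact id scheme and tie-breaking are assumptions):

```python
def select_max_passage(ranked_passages, k=1000):
    """Collapse ranked (passage_id, score) pairs to documents, keeping
    each document's best-scoring passage, then re-rank by that score."""
    best = {}
    for passage_id, score in ranked_passages:
        docid = passage_id.split('.')[0]
        if docid not in best or score > best[docid]:
            best[docid] = score
    return sorted(best.items(), key=lambda x: -x[1])[:k]
```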
2 changes: 1 addition & 1 deletion src/main/java/io/anserini/index/IndexReaderUtils.java
@@ -228,7 +228,7 @@ public static Map<String, Long> getTermCountsWithAnalyzer(IndexReader reader, St
*
* @param reader index reader
* @param term term
* @return the document frequency of a term
*/
public static long getDF(IndexReader reader, String term) {
try {
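`getDF` returns a term's document frequency: the number of documents in the index that contain the term at least once. A toy illustration of the concept (not Anserini's Lucene-backed implementation):

```python
def document_frequency(corpus, term):
    """Count the documents containing the term at least once.

    corpus: list of documents, each a list of tokens.
    """
    return sum(1 for doc in corpus if term in doc)
```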
24 changes: 15 additions & 9 deletions src/main/python/trec-covid/download_indexes.py
@@ -25,15 +25,21 @@


all_indexes = {
    '2020-04-10': ['https://www.dropbox.com/s/iebape2yfgkzkt1/lucene-index-covid-2020-04-10.tar.gz?dl=1',
                   'https://www.dropbox.com/s/gtq2c3xq81mjowk/lucene-index-covid-full-text-2020-04-10.tar.gz?dl=1',
                   'https://www.dropbox.com/s/ivk87journyajw3/lucene-index-covid-paragraph-2020-04-10.tar.gz?dl=1'],
    '2020-05-01': ['https://www.dropbox.com/s/jdsc6wu0vbumpup/lucene-index-cord19-abstract-2020-05-01.tar.gz?dl=1',
                   'https://www.dropbox.com/s/ouvp7zyqsp9y9gh/lucene-index-cord19-full-text-2020-05-01.tar.gz?dl=1',
                   'https://www.dropbox.com/s/e1118vjuf58ojt4/lucene-index-cord19-paragraph-2020-05-01.tar.gz?dl=1'],
    '2020-05-19': ['https://www.dropbox.com/s/7bbz6pm4rduqvx3/lucene-index-cord19-abstract-2020-05-19.tar.gz?dl=1',
                   'https://www.dropbox.com/s/bxhldgks1rxz4ly/lucene-index-cord19-full-text-2020-05-19.tar.gz?dl=1',
                   'https://www.dropbox.com/s/2ewjchln0ihm6hh/lucene-index-cord19-paragraph-2020-05-19.tar.gz?dl=1'],
    '2020-06-19': ['https://www.dropbox.com/s/x8wbuy0atgnajfd/lucene-index-cord19-abstract-2020-06-19.tar.gz?dl=1',
                   'https://www.dropbox.com/s/tf469r70r8aigu2/lucene-index-cord19-full-text-2020-06-19.tar.gz?dl=1',
                   'https://www.dropbox.com/s/fr3v69vhryevwp9/lucene-index-cord19-paragraph-2020-06-19.tar.gz?dl=1'],
    '2020-07-16': ['https://www.dropbox.com/s/9hfowxi7zenuaay/lucene-index-cord19-abstract-2020-07-16.tar.gz?dl=1',
                   'https://www.dropbox.com/s/dyd9sggrqo44d0n/lucene-index-cord19-full-text-2020-07-16.tar.gz?dl=1',
                   'https://www.dropbox.com/s/jdfbrnohtkrvds5/lucene-index-cord19-paragraph-2020-07-16.tar.gz?dl=1']
}


49 changes: 21 additions & 28 deletions src/main/python/trec-covid/index_cord19.py
@@ -53,38 +53,31 @@ def download_url(url, save_dir):

def download_collection(date):
    print(f'Downloading CORD-19 release of {date}...')
    collection_dir = f'collections/'
    tarball_url = f'https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases/cord-19_{date}.tar.gz'
    tarball_local = os.path.join(collection_dir, f'cord-19_{date}.tar.gz')

    if not os.path.exists(tarball_local):
        print(f'Fetching {tarball_url}...')
        download_url(tarball_url, collection_dir)
    else:
        print(f'{tarball_local} already exists, skipping download.')

    print(f'Extracting {tarball_local} into {collection_dir}')
    tarball = tarfile.open(tarball_local)
    tarball.extractall(collection_dir)
    tarball.close()

    docparses = os.path.join(collection_dir, date, 'document_parses.tar.gz')
    collection_base = os.path.join(collection_dir, date)

    print(f'Extracting {docparses} into {collection_base}...')
    tarball = tarfile.open(docparses)
    tarball.extractall(collection_base)
    tarball.close()

    print(f'Renaming {collection_base}')
    os.rename(collection_base, os.path.join(collection_dir, f'cord19-{date}'))


def build_indexes(date):
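The download flow above touches three paths per release: the tarball, the dated directory it unpacks into, and the renamed collection directory the indexer reads. A small helper summarizing that layout (names mirror the script but are illustrative only):

```python
import os


def release_paths(collection_dir, date):
    """Return (tarball, extracted dir, final renamed dir) for a release."""
    tarball = os.path.join(collection_dir, f'cord-19_{date}.tar.gz')
    extracted = os.path.join(collection_dir, date)
    final = os.path.join(collection_dir, f'cord19-{date}')
    return tarball, extracted, final
```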
