Replication and tweaks of 20 Newsgroups (castorini#1204)
# Anserini: Working with the 20 Newsgroups Dataset

This page contains instructions for how to index the [20 Newsgroups dataset](http://qwone.com/~jason/20Newsgroups/).

## Data Prep

There are many versions of the 20 Newsgroups dataset available on the web; we're specifically going to use [this one](http://qwone.com/~jason/20Newsgroups/) (the "bydate" version).
We're going to use `collections/20newsgroups/` as the working directory.
First, we need to download and extract the dataset:

```bash
mkdir -p collections/20newsgroups/
wget -nc http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz -P collections/20newsgroups
tar -xvzf collections/20newsgroups/20news-bydate.tar.gz -C collections/20newsgroups
```

To confirm, `20news-bydate.tar.gz` should have an MD5 checksum of `d6e9e45cb8cb77ec5276dfa6dfc14318`.

After unpacking, you should see the following two folders:

```bash
ls collections/20newsgroups/20news-bydate-test
ls collections/20newsgroups/20news-bydate-train
```

There are docs with the same id in different categories.
For example, doc `123` exists in `misc.forsale` and `sci.crypt`, with different texts.
Since we assume unique docids when building an index, we need to clean the dataset first.
To prune and merge both train and test splits into one folder:

```bash
python src/main/python/20newsgroups/prune_and_merge.py \
 --paths collections/20newsgroups/20news-bydate-test collections/20newsgroups/20news-bydate-train \
 --out collections/20newsgroups/20news-bydate
```

Now you should see the train and test splits merged into one folder in `collections/20newsgroups/20news-bydate/`.

## Indexing

To index train and test together:

```bash
sh target/appassembler/bin/IndexCollection -collection TwentyNewsgroupsCollection \
 -input collections/20newsgroups/20news-bydate \
 -index indexes/lucene-index.20newsgroups.all \
 -generator DefaultLuceneDocumentGenerator -threads 2 \
 -storePositions -storeDocvectors -storeRaw -optimize
```

To index just the train set:

```bash
sh target/appassembler/bin/IndexCollection -collection TwentyNewsgroupsCollection \
 -input collections/20newsgroups/20news-bydate-train \
 -index indexes/lucene-index.20newsgroups.train \
 -generator DefaultLuceneDocumentGenerator -threads 2 \
 -storePositions -storeDocvectors -storeRaw -optimize
```

To index just the test set:

```bash
sh target/appassembler/bin/IndexCollection -collection TwentyNewsgroupsCollection \
 -input collections/20newsgroups/20news-bydate-test \
 -index indexes/lucene-index.20newsgroups.test \
 -generator DefaultLuceneDocumentGenerator -threads 2 \
 -storePositions -storeDocvectors -storeRaw -optimize
```

Indexing should take just a few seconds.
For reference, the number of docs indexed should be exactly as follows:

|              | # of docs | pre-built index |
|--------------|----------:|-----------------|
| Train        | 11,314    | [[download](https://www.dropbox.com/s/npg5eovr92h5k7w/lucene-index.20newsgroups.train.tar.gz)] |
| Test         | 7,532     | [[download](https://www.dropbox.com/s/aptj8hz9wti3qaf/lucene-index.20newsgroups.test.tar.gz)] |
| Train + Test | 18,846    | [[download](https://www.dropbox.com/s/qo2wt6fzu01yt4c/lucene-index.20newsgroups.all.tar.gz)] |

For convenience, we also provide pre-built indexes above.

## Replication Log