chore: add wiki cleaner
trancongman276 committed Apr 15, 2023
1 parent b598115 commit 3492339
Showing 3 changed files with 3,492 additions and 1 deletion.
8 changes: 7 additions & 1 deletion README.md
@@ -21,11 +21,17 @@ Download the processed Yelp and Yahoo datasets by running:
```
bash download_data.sh
```
## Cleaning
```
$ wget https://dumps.wikimedia.org/viwiki/latest/viwiki-latest-pages-articles.xml.bz2
$ bzip2 -d viwiki-latest-pages-articles.xml.bz2
$ python WikiExtractor.py --no-templates -s --lists viwiki-latest-pages-articles.xml -q -o - | perl -CSAD -Mutf8 cleaner.pl > viwiklatest.txt
```

## Training
The basic training command is:
```
-python train.py --train data/yelp/train.txt --valid data/yelp/valid.txt --model_type aae --lambda_adv 10 --noise 0.3,0,0,0 --save-dir checkpoints/yelp/daae
+python train.py --train data/yahoo/train.txt --valid data/yahoo/valid.txt --model_type aae --lambda_adv 10 --noise 0.3,0,0,0 --save-dir checkpoints/daae
```
To train various models, use the following options:
- AE: `--model_type dae --save-dir checkpoints/yelp/ae`
23 changes: 23 additions & 0 deletions cleaner.pl
@@ -0,0 +1,23 @@
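# Clean WikiExtractor output line by line (reads STDIN or the files named on the command line).
while (<>) {
    # Delete balanced {...}, [...] and (...) spans, nested ones included; (?0) recurses into the whole pattern.
    s/\{([^{}]|(?0))*\}//g;
    s/\[([^\[\]]|(?0))*\]//g;
    s/\(([^()]|(?0))*\)//g;

    # Replace every markup tag with a space.
    s/<[^>]+>/ /g;
    s/<\/[^>]+>/ /g;               # no effect: the rule above has already removed closing tags
    s/"+/"/g;                      # collapse runs of double quotes
    s/<[^>]+>[^<]*<\/[^>]+>/ /g;   # no effect either: no tags are left at this point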
    s/–/-/g;    # the matched character was lost in this copy of the diff; an en dash (U+2013) is assumed
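    # Drop the text before each "|" and keep what follows (as in "target|label");
    # the rule is repeated so that lines with several pipes are reduced as well.
    s/[^\|]+\|([^\|]+)/$1/g;
    s/[^\|]+\|([^\|]+)/$1/g;
    s/[^\|]+\|([^\|]+)/$1/g;
    s/[^\|]+\|([^\|]+)/$1/g;

    # Tidy spacing: no spaces before "," or ".", and collapse runs of spaces.
    s/[ ]+,/,/g;
    s/[ ]+[.]/./g;
    s/[ ]+ / /g;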
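    # Normalise typographic punctuation to ASCII.
    s/…/.../g;    # matched character lost in this copy of the diff; a one-character ellipsis (U+2026) is assumed
    s/“/"/g;      # matched character lost in this copy of the diff; a left curly quote (U+201C) is assumed
    s/”/"/g;
    print $_;
}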
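
To try the bracket, tag and spacing rules on a sample line without running the full pipeline, a rough Python rendering of a subset of cleaner.pl might look like the sketch below. It is not part of this commit: `strip_nested` and `clean_line` are hypothetical names, and because Python's `re` module has no `(?0)` recursion, nested spans are removed by repeatedly deleting innermost pairs instead, which should give the same result on balanced input.

```python
import re

# Innermost (non-nested) bracketed spans. Python's `re` has no (?0) recursion,
# so nesting is handled by repeating the substitution until nothing changes.
_INNERMOST = [
    re.compile(r"\{[^{}]*\}"),
    re.compile(r"\[[^\[\]]*\]"),
    re.compile(r"\([^()]*\)"),
]
_TAG = re.compile(r"<[^>]+>")


def strip_nested(text: str, pattern: re.Pattern) -> str:
    """Remove balanced spans by peeling off innermost pairs repeatedly."""
    while True:
        text, count = pattern.subn("", text)
        if count == 0:
            return text


def clean_line(line: str) -> str:
    """Hypothetical helper covering only a subset of the cleaner.pl rules."""
    for pattern in _INNERMOST:
        line = strip_nested(line, pattern)
    line = _TAG.sub(" ", line)           # every tag becomes a space
    line = re.sub(r" +,", ",", line)     # no spaces before a comma
    line = re.sub(r" +\.", ".", line)    # no spaces before a full stop
    line = re.sub(r" {2,}", " ", line)   # collapse runs of spaces
    return line
```

Unbalanced delimiters are left untouched, as in the Perl version, because neither form matches a span that lacks its closing character.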