Coding Textual Inputs Boosts the Accuracy of Neural Networks

Abdul Rafae Khan⁺, Jia Xu⁺ & Weiwei Sun⁺⁺

⁺ Stevens Institute of Technology

⁺⁺ Cambridge University

(1) Coding Neural Machine Translation (NMT)

Install dependencies

Setup fairseq toolkit

git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./

# on MacOS:
# CFLAGS="-stdlib=libc++" pip install --editable ./

Install evaluation packages

pip install sacrebleu

Install tokenization packages

git clone https://github.com/moses-smt/mosesdecoder.git

Install Byte-pair Encoding pacakges

git clone https://github.com/rsennrich/subword-nmt.git

Download and Pre-process IWSLT'17 French-English data

Download and tokenize data

bash prepare_data.sh

Create Metaphone coded data

python phonetic_encoder.py --source fr --target en --input data/ --output data/ --coding metaphone --files train,valid,test
Use --coding metaphone,nysiis,soundex for multiple coding

Byte-pair encode the data

bash apply_bpe.sh fr mt en

Train NMT System

Train French+Metaphone-English Concatenation Model

bash train_concatenation.sh

or

Train French+Metaphone-English Multisource Model

bash train_multisource.sh

(2) Coding Language Modeling

Install dependencies

git clone https://github.com/facebookresearch/XLM.git

Download and Pre-process IWSLT'17 French-English data

Download and tokenize data

bash prepare_data.sh

Create vocabulary

OUTPATH=data/processed/XLM_en/30k
mkdir -p $OUTPATH

python utils/getvocab.py --input $OUTPATH/train.en --output $OUTPATH/vocab.en
python utils/getvocab.py --input $OUTPATH/train.ny --output $OUTPATH/vocab.ny
python utils/getvocab.py --input $OUTPATH/train.enny --output $OUTPATH/vocab.enny

Binarize data

python XLM/preprocess.py $OUTPATH/vocab.en $OUTPATH/train.en &
python XLM/preprocess.py $OUTPATH/vocab.en $OUTPATH/valid.en &
python XLM/preprocess.py $OUTPATH/vocab.en $OUTPATH/test.en &

python XLM/preprocess.py $OUTPATH/vocab.ny $OUTPATH/train.ny &
python XLM/preprocess.py $OUTPATH/vocab.ny $OUTPATH/valid.ny &
python XLM/preprocess.py $OUTPATH/vocab.ny $OUTPATH/test.ny &

python XLM/preprocess.py $OUTPATH/vocab.enny $OUTPATH/train.enny &
python XLM/preprocess.py $OUTPATH/vocab.enny $OUTPATH/valid.enny &
python XLM/preprocess.py $OUTPATH/vocab.enny $OUTPATH/test.enny &

mv $OUTPATH/train.enny.pth $OUTPATH/train.en-ny.pth
mv $OUTPATH/valid.enny.pth $OUTPATH/valid.en-ny.pth
mv $OUTPATH/test.enny.pth $OUTPATH/test.en-ny.pth

Train English baseline

CUDA_VISIBLE_DEVICES=0 python train.py --exp_name xlm_en --dump_path ./dumped_xlm_en --data_path $OUTPATH --lgs 'en' --clm_steps '' --mlm_steps 'en' --emb_dim 256 --n_layers 6 --n_heads 8 --dropout 0.1 --attention_dropout 0.1 --gelu_activation true --batch_size 32 --bptt 256 --optimizer adam_inverse_sqrt,lr=0.00010,warmup_updates=30000,beta1=0.9,beta2=0.999,weight_decay=0.01,eps=0.000001 --epoch_size 300000 --max_epoch 100000 --validation_metrics _valid_en_mlm_ppl --stopping_criterion _valid_en_mlm_ppl,25 --fp16 true --word_mask_keep_rand '0.8,0.1,0.1' --word_pred '0.15'

Train English+NYSIIS

CUDA_VISIBLE_DEVICES=0 python train.py --exp_name xlm_en_ny --dump_path ./dumped_xlm_en_ny --data_path $OUTPATH --lgs 'en' --clm_steps '' --mlm_steps 'en,ny' --emb_dim 256 --n_layers 6 --n_heads 8 --dropout 0.1 --attention_dropout 0.1 --gelu_activation true --batch_size 32 --bptt 256 --optimizer adam_inverse_sqrt,lr=0.00010,warmup_updates=30000,beta1=0.9,beta2=0.999,weight_decay=0.01,eps=0.000001 --epoch_size 300000 --max_epoch 100000 --validation_metrics _valid_en_mlm_ppl --stopping_criterion _valid_en_mlm_ppl,25 --fp16 true --word_mask_keep_rand '0.8,0.1,0.1' --word_pred '0.15'

Train English+NYSIIS+Word Alignment

CUDA_VISIBLE_DEVICES=0 python train.py --exp_name mlm_en_ny --dump_path ./dumped_mlm_en_ny --data_path $OUTPATH --lgs 'en' --clm_steps '' --mlm_steps 'en,ny,en-ny' --emb_dim 256 --n_layers 6 --n_heads 8 --dropout 0.1 --attention_dropout 0.1 --gelu_activation true --batch_size 32 --bptt 256 --optimizer adam_inverse_sqrt,lr=0.00010,warmup_updates=30000,beta1=0.9,beta2=0.999,weight_decay=0.01,eps=0.000001 --epoch_size 300000 --max_epoch 100000 --validation_metrics _valid_en_mlm_ppl --stopping_criterion _valid_en_mlm_ppl,25 --fp16 true --word_mask_keep_rand '0.8,0.1,0.1' --word_pred '0.15'

Name		Name	Last commit message	Last commit date
Latest commit History 22 Commits
utils		utils
README.md		README.md
apply_bpe.sh		apply_bpe.sh
generate_codings.py		generate_codings.py
get_huffman.py		get_huffman.py
huffman.py		huffman.py
phonetic_encoder.py		phonetic_encoder.py
prepare_data.sh		prepare_data.sh
train_concatenation.sh		train_concatenation.sh
train_multisource.sh		train_multisource.sh
tree.py		tree.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Coding Textual Inputs Boosts the Accuracy of Neural Networks

(1) Coding Neural Machine Translation (NMT)

Install dependencies

Download and Pre-process IWSLT'17 French-English data

Train NMT System

(2) Coding Language Modeling

Install dependencies

Download and Pre-process IWSLT'17 French-English data

About

Releases

Packages

Languages

abdulrafae/coding_nmt

Folders and files

Latest commit

History

Repository files navigation

Coding Textual Inputs Boosts the Accuracy of Neural Networks

(1) Coding Neural Machine Translation (NMT)

Install dependencies

Download and Pre-process IWSLT'17 French-English data

Train NMT System

(2) Coding Language Modeling

Install dependencies

Download and Pre-process IWSLT'17 French-English data

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages