ColBERT-X

ColBERT-X is a generalization of ColBERT for cross-language retrieval.

Training

ColBERT-X can be trained in two ways:

  • Zero-Shot (ZS), using the English MS MARCO triples, and
  • Translate-Train (TT), using translated MS MARCO triples.

The training command is shown below:
```bash
CUDA_VISIBLE_DEVICES="0,1,2,3" \
python -m torch.distributed.run --nproc_per_node=4 -m \
xlmr_colbert.train --amp --doc_maxlen 180 --bsize 128 --accum 1 \
--triples /path/to/MSMARCO/triples.train.small.tsv --maxsteps 200000 \
--root /root/to/experiments/ --experiment MSMARCO-CLIR --similarity l2 --run msmarco.clir.l2
```
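
The --triples file follows the standard MS MARCO text-triples format: one tab-separated (query, positive passage, negative passage) triple per line. A minimal sketch of reading one triple (the path is just the placeholder from the command above):

```python
# Minimal sketch: read one (query, positive, negative) triple from the
# tab-separated file passed via --triples.
with open("/path/to/MSMARCO/triples.train.small.tsv", encoding="utf-8") as f:
    query, positive, negative = f.readline().rstrip("\n").split("\t")
```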

Detailed instructions for inference and PRF coming soon!

Changelog

Here we list the differences between the ColBERT v1 codebase and our code:

  • Changed the model prefix from bert to roberta (relevant issue here). This is necessary because an incorrect model prefix prevents the pretrained weights from being matched during loading, leaving the encoder initialized from scratch; see the first sketch after this list.
  • The <PAD> token id is 0 for the bert tokenizer but 1 for the roberta tokenizer (relevant line here); see the second sketch after this list.
  • The roberta tokenizer does not include the '[unused]' tokens in its vocabulary, so they have to be added manually and the token embeddings resized (reference); see the third sketch after this list.
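
A minimal sketch of the first point, assuming a recent Hugging Face transformers release that exports XLMRobertaPreTrainedModel (the ColBERTHead class below is hypothetical, not this repo's model class):

```python
# Hedged sketch (hypothetical class): why the model prefix matters.
from transformers import XLMRobertaModel, XLMRobertaPreTrainedModel

class ColBERTHead(XLMRobertaPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        # base_model_prefix for XLM-R is "roberta"; if this submodule were
        # named "bert", from_pretrained() could not match the checkpoint keys
        # and the encoder weights would be randomly initialized instead.
        self.roberta = XLMRobertaModel(config)
        self.init_weights()

model = ColBERTHead.from_pretrained("xlm-roberta-base")
```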
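
The pad-token difference from the second point can be checked directly (the model names here are only illustrative):

```python
# Minimal check of the differing pad token ids.
from transformers import BertTokenizer, XLMRobertaTokenizer

bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")
xlmr_tok = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")

print(bert_tok.pad_token_id)  # 0: BERT reserves id 0 for [PAD]
print(xlmr_tok.pad_token_id)  # 1: RoBERTa-style vocabularies put <pad> at id 1
```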
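
For the third point, a hedged sketch of adding the marker tokens and resizing the embeddings (ColBERT v1 repurposes BERT's [unused0]/[unused1] as its query/document markers; the exact tokens shown are an assumption, not necessarily what this repo adds):

```python
# Hedged sketch: XLM-R's vocabulary has no '[unused]' tokens, so register them
# explicitly, then grow the embedding matrix so the new ids have rows.
from transformers import XLMRobertaModel, XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
model = XLMRobertaModel.from_pretrained("xlm-roberta-base")

tokenizer.add_special_tokens({"additional_special_tokens": ["[unused0]", "[unused1]"]})
model.resize_token_embeddings(len(tokenizer))  # new rows start randomly initialized
```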
