The corresponding code for our paper: DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations. Results on SentEval are presented below (as averaged scores on the downstream and probing task test sets), along with existing state-of-the-art methods.
Model | Requires labelled data? | Parameters | Embed. dim. | Downstream (-SNLI) | Probing | Δ |
---|---|---|---|---|---|---|
InferSent V2 | Yes | 38M | 4096 | 76.00 | 72.58 | -3.06 |
Universal Sentence Encoder | Yes | 147M | 512 | 78.89 | 66.70 | -0.17 |
Sentence Transformers ("roberta-base-nli-mean-tokens") | Yes | 125M | 768 | 77.19 | 63.22 | -1.87 |
Transformer-small (DistilRoBERTa-base) | No | 82M | 768 | 72.58 | 74.57 | -6.48 |
Transformer-base (RoBERTa-base) | No | 125M | 768 | 72.70 | 74.19 | -6.36 |
DeCLUTR-small (DistilRoBERTa-base) | No | 82M | 768 | 77.41 | 74.71 | -1.65 |
DeCLUTR-base (RoBERTa-base) | No | 125M | 768 | 79.06 | 74.65 | -- |
Transformer-* is the same underlying architecture and pretrained weights as DeCLUTR-* before continued pretraining with our contrastive objective. Both Transformer-* and DeCLUTR-* use mean pooling over their token-level embeddings to produce a fixed-length sentence representation. Downstream scores are computed without considering performance on SNLI (denoted "Downstream (-SNLI)") because InferSent, USE and Sentence Transformers all train on SNLI. Δ: difference to the DeCLUTR-base downstream score.
The easiest way to get started is to follow along with one of our notebooks:
- Training your own model
- Embedding text with a pretrained model
- Evaluating a model with SentEval
This repository requires Python 3.6.1 or later.
Before installing, you should create and activate a Python virtual environment. See here for detailed instructions.
If you don't plan on modifying the source code, install from git using pip:
pip install git+https://github.com/JohnGiorgi/DeCLUTR.git
Otherwise, clone the repository locally and then install
git clone https://github.com/JohnGiorgi/DeCLUTR.git
cd DeCLUTR
pip install --editable .
- If you plan on training your own model, you should also install PyTorch with CUDA support by following the instructions for your system here.
A dataset is simply a file containing one item of text (a document, a scientific paper, etc.) per line. For demonstration purposes, we have provided a script that will download the WikiText-103 dataset and apply our minimal preprocessing:
python scripts/preprocess_wikitext_103.py path/to/output/wikitext-103/train.txt --min-length 2048
See scripts/preprocess_openwebtext.py for a script that can be used to recreate the (much larger) dataset used in our paper.
You can specify the train set path in the configs under "train_data_path".
- A training dataset should contain documents with a minimum of num_anchors * max_span_len * 2 whitespace tokens. This is required to sample spans according to our sampling procedure. See the dataset reader and/or our paper for more details on these hyperparameters.
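If you build a dataset with your own tooling, you can enforce this constraint by filtering out short documents. The following is a minimal sketch; the num_anchors and max_span_len values are placeholders and should match those in your declutr.jsonnet config:

# Drop documents that are too short for the span-sampling procedure.
# These hyperparameter values are placeholders; use the ones from your config.
num_anchors = 2
max_span_len = 512
min_tokens = num_anchors * max_span_len * 2

with open("path/to/your/dataset/train.txt") as infile, \
     open("path/to/your/dataset/train_filtered.txt", "w") as outfile:
    for document in infile:
        if len(document.split()) >= min_tokens:
            outfile.write(document)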
To train the model, use the allennlp train command with our declutr.jsonnet config. For example, to train DeCLUTR-small, run the following:
# This can be (almost) any model from https://huggingface.co/ that supports masked language modelling.
TRANSFORMER_MODEL="distilroberta-base"
allennlp train "training_config/declutr.jsonnet" \
--serialization-dir "output" \
--overrides "{'train_data_path': 'path/to/your/dataset/train.txt'}" \
--include-package "declutr"
The --overrides flag allows you to override any field in the config with a JSON-formatted string, but you can equivalently update the config itself if you prefer. During training, models, vocabulary, configuration, and log files will be saved to the directory provided by --serialization-dir, which can be changed to any directory you like.
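If you prefer to build the overrides string programmatically (for example, to keep the quoting valid JSON when paths contain special characters), here is a minimal sketch using Python's standard json module; the train_data_path field is the same one shown in the command above:

import json

# Build a JSON-formatted overrides string; pass the printed value to
# `allennlp train ... --overrides`.
overrides = json.dumps({"train_data_path": "path/to/your/dataset/train.txt"})
print(overrides)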
We have provided a simple script to export a trained model so that it can be loaded with Hugging Face Transformers
wget -nc https://raw.githubusercontent.com/JohnGiorgi/DeCLUTR/master/scripts/save_pretrained_hf.py
python save_pretrained_hf.py --archive-file "output" --save-directory "output_transformers"
The model, saved to --save-directory, can then be loaded using the Hugging Face Transformers library (see Embedding for more details):
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("output_transformers")
model = AutoModel.from_pretrained("output_transformers")
If you would like to upload your model to the Hugging Face model repository, follow the instructions here.
To train on more than one GPU, provide a list of CUDA devices in your call to allennlp train. For example, to train with four CUDA devices with IDs 0, 1, 2, and 3:
--overrides "{'distributed.cuda_devices': [0, 1, 2, 3]}"
If your GPU supports it, mixed-precision will be used automatically during training and inference.
You can embed text with a trained model in one of three ways:
- As a library: import and initialize an object from this repo, which can be used to embed sentences/paragraphs.
- Hugging Face Transformers: load our pretrained model with the Hugging Face Transformers library.
- Bulk embed: embed all text in a given text file with a simple command-line interface.
Available pre-trained models: "declutr-small" and "declutr-base".
To use the model as a library, import Encoder and pass it some text (it accepts both strings and lists of strings):
from declutr import Encoder
# This can be a path on disk to a model you have trained yourself OR
# the name of one of our pretrained models.
pretrained_model_or_path = "declutr-small"
encoder = Encoder(pretrained_model_or_path)
embeddings = encoder([
"A smiling costumed woman is holding an umbrella.",
"A happy woman in a fairy costume holds an umbrella."
])
These embeddings can then be used, for example, to compute the semantic similarity between some number of sentences or paragraphs:
from scipy.spatial.distance import cosine
semantic_sim = 1 - cosine(embeddings[0], embeddings[1])
See the list of available PRETRAINED_MODELS in declutr/encoder.py:
python -c "from declutr.encoder import PRETRAINED_MODELS ; print(list(PRETRAINED_MODELS.keys()))"
Our pretrained models are also hosted with Hugging Face Transformers, so they can be used like any other model in that library. Here is a simple example:
import torch
from scipy.spatial.distance import cosine
from transformers import AutoModel, AutoTokenizer
# Load the model
tokenizer = AutoTokenizer.from_pretrained("johngiorgi/declutr-small")
model = AutoModel.from_pretrained("johngiorgi/declutr-small")
# Prepare some text to embed
texts = [
"A smiling costumed woman is holding an umbrella.",
"A happy woman in a fairy costume holds an umbrella.",
]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
# Embed the text
with torch.no_grad():
sequence_output = model(**inputs)[0]
# Mean pool the token-level embeddings to get sentence-level embeddings
embeddings = torch.sum(
sequence_output * inputs["attention_mask"].unsqueeze(-1), dim=1
) / torch.clamp(torch.sum(inputs["attention_mask"], dim=1, keepdim=True), min=1e-9)
# Compute a semantic similarity via the cosine distance
semantic_sim = 1 - cosine(embeddings[0], embeddings[1])
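For more than two texts, the same mean-pooled embeddings can be compared all at once. Continuing from the example above, this small sketch (not part of the original example) computes a full pairwise cosine-similarity matrix:

import torch.nn.functional as F

# Normalize the mean-pooled embeddings; a single matrix multiplication then
# gives every pairwise cosine similarity.
normalized = F.normalize(embeddings, p=2, dim=1)
similarity_matrix = normalized @ normalized.T  # shape: (len(texts), len(texts))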
To embed all text in a given file with a trained model, run the following command
allennlp predict "output" "path/to/input.txt" \
--output-file "output/embeddings.jsonl" \
--batch-size 32 \
--cuda-device 0 \
--use-dataset-reader \
--overrides "{'dataset_reader.num_anchors': null}" \
--include-package "declutr"
This will:
- Load the model serialized to "output" with the "best" weights (i.e. the ones that achieved the lowest loss during training).
- Use that model to embed the text in the provided input file ("path/to/input.txt").
- Save the embeddings to disk as a JSON lines file ("output/embeddings.jsonl").
The text embeddings are stored in the field "embeddings" in "output/embeddings.jsonl".
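They can be loaded back into memory with a few lines of Python. This is a minimal sketch that relies only on the "embeddings" field described above:

import json

import numpy as np

# Collect the "embeddings" field from each JSON lines record into one array.
with open("output/embeddings.jsonl") as f:
    embeddings = np.array([json.loads(line)["embeddings"] for line in f])

print(embeddings.shape)  # (number of lines in input.txt, embedding dimension)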
SentEval is a library for evaluating the quality of sentence embeddings. We provide a script to evaluate a trained model against SentEval, along with a notebook that documents the process. Broadly, the steps are the following:
First, clone the SentEval repository and download the transfer task datasets (you only need to do this once)
# Clone our fork which has several bug fixes merged
git clone https://github.com/JohnGiorgi/SentEval.git
cd SentEval/data/downstream/
./get_transfer_data.bash
cd ../../../
See the SentEval repository for full details.
Then you can run our script to evaluate a trained model against SentEval
python scripts/run_senteval.py allennlp "SentEval" "output" \
--output-filepath "output/senteval_results.json" \
--cuda-device 0 \
--include-package "declutr"
The results will be saved to "output/senteval_results.json". This can be changed to any path you like.
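The results file is ordinary JSON, so it can be inspected with a few lines of Python. This is a minimal sketch; the exact structure of each entry depends on the script's output:

import json

# Load the SentEval results written by run_senteval.py and print them per task.
with open("output/senteval_results.json") as f:
    results = json.load(f)

for task, scores in results.items():
    print(task, scores)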
Pass the flag --prototyping-config to get a proxy of the results while dramatically reducing computation time.
For a list of commands, run:
python scripts/run_senteval.py --help
For help with a specific command, e.g. allennlp, run:
python scripts/run_senteval.py allennlp --help
If you use DeCLUTR in your work, please consider citing our preprint:
@article{Giorgi2020DeCLUTRDC,
title={DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations},
author={John M Giorgi and Osvald Nitski and Gary D. Bader and Bo Wang},
journal={ArXiv},
year={2020},
volume={abs/2006.03659}
}