This repository contains the code accompanying the paper *Life after BERT: What do Other Muppets Understand about Language?* (ACL 2022).
In this work, we apply a variety of models to the oLMpics benchmark and psycholinguistic probing datasets to determine how factors such as architecture, directionality, dataset size, and pre-training objective affect a model's linguistic capabilities.
You can install the package via:

```
pip install git+https://github.com/kev-zhao/life-after-bert
```
Or (recommended) you can download the source code and install the package in editable mode:
```
git clone https://github.com/kev-zhao/life-after-bert
cd life-after-bert
pip install -e .
```
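To check that the installation worked, importing the package in a fresh interpreter is a quick sanity check (this only assumes the install completed; no models are downloaded):

```python
# Sanity check: the package should import without errors after installation
import life_after_bert

print(life_after_bert.__name__)  # prints "life_after_bert"
```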
Command-line usage:

```
usage: main.py [-h] --model_name MODEL_NAME --model_arch MODEL_ARCH [--mask_token MASK_TOKEN]

optional arguments:
  -h, --help            show this help message and exit
  --model_name MODEL_NAME
                        Identifier of any pretrained HuggingFace model
  --model_arch MODEL_ARCH
                        Model architecture, one of `encoder`, `decoder`, and `encoder-decoder`
  --mask_token MASK_TOKEN
                        Tokenizer mask token (string), if different from the default. Mainly used for GPT2 ("[MASK]") and T5 ("<extra_id_0>").
```
For example:

```
python main.py \
    --model_name roberta-large \
    --model_arch encoder
```
Expected output:
```
{
    'Age Comparison': 0.986,
    'Always Never': 0.1357142857142857,
    'Antonym Negation': 0.744,
    'Multihop Composition': 0.28,
    'Size Comparison': 0.874,
    'Taxonomy Conjunction': 0.4540901502504174
}
```
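Other architectures are run the same way; for a decoder model such as GPT-2, the mask token usually has to be supplied explicitly via `--mask_token` (see the options above). A sketch using only the documented flags (output not shown):

```
python main.py \
    --model_name gpt2-large \
    --model_arch decoder \
    --mask_token "[MASK]"
```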
The package can also be used directly from Python:

```python
from datasets import load_dataset
import torch, transformers

import life_after_bert

device = "cuda" if torch.cuda.is_available() else "cpu"

# Use an encoder model
model = transformers.AutoModelForMaskedLM.from_pretrained("roberta-large")
tokenizer = transformers.AutoTokenizer.from_pretrained("roberta-large")

# Evaluate on a HuggingFace dataset
hub_dataset = load_dataset("KevinZ/oLMpics", "Age_Comparison")["test"]
dataset = life_after_bert.MCDataset(hub_dataset["stem"], hub_dataset["choices"], hub_dataset["answerKey"], num_choices=2, tokenizer=tokenizer)
accuracy, (all_answers, all_preds) = life_after_bert.evaluate_encoder(model, tokenizer, dataset, device=device)
print(f"{model.config._name_or_path} accuracy: {accuracy} on Age Comparison task")
```
```python
# Use a decoder model
model = transformers.AutoModelForCausalLM.from_pretrained("gpt2-large")
tokenizer = transformers.AutoTokenizer.from_pretrained("gpt2-large", mask_token="[MASK]")  # Causal LMs don't have a mask token by default

# Evaluate on a custom dataset from a jsonl file
dataset = life_after_bert.MCDataset.load_data("../tests/data/oLMpics_always_never_dev.jsonl", num_choices=5, tokenizer=tokenizer)
accuracy, (all_answers, all_preds) = life_after_bert.evaluate_decoder(model, tokenizer, dataset, device=device)
print(f"{model.config._name_or_path} accuracy: {accuracy} on Always Never task")
```
```python
# Or evaluate on multiple tasks
evaluator = life_after_bert.LaBEvaluator()
task_infos = [
    ("Antonym Negation", 2),  # You can use the name of an oLMpics task
    ("Taxonomy Conjunction", 3),
    ("../tests/data/oLMpics_multihop_composition_dev.jsonl", 3),  # Or pass in the file paths
    ("../tests/data/oLMpics_size_comparison_dev.jsonl", 2),
]

# Use an encoder-decoder model
model = transformers.AutoModelForSeq2SeqLM.from_pretrained("t5-large")
tokenizer = transformers.AutoTokenizer.from_pretrained("t5-large", mask_token="<extra_id_0>")

task_accs = evaluator.evaluate(model, tokenizer, task_infos, model_arch="encoder-decoder", device=device)
print(task_accs)
```
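Assuming `task_accs` maps task names to accuracies, as in the expected command-line output shown earlier (this is an assumption about the return type), the results can be formatted per task:

```python
# Assumes task_accs is a dict mapping task name -> accuracy, as in the expected output above
for task_name, acc in sorted(task_accs.items()):
    print(f"{task_name}: {acc:.3f}")
```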
The following model families are currently supported:

- BERT
- RoBERTa
- DistilBERT
- T5
- BART (due to the HuggingFace implementation, this works as an encoder instead of an encoder-decoder; see the sketch after this list)
- Pegasus (Seq2Seq model, not included in paper)
- GPT2
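For instance, following the note on BART above, a BART checkpoint would go through the encoder path. A sketch using the command-line flags documented earlier (the `facebook/bart-large` identifier is an assumption; any BART checkpoint on the HuggingFace Hub should work the same way):

```
python main.py \
    --model_name facebook/bart-large \
    --model_arch encoder
```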
If you use this code, please cite:

```
@inproceedings{Lialin2022LifeAB,
    title={Life after BERT: What do Other Muppets Understand about Language?},
    author={Vladislav Lialin and Kevin Zhao and Namrata Shivagunde and Anna Rumshisky},
    booktitle={ACL},
    year={2022}
}
```