Commit
…into newtokenizers
teetone committed Jul 27, 2022
2 parents 1c2909c + df4f6b6 commit ae09ff3
Showing 3 changed files with 65 additions and 43 deletions.
7 changes: 4 additions & 3 deletions README.md
@@ -306,15 +306,16 @@ to estimate the token usage. The tokenizer will be downloaded and cached when ru

## Final benchmarking (Infrastructure team only)

1. Running all the `RunSpec`s can take a long time, so use SCDT: `ssh scdt`.
1. `ssh sc`.
1. Create a screen session: `screen -S benchmarking`.
1. Use a john to run the suite: `nlprun --priority high -c 8 -g 0 --memory 64g`.
1. Go to the source code directory: `cd /u/scr/nlp/crfm/benchmarking/benchmarking`.
We have 700 GB of disk space total on `/u/scr/nlp/crfm`.
1. Pull the latest changes: `git pull`.
1. Activate the Conda environment: `conda activate crfm_benchmarking`.
1. Run `pip install -e .` if there are new dependencies to install.
1. Run the `benchmark-present` command, e.g.,
`benchmark-present --max-eval-instances 500 --conf src/benchmark/presentation/run_specs.conf &> run.log`.
1. Run `benchmark-present-all.sh`:
`bash scripts/benchmark-present-all.sh --max-eval-instances 1000 --num-threads 1 --priority 2 --local`.
1. Detach from the screen session: `Ctrl-a d`.
1. To check on the screen session: `screen -r benchmarking`.
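The numbered steps above can be sketched as a single dry-run script (run after `ssh`-ing into the cluster). The `nlprun` flags, Conda environment name, and `/u/scr/nlp/crfm` paths are copied verbatim from the steps and are specific to this cluster; the `run` wrapper is a hypothetical helper that only prints each command unless `DRY_RUN=0`.

```shell
#!/bin/bash
# Dry-run sketch of the final-benchmarking session described above.
# By default (DRY_RUN=1) each step is printed rather than executed.
set -euo pipefail

DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: echo the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

run screen -S benchmarking
run nlprun --priority high -c 8 -g 0 --memory 64g
run cd /u/scr/nlp/crfm/benchmarking/benchmarking
run git pull
run conda activate crfm_benchmarking
run pip install -e .
run bash scripts/benchmark-present-all.sh --max-eval-instances 1000 --num-threads 1 --priority 2 --local
```

Note that `screen`, `cd`, and `conda activate` change the state of the current shell, so in practice these steps are run interactively as listed above rather than through a wrapper.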

1 change: 1 addition & 0 deletions scripts/benchmark-present-all.sh
@@ -48,6 +48,7 @@ models=(
"microsoft/TNLGv2_530B"
"microsoft/TNLGv2_7B"
"together/gpt-j-6b"
"together/gpt-neox-20b"
)

for model in "${models[@]}"
100 changes: 60 additions & 40 deletions src/benchmark/presentation/run_specs.conf
@@ -16,21 +16,28 @@
##### Generic #####

##### Question Answering #####
# Scenarios: BoolQ, NarrativeQA, NewsQA, QuAC
# Scenarios: NaturalQuestions
# Scenarios: CommonsenseQA, HellaSwag, OpenBookQA, TruthfulQA
# Scenarios: MMLU

# Reading comprehension
## Reading comprehension

"boolq:model=text,data_augmentation=canonical": {status: "READY", priority: 1}
"narrative_qa:model=text,data_augmentation=canonical": {status: "READY", priority: 2}
"news_qa:model=text,data_augmentation=canonical": {status: "READY", priority: 3}
"boolq:model=text,data_augmentation=canonical": {status: "READY", priority: 1}
"quac:model=text,data_augmentation=canonical": {status: "READY", priority: 1}

# Reading comprehension and closedbook QA variants
## Reading comprehension and closedbook QA variants

"natural_qa:model=text,mode=openbook-longans,data_augmentation=canonical": {status: "READY", priority: 1}
"natural_qa:model=text,mode=closedbook,data_augmentation=canonical": {status: "READY", priority: 1}

# Closed-book QA with multiple choice
"commonsense:model=text,dataset=hellaswag,method=multiple_choice_joint": {status: "READY", priority: 1}
"commonsense:model=text,dataset=openbookqa,method=multiple_choice_joint": {status: "READY", priority: 2}
"commonsense:model=text,dataset=commonsenseqa,method=multiple_choice_joint": {status: "READY", priority: 2}
## Closed-book QA with multiple choice

"commonsense:model=text,dataset=commonsenseqa,method=multiple_choice_joint": {status: "READY", priority: 3}
"commonsense:model=text,dataset=hellaswag,method=multiple_choice_separate_calibrated": {status: "READY", priority: 1}
"commonsense:model=text,dataset=openbookqa,method=multiple_choice_separate_calibrated": {status: "READY", priority: 2}
"truthful_qa:model=text,task=mc_single": {status: "READY", priority: 2}

# For MMLU, we sampled the following 10 subjects, which cover diverse topics across humanities, social sciences and STEM.
@@ -94,7 +101,7 @@


##### Information Retrieval #####
# Scenarios: MS Marco, TREC
# Scenarios: MS MARCO (regular), MS MARCO (TREC)

# TODO: rename scenario to msmarco, track to msmarco - Issue 527
# TODO: Update valid_topk=30 based on AI21 results
@@ -111,11 +118,15 @@
"summarization_xsum_sampled:model=text,temperature=0.3": {status: "READY", priority: 1}


##### Text Classification #####
# Scenarios: IMDB, RAFT, CivilComments
##### Sentiment Analysis #####
# Scenarios: IMDB

"imdb:model=text,data_augmentation=canonical": {status: "READY", priority: 1}


##### (Miscellaneous) Text Classification #####
# Scenarios: RAFT

"raft:subset=ade_corpus_v2,model=text,data_augmentation=canonical": {status: "READY", priority: 2}
"raft:subset=banking_77,model=text,data_augmentation=canonical": {status: "READY", priority: 2}
"raft:subset=neurips_impact_statement_risks,model=text,data_augmentation=canonical": {status: "READY", priority: 2}
@@ -128,14 +139,10 @@
"raft:subset=tai_safety_research,model=text,data_augmentation=canonical": {status: "READY", priority: 2}
"raft:subset=terms_of_service,model=text,data_augmentation=canonical": {status: "READY", priority: 2}

"entity_matching:model=text,dataset=Beer,data_augmentation=canonical": {status: "READY", priority: 1}
"entity_matching:model=text,dataset=Abt_Buy,data_augmentation=canonical": {status: "READY", priority: 2}
"entity_matching:model=text,dataset=Dirty_iTunes_Amazon,data_augmentation=canonical": {status: "READY", priority: 2}

"entity_data_imputation:model=text,dataset=Buy,data_augmentation=canonical": {status: "READY", priority: 1}
"entity_data_imputation:model=text,dataset=Restaurant,data_augmentation=canonical": {status: "READY", priority: 2}
##### Toxicity Detection #####
# Scenarios: CivilComments

# Performance disparities
"civil_comments:model=text,data_augmentation=canonical,subject=all": {status: "READY", priority: 1}
"civil_comments:model=text,data_augmentation=canonical,subject=asian": {status: "READY", priority: 3}
"civil_comments:model=text,data_augmentation=canonical,subject=atheist": {status: "READY", priority: 3}
@@ -168,19 +175,23 @@
##### Language #####
# Scenarios: BLiMP, The Pile, ICE, WikiText-103, TwitterAAE

# TODO: convert this into multiple choice, and let adaptation handle it (input empty, references are the two sentences)
"blimp:model=full_functionality_text,phenomenon=island_effects": {status: "READY", priority: 3}
"blimp:model=full_functionality_text,phenomenon=anaphor_agreement": {status: "READY", priority: 3}
"blimp:model=full_functionality_text,phenomenon=argument_structure": {status: "READY", priority: 3}
"blimp:model=full_functionality_text,phenomenon=determiner_noun_agreement": {status: "READY", priority: 3}
"blimp:model=full_functionality_text,phenomenon=subject_verb_agreement": {status: "READY", priority: 3}
"blimp:model=full_functionality_text,phenomenon=ellipsis": {status: "READY", priority: 3}
"blimp:model=full_functionality_text,phenomenon=control_raising": {status: "READY", priority: 3}
"blimp:model=full_functionality_text,phenomenon=quantifiers": {status: "READY", priority: 3}
"blimp:model=full_functionality_text,phenomenon=irregular_forms": {status: "READY", priority: 3}
"blimp:model=full_functionality_text,phenomenon=npi_licensing": {status: "READY", priority: 3}
"blimp:model=full_functionality_text,phenomenon=binding": {status: "READY", priority: 3}
"blimp:model=full_functionality_text,phenomenon=filler_gap_dependency": {status: "READY", priority: 3}
# We select 4 phenomena to elevate to priority 2, one per linguistic field.
# Each phenomenon in BLiMP is annotated as belonging to one of the following 4 linguistic fields:
# Morphology, Semantics, Syntax, and Syntax-Semantics.
# Beyond ensuring coverage of these 4 fields, to choose the higher-priority representative,
# we pick the phenomenon within each field with the lowest reported GPT-2 performance (Warstadt et al., 2020).
"blimp:model=full_functionality_text,phenomenon=anaphor_agreement": {status: "READY", priority: 3} # Morphology
"blimp:model=full_functionality_text,phenomenon=determiner_noun_agreement": {status: "READY", priority: 3} # Morphology
"blimp:model=full_functionality_text,phenomenon=irregular_forms": {status: "READY", priority: 2} # Morphology
"blimp:model=full_functionality_text,phenomenon=subject_verb_agreement": {status: "READY", priority: 3} # Morphology
"blimp:model=full_functionality_text,phenomenon=quantifiers": {status: "READY", priority: 2} # Semantics
"blimp:model=full_functionality_text,phenomenon=npi_licensing": {status: "READY", priority: 3} # Semantics
"blimp:model=full_functionality_text,phenomenon=argument_structure": {status: "READY", priority: 3} # Syntax
"blimp:model=full_functionality_text,phenomenon=ellipsis": {status: "READY", priority: 3} # Syntax
"blimp:model=full_functionality_text,phenomenon=filler_gap_dependency": {status: "READY", priority: 3} # Syntax
"blimp:model=full_functionality_text,phenomenon=island_effects": {status: "READY", priority: 2} # Syntax
"blimp:model=full_functionality_text,phenomenon=binding": {status: "READY", priority: 2} # Syntax-Semantics
"blimp:model=full_functionality_text,phenomenon=control_raising": {status: "READY", priority: 3} # Syntax-Semantics
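Every entry above follows the same `"<run spec>": {status: ..., priority: N}` pattern, and the comments select which entries to elevate by priority. As a minimal illustration of how such a file can be filtered (a regex sketch assuming this exact line format, not the benchmark's actual HOCON loader):

```python
import re

# Matches lines of the form:  "spec": {status: "READY", priority: N}
LINE_RE = re.compile(
    r'"(?P<spec>[^"]+)":\s*\{status:\s*"(?P<status>\w+)",\s*priority:\s*(?P<priority>\d+)\}'
)

def select_specs(conf_text, max_priority):
    """Return the READY run specs whose priority is <= max_priority."""
    selected = []
    for line in conf_text.splitlines():
        m = LINE_RE.search(line)
        if m and m.group("status") == "READY" and int(m.group("priority")) <= max_priority:
            selected.append(m.group("spec"))
    return selected

conf = '''
"blimp:model=full_functionality_text,phenomenon=irregular_forms": {status: "READY", priority: 2}  # Morphology
"blimp:model=full_functionality_text,phenomenon=npi_licensing": {status: "READY", priority: 3}  # Semantics
'''
print(select_specs(conf, max_priority=2))
# → ['blimp:model=full_functionality_text,phenomenon=irregular_forms']
```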

## Language modeling

@@ -316,9 +327,8 @@
"wikifact:model=text,k=5,subject=P86": {status: "READY", priority: 4}
"wikifact:model=text,k=5,subject=P937": {status: "READY", priority: 4}

##### Reasoning #####

# Evaluate all text and Codex models for all reasoning tasks
##### Reasoning #####

## Synthetic
"numeracy:model=all,run_solver=True,relation_type=linear,mode=function": {status: "READY", priority: 2}
@@ -362,7 +372,6 @@

## Real

# Roughly reproduce MATH settings:
"math:model=all,subject=number_theory,level=1,use_official_examples=True": {status: "READY", priority: 2}
"math:model=all,subject=intermediate_algebra,level=1,use_official_examples=True": {status: "READY", priority: 2}
"math:model=all,subject=algebra,level=1,use_official_examples=True": {status: "READY", priority: 2}
@@ -446,14 +455,23 @@

"gsm:model=all": {status: "READY", priority: 2}

# Legal reasoning
"legal_support:model=all": {status: "READY", priority: 2}

"lsat_qa:model=all,task=all": {status: "READY", priority: 2}
"lsat_qa:model=all,task=grouping": {status: "READY", priority: 3}
"lsat_qa:model=all,task=ordering": {status: "READY", priority: 3}
"lsat_qa:model=all,task=assignment": {status: "READY", priority: 3}
"lsat_qa:model=all,task=miscellaneous": {status: "READY", priority: 3}

# Legal reasoning
"legal_support:model=all": {status: "READY", priority: 2}
# Data processing

"entity_matching:model=text,dataset=Beer": {status: "READY", priority: 2}
"entity_matching:model=text,dataset=Abt_Buy": {status: "READY", priority: 2}
"entity_matching:model=text,dataset=Dirty_iTunes_Amazon": {status: "READY", priority: 2}

"entity_data_imputation:model=text,dataset=Buy": {status: "READY", priority: 2}
"entity_data_imputation:model=text,dataset=Restaurant": {status: "READY", priority: 2}

# Code
"code:model=code,dataset=HumanEval": {status: "READY", priority: 1}
@@ -462,6 +480,8 @@

##### Harms #####

## Copyright

# Randomly sampled instances from the original BooksCorpus.
# We expect data here to be less repeated in the pretraining corpus. This approximates the average case.
"copyright:model=text,datatag=n_books_1000-extractions_per_book_1-prefix_length_125": {status: "READY", priority: 1}
@@ -474,11 +494,14 @@
"copyright:model=code,datatag=prompt_num_line_5-min_lines_20.json": {status: "READY", priority: 3}
"copyright:model=code,datatag=prompt_num_line_10-min_lines_20.json": {status: "READY", priority: 2}

"disinformation:model=text,capability=reiteration,topic=covid": {status: "READY", priority: 1}
## Disinformation

"disinformation:model=text,capability=reiteration,topic=climate": {status: "READY", priority: 1}
"disinformation:model=text,capability=reiteration,topic=covid": {status: "READY", priority: 1}
"disinformation:model=text,capability=wedging": {status: "READY", priority: 1}

# TODO: some are missing instances?
## Bias

"bbq:model=text,subject=all": {status: "READY", priority: 2}
"bbq:model=text,subject=age": {status: "READY", priority: 3}
"bbq:model=text,subject=disability_status": {status: "READY", priority: 3}
@@ -492,11 +515,8 @@
"bbq:model=text,subject=SES": {status: "READY", priority: 3}
"bbq:model=text,subject=sexual_orientation": {status: "READY", priority: 3}

### Toxicity
# TODO: consider better prompting (same for RealToxicityPrompts)
## Toxicity

# Calculating toxicity metric:
# Empirical probability of generating a span with Toxicity >= 0.5 at least once over k generations
"real_toxicity_prompts:model=text": {status: "READY", priority: 2}
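The comment above defines the toxicity metric as the empirical probability, over prompts, of generating at least one span with Toxicity >= 0.5 among k generations. A minimal sketch of that computation (the scores below are illustrative, not real toxicity-classifier outputs):

```python
def toxicity_probability(scores_per_prompt, threshold=0.5):
    """Fraction of prompts with at least one generation whose toxicity
    score meets the threshold. scores_per_prompt holds one list of k
    per-generation toxicity scores for each prompt."""
    flagged = [any(s >= threshold for s in scores) for scores in scores_per_prompt]
    return sum(flagged) / len(scores_per_prompt)

# Illustrative scores for 4 prompts, k = 3 generations each.
scores = [[0.1, 0.7, 0.2], [0.3, 0.2, 0.1], [0.6, 0.4, 0.9], [0.0, 0.0, 0.4]]
print(toxicity_probability(scores))  # → 0.5 (2 of 4 prompts have a toxic generation)
```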

"bold:model=text,subject=all": {status: "READY", priority: 2}
