Welcome! This internal repository contains all the assets for the CRFM/Mercury benchmarking project. There are two related parts:
- Proxy (src/proxy): provides a unified way to access major language models.
- Benchmarking (see src/benchmark): evaluates such language models.
To install the dependencies (into venv):
./pre-commit.sh
We provide a single unified entry point for accessing large language models (e.g., GPT-3, Jurassic). It offers both a web interface and a REST API.
To use the web interface, go to https://crfm-models.stanford.edu.
To use the REST API, see demo.py.
Create prod_env/credentials.conf to contain the API keys for any language models you have access to.
openaiApiKey: ...
ai21ApiKey: ...
To start a local server (go to http://localhost:1959 to try it out):
venv/bin/proxy-server
On macOS, bypass the added security that restricts multithreading by running:
OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES venv/bin/proxy-server
The production version of the proxy is running on crfm-models.stanford.edu; you need permission to get SSH access.
This is done, but just for the record:
laptop:$ ssh crfm-models.stanford.edu
crfm-models:$ cd /home
crfm-models:$ git clone [email protected]:stanford-crfm/benchmarking
crfm-models:$ cd benchmarking
crfm-models:$ mkdir prod_env
crfm-models:$ echo '{"api_key": "crfm"}' > prod_env/accounts.jsonl
laptop:$ rsync -arvz prod_env/credentials.conf crfm-models.stanford.edu:/home/benchmarking/prod_env
We use Google's Perspective API to calculate the toxicity of completions.
To send requests to the Perspective API, we need to generate an API key from GCP. Follow the Get Started guide to request the service and the Enable the API guide to generate the API key. Once you have a valid API key, add an entry to credentials.conf:
perspectiveApiKey: <Generated API key>
By default, Perspective API allows only 1 query per second. Fill out this form to increase the request quota.
The current API key we are using in production was created with the hai-gcp-models account and allows 200 queries per second. The API key expires on 7/30/2022.
The SSL certificate, CSR and private key for crfm-models.stanford.edu are stored at /home/ssl.
The current SSL certificate expires on 12/30/2022.
To renew the SSL certificate, follow these steps:
- Fill out this form:
  - Log on with your SUNet ID. You must be an admin in order to submit a request.
  - For Server Name, put crfm-models.stanford.edu.
  - For Server type, select OTHER.
  - For Contact group/mailman address, enter your Stanford email address.
  - Under Copy and paste your CSR, paste the content of /home/ssl/public.csr.
  - Leave the optional fields blank and click Submit.
  - You should receive your certificate by email within 2 business days.
- Once you receive the SSL cert, concatenate the contents of "X509 Certificate only, Base64 encoded" with the contents of "X509 Intermediates/root only Reverse, Base64 encoded" and place the result at /home/ssl/crfm-models.crt. crfm-models.crt should look something like this:
  -----BEGIN CERTIFICATE-----
  (Your Primary SSL certificate: .crt)
  -----END CERTIFICATE-----
  -----BEGIN CERTIFICATE-----
  (Your Intermediate certificate: reversed.crt)
  -----END CERTIFICATE-----
- Restart the server.
- Open the website in a browser and verify the connection is secure.
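The concatenation step above can be scripted. The following is a minimal sketch, not part of this repository: the function names and the local file names primary.crt and reversed.crt are assumptions chosen for illustration. It only checks that the inputs look like PEM blocks and joins them in the order described above; it does not validate the certificates themselves.

```python
from pathlib import Path

BEGIN = "-----BEGIN CERTIFICATE-----"
END = "-----END CERTIFICATE-----"


def count_pem_certificates(text: str) -> int:
    """Count PEM certificate blocks; raise if BEGIN/END markers are unbalanced."""
    begins = text.count(BEGIN)
    ends = text.count(END)
    if begins != ends:
        raise ValueError("unbalanced BEGIN/END CERTIFICATE markers")
    return begins


def build_chain(primary_pem: str, intermediates_pem: str) -> str:
    """Return the primary certificate followed by the intermediate chain."""
    if count_pem_certificates(primary_pem) != 1:
        raise ValueError("expected exactly one primary certificate")
    if count_pem_certificates(intermediates_pem) < 1:
        raise ValueError("expected at least one intermediate certificate")
    # Strip surrounding whitespace so every block starts on its own line.
    return primary_pem.strip() + "\n" + intermediates_pem.strip() + "\n"


def write_chain(primary_path: str, intermediates_path: str, out_path: str) -> None:
    """Read both input files (hypothetical names) and write the combined file."""
    chain = build_chain(
        Path(primary_path).read_text(),
        Path(intermediates_path).read_text(),
    )
    Path(out_path).write_text(chain)
```

For example, write_chain("primary.crt", "reversed.crt", "crfm-models.crt") would produce a file that can then be copied to /home/ssl, which may require elevated permissions.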
If, for whatever reason, the private key or CSR is misplaced, generate new ones by running:
sudo openssl req -new -nodes -newkey rsa:2048 -keyout private.key -out public.csr
and fill out the form:
Country Name (2 letter code) [AU]:US
State or Province Name (full name) [Some-State]:California
Locality Name (eg, city) []:Stanford
Organization Name (eg, company) [Internet Widgits Pty Ltd]:Stanford University
Organizational Unit Name (eg, section) []:CRFM
Common Name (e.g. server FQDN or YOUR name) []:crfm-models.stanford.edu
Email Address []:
Please enter the following 'extra' attributes
to be sent with your certificate request
A challenge password []:
An optional company name []:
Then, follow the steps above to request a new SSL certificate.
Update the code:
laptop:$ ssh crfm-models.stanford.edu
crfm-models:$ cd /home/benchmarking
crfm-models:$ git pull
crfm-models:$ ./pre-commit.sh
If everything looks okay:
ssh [email protected]
# Switch into the screen session
crfm-models:$ screen -r deploy
# Hit ctrl-c to kill the existing process
# Restart the server
sudo venv/bin/proxy-server -p 443 --ssl-key-file /home/ssl/private.key --ssl-cert-file /home/ssl/crfm-models.crt --workers 16 &> server.log
# Exit the screen session: ctrl-a, then d
The recommended number of Gunicorn workers is twice the number of cores. crfm-models.stanford.edu has 8 cores (verified with nproc), so we use 8 * 2 = 16 workers.
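As a quick check of the rule of thumb above, the following sketch computes the worker count from the core count. The function name is an assumption for illustration, and os.cpu_count() may report a different number than nproc on machines where the process is restricted to a subset of cores.

```python
import os


def recommended_workers(num_cores=None):
    """Return twice the core count, following the rule of thumb quoted above."""
    if num_cores is None:
        # os.cpu_count() can return None when the count cannot be determined.
        num_cores = os.cpu_count() or 1
    return 2 * num_cores


print(recommended_workers(8))  # 16
```

The result is the value passed to --workers when restarting the server.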
Double check that the website still works.
The server logs can be streamed by running: tail -f /home/benchmarking/server.log.
Here's a bird's-eye view of how the benchmarking process interacts with the main classes (see benchmark):
- A Scenario (given by a ScenarioSpec) specifies a task and a data distribution. It specifies a set of Instances, where each Instance has an input (e.g., question) and a set of Reference outputs (e.g., multiple choice answers).
- A DataPreprocessor takes in a Scenario and produces a list of Instances. Each Instance is given a unique ID. The set of Instances is augmented according to DataAugmenterSpec.
- An Adapter (given by an AdapterSpec) takes a list of Instances and adapts it to a set of Requests to the API (e.g., the model, temperature, number of in-context training examples). Formally, the output is a ScenarioState containing a set of RequestStates, where each RequestState consists of a Request and any metadata used to track the role of this Request (e.g., the relevant Instance and Reference).
- An Executor (given by an ExecutionSpec) executes each Request in the RequestState to produce a RequestResult for each one; everything is encapsulated in a ScenarioState.
- A Metric (given by a MetricSpec) takes a ScenarioState containing RequestResults and produces a set of Stats (e.g., accuracy, accuracy@5, toxicity, bias, etc.).
- A Runner is the top-level controller that runs the above steps and is driven by a set of RunSpecs.
There are three types of classes:
- Specifications (e.g., AdapterSpec, ExecutionSpec, RunSpec): specified manually by the user. Note that Scenario and Metric are subclassed, so they are constructed by ObjectSpec, which specifies the subclass name and a free-form dictionary of arguments.
- States (e.g., Instance, ScenarioState, Request, RequestResult): these are automatically generated and can be serialized.
- Controllers (e.g., Scenario, Adapter, Executor, Metric, Runner): these have the bulk of the code and should not be serialized.
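To make the flow from Instances to Stats concrete, here is a toy sketch of the adapt, execute, and metric steps. Every class and function below is a simplified, hypothetical stand-in written for this illustration; the real classes in src/benchmark have more fields and different signatures, and the "model" here is a fake function rather than an API call.

```python
from dataclasses import dataclass
from typing import List, Optional

# NOTE: simplified, hypothetical stand-ins, not the repository's classes.


@dataclass(frozen=True)
class Reference:
    output: str
    is_correct: bool = False


@dataclass(frozen=True)
class Instance:
    id: str
    input: str
    references: List[Reference]


@dataclass(frozen=True)
class Request:
    model: str
    prompt: str
    temperature: float = 0.0


@dataclass(frozen=True)
class RequestState:
    instance: Instance
    request: Request
    result: Optional[str] = None  # filled in by the executor step


def adapt(instances, model):
    """Adapter step: turn each Instance into a RequestState holding a Request."""
    return [RequestState(inst, Request(model=model, prompt=inst.input)) for inst in instances]


def execute(states, complete):
    """Executor step: attach a result to every RequestState."""
    return [RequestState(s.instance, s.request, complete(s.request)) for s in states]


def exact_match_accuracy(states):
    """Metric step: fraction of results that equal a correct Reference."""
    hits = 0
    for s in states:
        correct = {r.output for r in s.instance.references if r.is_correct}
        hits += s.result in correct
    return hits / len(states)


instances = [
    Instance("id0", "2 + 2 =", [Reference("4", True), Reference("5")]),
    Instance("id1", "3 + 3 =", [Reference("6", True)]),
]
fake_model = lambda request: "4"  # stand-in for a real API call
states = execute(adapt(instances, "hypothetical/model"), fake_model)
accuracy = exact_match_accuracy(states)
print(accuracy)  # 0.5
```

The data classes are plain and serializable, while the functions carry the logic, which mirrors the split between states and controllers described above.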
In order to implement new scenarios:
- Create a new Python scenario file in the scenarios folder.
- Within the scenario file, create a Scenario class, e.g. YourScenario.
- YourScenario should have a function, get_instances, which returns a list of Instance objects, each of which has a list of (potentially only one) Reference answers. You may have CORRECT_TAG in a Reference instance's tags argument, indicating that it is the correct answer. In addition, you must specify the split of the Instance as one of the TRAIN_SPLIT, VALID_SPLIT, or TEST_SPLIT constants as in scenario.py.
  - Note that you need not enumerate every possible correct answer (nor must there even necessarily be a correct answer). If necessary, define a new metric in metric.py if one does not exist for your evaluation type.
- Make sure to document your scenario well with a clear docstring.
- In addition, specify its name, description, and tags and define a class __init__ function even if it is simply pass.
- Define a function get_specname_spec in run_specs.py to retrieve a ScenarioSpec for your scenario using a class name corresponding to the Python path of the class (e.g. benchmark.scenarios.your_scenario.YourScenario) and any arguments which must be passed as a dictionary of args.
- Have the get_specname_spec function retrieve an AdapterSpec for your scenario specifying the type of language model generation which must be performed for the task.
- Define a get_metric_spec function to retrieve one or more MetricSpec objects for your task, specifying the classname with the Python path of the object, with the same arguments as the ScenarioSpec constructor.
- Have the get_specname_spec function return a RunSpec object, with a name corresponding to the scenario name and any patterns to match in curly braces, a scenario_spec, an adapter_spec, metric_specs, and groups.
- Add the scenario to __init__.py.
- Attempt to run your task with venv/bin/benchmark-run -r yourscenarioname:arg=value where yourscenarioname matches the name specified in YourScenario.
- Add the spec to dictionary CANONICAL_RUN_SPEC_FUNCS in run_specs.py.
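The scenario steps above can be sketched as follows. To keep the example runnable on its own, Reference, Instance, and the split and tag constants are hypothetical stand-ins defined inline (including their string values); in the repository you would import the real definitions from scenario.py and subclass Scenario instead. The toy task and the rule used to assign splits are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical stand-ins so the sketch runs on its own; the real constants and
# classes live in scenario.py and may differ in names of fields and values.
TRAIN_SPLIT, VALID_SPLIT, TEST_SPLIT = "train", "valid", "test"
CORRECT_TAG = "correct"


@dataclass
class Reference:
    output: str
    tags: List[str] = field(default_factory=list)


@dataclass
class Instance:
    input: str
    references: List[Reference]
    split: str


class YourScenario:
    """Toy arithmetic scenario: each instance asks for the sum of two numbers."""

    name = "yourscenarioname"
    description = "Toy addition questions (illustrative only)"
    tags = ["toy"]

    def __init__(self, max_operand: int = 2):
        self.max_operand = max_operand

    def get_instances(self) -> List[Instance]:
        instances = []
        for a in range(self.max_operand + 1):
            for b in range(self.max_operand + 1):
                # Arbitrary rule for this sketch: even sums train, odd sums test.
                split = TRAIN_SPLIT if (a + b) % 2 == 0 else TEST_SPLIT
                references = [
                    Reference(str(a + b), tags=[CORRECT_TAG]),
                    Reference(str(a + b + 1)),  # distractor, no CORRECT_TAG
                ]
                instances.append(Instance(f"{a} + {b} =", references, split))
        return instances


instances = YourScenario().get_instances()
print(len(instances))  # 9
```

Each Instance carries its split and a list of References, and only the correct Reference is tagged, which is the shape the steps above ask get_instances to return.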
To apply data augmentation, create a DataAugmenterSpec with a list of PerturbationSpecs and pass it into RunSpec. The following is an example:
data_augmenter_spec = DataAugmenterSpec(
    perturbation_specs=[
        PerturbationSpec(
            class_name="benchmark.augmentations.perturbation.ExtraSpacePerturbation",
            args={"num_spaces": 5},
        )
    ],
    should_perturb_references=False,
    should_augment_train_instances=False,
    should_include_original_train=False,
    should_augment_eval_instances=True,
    should_include_original_eval=True,
)
run_spec = RunSpec(
    ...
    data_augmenter_spec=data_augmenter_spec
)
In the example above, the DataPreprocessor will augment the set of evaluation instances by perturbing the original set of instances with the ExtraSpacePerturbation, where each space in the text is replaced with num_spaces spaces.
We currently only support applying a single perturbation to an instance; chaining multiple perturbations on a single instance is not supported.
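The effect of the configuration above can be illustrated with a small standalone sketch. Both functions are hypothetical helpers written for this example, not the repository's implementation, and the handling of include_original reflects an assumption about what should_include_original_eval means: the perturbed copies are added alongside the originals rather than replacing them.

```python
from typing import List


def add_extra_spaces(text: str, num_spaces: int) -> str:
    """Replace every single space character with num_spaces spaces."""
    return text.replace(" ", " " * num_spaces)


def augment_eval(texts: List[str], num_spaces: int, include_original: bool = True) -> List[str]:
    """Return perturbed texts, optionally preceded by the unmodified originals."""
    perturbed = [add_extra_spaces(t, num_spaces) for t in texts]
    return list(texts) + perturbed if include_original else perturbed


augmented = augment_eval(["a b c"], num_spaces=5)
print(len(augmented))  # 2
print(repr(augmented[1]))
```

With num_spaces set to 5, the two spaces in "a b c" each become five spaces, so the perturbed text grows from 5 to 13 characters.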
To add a new perturbation to the framework, create a new file at src/benchmark/augmentations with the name <Name of perturbation>_perturbation.py, e.g., typo_perturbation.py. Inside the file, create a new class (name it <Name of the perturbation>Perturbation, e.g., TypoPerturbation) that extends the abstract class Perturbation and implement the perturb method, which takes in text and outputs the perturbed text.
Add your new perturbation to src/benchmark/__init__.py.
Add a test for the new perturbation in test_perturbation.py.
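A new perturbation and its test might look like the sketch below. The abstract Perturbation class here is a stand-in defined inline so the example runs on its own (the repository's class may require additional members), and LowerCasePerturbation is a hypothetical example, not an existing perturbation.

```python
from abc import ABC, abstractmethod


class Perturbation(ABC):
    """Stand-in for the repository's abstract class; the real one may differ."""

    @abstractmethod
    def perturb(self, text: str) -> str:
        """Take in text and return the perturbed text."""


class LowerCasePerturbation(Perturbation):
    """Hypothetical example: lowercases the input text."""

    name = "lowercase"

    def perturb(self, text: str) -> str:
        return text.lower()


def test_lowercase_perturbation():
    # Style of test that could be added to test_perturbation.py.
    assert LowerCasePerturbation().perturb("Hello World") == "hello world"


test_lowercase_perturbation()
```

Because perturb is abstract, a subclass that forgets to implement it cannot be instantiated, which surfaces the mistake early.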
- Give the tokenizer a name. Use the same name that's used in Hugging Face (e.g., "EleutherAI/gpt-j-6B").
- In HuggingFaceTokenizers, we load and cache tokenizers in memory. Add logic to handle the tokenizer in the load_tokenizer method.
- Add a test in test_huggingface_tokenizer.py to make sure we can load the tokenizer from Hugging Face.
- Add a new class <Name of tokenizer>WindowService in file <Name of tokenizer>_window_service.py. Follow what we did for GPTJWindowService.
- Import the new WindowService and map the model(s) to it in WindowServiceFactory.
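The load-and-cache pattern mentioned in the steps above can be sketched generically. TokenizerCache and fake_loader are hypothetical names for this illustration and do not reflect how HuggingFaceTokenizers is actually written; in practice the loader could be something like transformers.AutoTokenizer.from_pretrained, which is omitted here so the sketch has no third-party dependencies.

```python
import threading
from typing import Any, Callable, Dict, List


class TokenizerCache:
    """Loads each tokenizer at most once and keeps it in memory."""

    def __init__(self, loader: Callable[[str], Any]):
        self._loader = loader
        self._tokenizers: Dict[str, Any] = {}
        self._lock = threading.Lock()

    def get(self, name: str) -> Any:
        # The lock prevents two threads from loading the same tokenizer twice.
        with self._lock:
            if name not in self._tokenizers:
                self._tokenizers[name] = self._loader(name)
            return self._tokenizers[name]


load_calls: List[str] = []


def fake_loader(name: str) -> dict:
    """Stand-in for a real (slow) tokenizer download."""
    load_calls.append(name)
    return {"name": name}


cache = TokenizerCache(fake_loader)
first = cache.get("EleutherAI/gpt-j-6B")
second = cache.get("EleutherAI/gpt-j-6B")
print(len(load_calls))  # 1
```

The second call returns the cached object without invoking the loader again, which is the behavior a loading test would want to confirm.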
Examples of running the benchmark:
venv/bin/benchmark-run
venv/bin/benchmark-run -r mmlu:subject=philosophy
venv/bin/benchmark-run -r synthetic_reasoning_natural:difficulty=easy
venv/bin/benchmark-run -r twitter_aae:demographic=aa
venv/bin/benchmark-run -r copyright:datatag=pilot
venv/bin/benchmark-run -r disinformation:capability=reiteration
venv/bin/benchmark-run -r wikifact:k=2,subject=P31
venv/bin/benchmark-run -r code:dataset=APPS
venv/bin/benchmark-run -r the_pile:subset=OpenSubtitles
venv/bin/benchmark-run -r wikifact:subject=P31
venv/bin/benchmark-run -r raft:subset=ade_corpus_v2
venv/bin/benchmark-run -r natural_qa:mode=closedbook
venv/bin/benchmark-run -r natural_qa:mode=openbook-longans
venv/bin/benchmark-run -r quac
venv/bin/benchmark-run -r wikitext_103
venv/bin/benchmark-run -r blimp:phenomenon=irregular_forms
venv/bin/benchmark-run -r narrative_qa
venv/bin/benchmark-run -r news_qa
venv/bin/benchmark-run -r imdb
You can also run the benchmark using a local proxy, in which case you have to first start a local server (see instructions above for more details).
To estimate token usage without making any requests, append the --dry-run option:
venv/bin/benchmark-run -r <RunSpec to estimate token usage> --dry-run
For example, running venv/bin/benchmark-run -r real_toxicity_prompts --dry-run outputs:
Stats {
MetricName(name='estimated_num_tokens_cost', k=None, split=None, sub_split=None, perturbation=None)[min=505.000, mean=514.957, max=536.000, sum=514957.000 (1000)]
}
where sum indicates the estimated total number of tokens used for the specific RunSpec.
For the OpenAI models, we use a GPT-2 Tokenizer to estimate the token usage. The tokenizer will be downloaded and cached when running a dry run.
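To read the dry-run summary programmatically, a small parser such as the one below can extract the numbers. This is an illustrative sketch: the regular expression is written against the sample output shown above, and the interpretation of the trailing "(1000)" as the number of values is an assumption, supported by the fact that mean times 1000 equals sum.

```python
import math
import re

STATS_PATTERN = re.compile(
    r"\[min=(?P<min>[\d.]+), mean=(?P<mean>[\d.]+), "
    r"max=(?P<max>[\d.]+), sum=(?P<sum>[\d.]+) \((?P<count>\d+)\)\]"
)


def parse_stat(line: str) -> dict:
    """Extract min, mean, max, sum and count from one Stats line."""
    match = STATS_PATTERN.search(line)
    if match is None:
        raise ValueError("line does not contain a stats summary")
    stats = {key: float(value) for key, value in match.groupdict().items()}
    stats["count"] = int(match.group("count"))
    return stats


line = (
    "MetricName(name='estimated_num_tokens_cost', k=None, split=None, "
    "sub_split=None, perturbation=None)"
    "[min=505.000, mean=514.957, max=536.000, sum=514957.000 (1000)]"
)
stats = parse_stat(line)
print(stats["sum"])  # 514957.0
# Consistency check: mean * count should reproduce sum.
print(math.isclose(stats["mean"] * stats["count"], stats["sum"]))  # True
```

Here the estimate is roughly 515 tokens per request across 1000 requests, for a total of 514,957 tokens.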
- ssh sc.
- Create a screen session: screen -S benchmarking.
- Use a john to run the suite: nlprun --priority high -c 8 -g 0 --memory 64g.
- Go to the source code directory: cd /u/scr/nlp/crfm/benchmarking/benchmarking. We have 700 GB of disk space total on /u/scr/nlp/crfm.
- Pull the latest changes: git pull.
- Activate the Conda environment: conda activate crfm_benchmarking.
  - Run pip install -e . if there are new dependencies to install.
- Run benchmark-present-all.sh: bash scripts/benchmark-present-all.sh --max-eval-instances 1000 --num-threads 1 --priority 2 --local.
- Exit the screen session: ctrl-a, then d.
- To check on the screen session: screen -r benchmarking.
- ssh sc.
- Create a screen session: screen -S together.
- Use a john to run the suite: nlprun --priority high -c 8 -g 0 --memory 64g.
- cd /u/scr/nlp/crfm/benchmarking/benchmarking.
- Activate the Conda environment: conda activate crfm_benchmarking.
- Do a dry run to generate RequestStates for all the Together models: bash scripts/generate-together-requests.sh --max-eval-instances 1000 --priority 2 --local.
- Exit the screen session: ctrl-a, then d.
- Check on the dry run by streaming the logs: tail -f dryrun_<Name of together model>.log.
- The dry run results will be written to benchmark_output/runs/together.
- Once the dry run is done, run python3 scripts/together/together_export_requests.py benchmark_output/runs/together prod_env/cache/together.sqlite --output-path requests.jsonl. This command will generate a requests.jsonl that contains requests that are not in the cache (prod_env/cache/together.sqlite).
- Upload requests.jsonl to CodaLab:
  - Log on to CodaLab: cl work main::0xbd9f3df457854889bda8ac114efa8061.
  - Upload by running cl upload requests.jsonl.
- Share the link to the CodaLab bundle with our collaborators.
- ssh scdt.
- cd /u/scr/nlp/crfm/benchmarking/benchmarking.
- Download the results from CodaLab: cl download <UUID of the results bundle>.
- Run: python3 scripts/together/together_import_results.py <Path to results jsonl file> prod_env/cache/together.sqlite. This will update the cache with requests and their results.
- Run venv/bin/benchmark-present --output-path src/proxy/static/benchmark_output.
- Visit the benchmarking status page.
- ssh scdt.
- cd /u/scr/nlp/crfm/benchmarking/benchmarking.
- Create a screen session: screen -S reproducible.
- conda activate crfm_benchmarking.
- Run python3 scripts/verify_reproducibility.py --models-to-run openai/davinci openai/code-cushman-001 together/gpt-neox-20b --conf-path src/benchmark/presentation/run_specs.conf --max-eval-instances 1000 --priority 2 &> reproducible.log.
- Check the result at reproducible.log.
To contribute to this project, install the dependencies and git hook scripts:
./pre-commit.sh && pre-commit install
To run all unit tests:
python -m pytest
Append -vv to output the full diff and results:
python -m pytest -vv
To run a specific file, simply specify the path:
python -m pytest <path/to/file> -vv