Commit: resolve merge conflict

teetone committed Sep 4, 2022
2 parents ef55f7f + eda78b8 commit 6dfc129
Showing 130 changed files with 14,947 additions and 3,921 deletions.
3 changes: 2 additions & 1 deletion .github/workflows/test.yml
@@ -46,7 +46,8 @@ jobs:
       # Installs dependencies and performs static code checks
       - run: ./pre-commit.sh
       - name: Run tests
-        run: venv/bin/pytest
+        # Skip ICE tokenizer tests. GHA is having trouble downloading ice_text.model.
+        run: venv/bin/pytest --ignore src/benchmark/window_services/test_ice_window_service.py --ignore src/proxy/clients/test_ice_tokenizer_client.py
         env:
           TEST: ${{ matrix.test }}
           VERSION: ${{ github.head_ref || 'main' }}
1 change: 1 addition & 0 deletions .gitignore
@@ -10,6 +10,7 @@ benchmark_output
proxy_api_key.txt
microsoft_client.lock
*.log
*.out
nltk_data/

# Percy's stuff
434 changes: 17 additions & 417 deletions README.md

Large diffs are not rendered by default.

1 change: 0 additions & 1 deletion ai21_random_window.py

This file was deleted.

103 changes: 103 additions & 0 deletions docs/benchmark.md
@@ -0,0 +1,103 @@
# Running the benchmark

Examples of running the benchmark:

    venv/bin/benchmark-run
    venv/bin/benchmark-run -r mmlu:subject=philosophy --suite SUITE_NAME
    venv/bin/benchmark-run -r synthetic_reasoning_natural:difficulty=easy --suite SUITE_NAME
    venv/bin/benchmark-run -r twitter_aae:demographic=aa --suite SUITE_NAME
    venv/bin/benchmark-run -r copyright:datatag=pilot --suite SUITE_NAME
    venv/bin/benchmark-run -r disinformation:capability=reiteration --suite SUITE_NAME
    venv/bin/benchmark-run -r wikifact:k=2,subject=P31 --suite SUITE_NAME
    venv/bin/benchmark-run -r code:dataset=APPS --suite SUITE_NAME
    venv/bin/benchmark-run -r the_pile:subset=OpenSubtitles --suite SUITE_NAME
    venv/bin/benchmark-run -r wikifact:subject=P31 --suite SUITE_NAME
    venv/bin/benchmark-run -r raft:subset=ade_corpus_v2 --suite SUITE_NAME
    venv/bin/benchmark-run -r natural_qa:mode=closedbook --suite SUITE_NAME
    venv/bin/benchmark-run -r natural_qa:mode=openbook-longans --suite SUITE_NAME
    venv/bin/benchmark-run -r quac --suite SUITE_NAME
    venv/bin/benchmark-run -r wikitext_103 --suite SUITE_NAME
    venv/bin/benchmark-run -r blimp:phenomenon=irregular_forms --suite SUITE_NAME
    venv/bin/benchmark-run -r narrative_qa --suite SUITE_NAME
    venv/bin/benchmark-run -r news_qa --suite SUITE_NAME
    venv/bin/benchmark-run -r imdb --suite SUITE_NAME
    venv/bin/benchmark-run -r twitter_aae:demographic=aa --suite SUITE_NAME

You can also run the benchmark using a local proxy, in which case you have to
first start a local server (see instructions above for more details).

## To estimate token usage

To estimate token usage without making any requests, append the `--dry-run` option:

    venv/bin/benchmark-run -r <RunSpec to estimate token usage> --dry-run

For example, running `venv/bin/benchmark-run -r real_toxicity_prompts --dry-run` outputs:

```text
Stats {
MetricName(name='estimated_num_tokens_cost', k=None, split=None, sub_split=None, perturbation=None)[min=505.000, mean=514.957, max=536.000, sum=514957.000 (1000)]
}
```

where `sum` indicates the estimated total number of tokens used for the specific `RunSpec`.

For the OpenAI models, we use a
[GPT-2 Tokenizer](https://github.com/stanford-crfm/benchmarking/blob/master/src/proxy/tokenizer/openai_token_counter.py#L12)
to estimate the token usage. The tokenizer will be downloaded and cached when running a dry run.
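
For a quick, standalone sanity check of token counts (independent of the benchmark's own token counter), you can tokenize a prompt directly with the Hugging Face `transformers` GPT-2 tokenizer:

```python
# Rough token-count estimate with the Hugging Face GPT-2 tokenizer.
# This is only a sanity check; the dry run above is the authoritative path.
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
prompt = "The quick brown fox jumps over the lazy dog."
print(len(tokenizer.encode(prompt)))  # number of GPT-2 tokens in the prompt
```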

## Final benchmarking (Infrastructure team only)

1. `ssh sc`.
1. Go to the source code directory: `cd /u/scr/nlp/crfm/benchmarking/benchmarking`.
We have 700 GB of disk space total on `/u/scr/nlp/crfm`.
1. Pull the latest changes: `git pull`.
1. Activate the Conda environment: `conda activate crfm_benchmarking`
1. Run `./pre-commit.sh` if there are new dependencies to install.
1. Run `bash scripts/run-all-stanford.sh --suite <Suite name>` e.g.,
`bash scripts/run-all-stanford.sh --suite v1`.
1. After the run for all the models has finished, run the remaining commands the script outputs.

## Offline evaluation for `TogetherClient` models

### Exporting requests

1. `ssh sc`.
1. Create a screen session: `screen -S together`.
1. Use a john machine to run the suite:
`nlprun --job-name queriestogether --priority high -a crfm_benchmarking -c 8 -g 0 --memory 64g -w /u/scr/nlp/crfm/benchmarking/benchmarking`.
1. Do a dry run to generate `RequestState`s for all the Together models:
`bash scripts/generate-together-requests.sh --max-eval-instances 1000 --priority 2 --local`.
1. Detach from the screen session: `ctrl-a d`.
1. Check on the dry run by streaming the logs: `tail -f dryrun_<Name of together model>.log`.
1. The dry run results will be written to `benchmark_output/runs/together`.
1. Once the dry run is done, run
`python3 scripts/together/together_export_requests.py benchmark_output/runs/together prod_env/cache/together.sqlite --output-path requests.jsonl`.
This command will generate a `requests.jsonl` that contains requests that are not in the cache (`prod_env/cache/together.sqlite`).
1. Upload `requests.jsonl` to CodaLab:
    1. Log on to CodaLab: `cl work main::0xbd9f3df457854889bda8ac114efa8061`.
    1. Upload by running `cl upload requests.jsonl`.
1. Share the link to the CodaLab bundle with our collaborators.

### Importing results

1. `ssh scdt`
1. `cd /u/scr/nlp/crfm/benchmarking/benchmarking`
1. Download the results from CodaLab: `cl download <UUID of the results bundle>`.
1. Run: `python3 scripts/together/together_import_results.py <Path to results jsonl file> prod_env/cache/together.sqlite`.
This will update the cache with requests and their results.

## To visualize results at crfm-models.stanford.edu

1. Run `venv/bin/benchmark-present --output-path src/proxy/static/benchmark_output`.
1. Visit the [benchmarking status page](https://crfm-models.stanford.edu/static/benchmarking.html).

### To verify that the Scenario construction and generation of prompts are reproducible

1. `ssh scdt`.
1. `cd /u/scr/nlp/crfm/benchmarking/benchmarking`.
1. Create a screen session: `screen -S reproducible`.
1. `conda activate crfm_benchmarking`.
1. Run `python3 scripts/verify_reproducibility.py --models-to-run openai/davinci openai/code-cushman-001 together/gpt-neox-20b
--conf-path src/benchmark/presentation/run_specs.conf --max-eval-instances 1000 --priority 2 &> reproducible.log`.
1. Check the result at `reproducible.log`.
153 changes: 153 additions & 0 deletions docs/code.md
@@ -0,0 +1,153 @@
# Code structure

Here's a birds-eye view of how the benchmarking process interacts with the main
classes (see `benchmark`):

- A `Scenario` (given by a `ScenarioSpec`) specifies a task and a data
distribution. It specifies a set of `Instance`s, where each `Instance` has
an input (e.g., question) and a set of `Reference` outputs (e.g., multiple
choice answers).

- A `DataPreprocessor` takes in a `Scenario` and produces a list of `Instance`s.
Each `Instance` is given a unique ID. The set of `Instance`s is augmented
according to `DataAugmenterSpec`.

- An `Adapter` (given by an `AdapterSpec`) takes a list of `Instance`s and
adapts it to a set of `Request`s to the API (e.g., the model, temperature,
number of in-context training examples). Formally, the output
is a `ScenarioState` containing a set of `RequestState`s, where each
`RequestState` consists of a `Request` and any metadata used to track the
role of this `Request` (e.g., the relevant `Instance` and `Reference`).

- An `Executor` (given by an `ExecutionSpec`) executes each `Request` in the
`RequestState` to produce a `RequestResult` for each one; everything is
encapsulated in a `ScenarioState`.

- A `Metric` (given by a `MetricSpec`) takes a `ScenarioState` containing
`RequestResult`s and produces a set of `Stat`s (e.g., accuracy, accuracy@5,
toxicity, bias, etc.).

- A `Runner` is the top-level controller that runs the above steps and is
driven by a set of `RunSpec`s.

There are three types of classes:

- Specifications (e.g., `AdapterSpec`, `ExecutionSpec`, `RunSpec`):
specified manually by the user. Note that `Scenario` and `Metric` are
subclassed, so they are constructed by `ObjectSpec`, which specifies the
subclass name and a free-form dictionary of arguments.
- States (e.g., `Instance`, `ScenarioState`, `Request`, `RequestResult`): these
are automatically generated and can be serialized.
- Controllers (e.g., `Scenario`, `Adapter`, `Executor`, `Metric`, `Runner`):
these have the bulk of the code and should not be serialized.

## Adding new scenarios

In order to implement new scenarios:

1. Create a new Python scenario file in the `scenarios` folder.
2. Within the scenario file, create a `Scenario` class, e.g. `YourScenario`.
3. `YourScenario` should implement `get_instances`, a method returning a
list of `Instance` objects. Each `Instance` must have a list of (potentially just one)
`Reference` answers: a correct answer may be indicated with a `CORRECT_TAG` in
a `Reference` instance's `tags` argument. In addition, you
must specify the `split` of the `Instance` as one of the `TRAIN_SPLIT`,
`VALID_SPLIT`, or `TEST_SPLIT` constants defined in `scenario.py` (see the sketch after this list).
4. Note that you need not enumerate every possible correct answer (nor must
there even necessarily be a correct answer).
5. Make sure to document your scenario well with a clear docstring.
6. In addition, specify its `name`, `description`, and `tags`, and define a class
`__init__` method, even if it simply does `pass`.
7. Define a function `get_specname_spec` in `run_specs.py` to retrieve a `ScenarioSpec`
for your scenario using a class name corresponding to the Python path of
the class (e.g. `benchmark.scenarios.your_scenario.YourScenario`) and any
arguments which must be passed as a dictionary of `args`.
8. Have the `get_specname_spec` function retrieve an `AdapterSpec` for your
scenario specifying the type of language model generation which must be
performed for the task.
9. Identify the appropriate metric for your task in one of the `*_metrics.py` files.
If the metric you'd like to use does not exist, follow the directions in [Adding new metrics](#adding-new-metrics).
Many will be in `basic_metrics.py`.
10. Have a `get_metric_spec` function retrieve one or more `MetricSpec`
objects for your task, specifying the classname with the Python path of
the object, with the same arguments as the `ScenarioSpec` constructor.
11. Have the `get_specname_spec` function return a `RunSpec` object, with a
`name` corresponding to the scenario name and any patterns to match in
curly braces, a `scenario_spec`, an `adapter_spec`, `metric_specs`,
and `groups`.
12. Add the scenario to `__init__.py`.
13. Attempt to run your task with
`venv/bin/benchmark-run -r yourscenarioname:arg=value`, where
`yourscenarioname` matches the `name` specified in `YourScenario`.
14. Add the spec to dictionary `CANONICAL_RUN_SPEC_FUNCS` in `run_specs.py`.
15. Update `src/proxy/static/contamination.yaml` with models that we trained on your scenario (i.e. contaminated).
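
For steps 2–6, a minimal sketch of what `YourScenario` might look like is shown below. The import path (`benchmark.scenarios.scenario`) and the exact `Instance`/`Reference` constructor arguments are assumptions based on the descriptions above, so check `scenario.py` for the real signatures.

```python
# Minimal, illustrative scenario sketch -- not actual project code. Class
# attributes and constructor arguments are assumptions to verify against
# scenario.py.
from typing import List

from benchmark.scenarios.scenario import (
    CORRECT_TAG,
    TEST_SPLIT,
    Instance,
    Reference,
    Scenario,
)


class YourScenario(Scenario):
    """Toy scenario: answer capital-city questions."""

    name = "your_scenario"
    description = "Answer simple capital-city questions."
    tags = ["question_answering"]

    def __init__(self):
        pass

    def get_instances(self) -> List[Instance]:
        # Each Instance has an input, one or more References (the correct one
        # tagged with CORRECT_TAG), and a split.
        return [
            Instance(
                input="What is the capital of France?",
                references=[Reference(output="Paris", tags=[CORRECT_TAG])],
                split=TEST_SPLIT,
            )
        ]
```

Steps 7–11 then wrap this class in a `ScenarioSpec`, an `AdapterSpec`, and one or more `MetricSpec`s inside the `RunSpec` returned by your `get_specname_spec` function in `run_specs.py`.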


## Adding new metrics

To add a new metric:
1. If the metric is task-specific, create a new `yourtask_metrics.py` file.
Otherwise, if the metric is generic and likely to be widely used, add it
to `basic_metrics.py`.
2. If you are creating a task-specific metric, create a `YourTaskMetric`
which inherits from `Metric` in `metric.py`.
3. Define methods `__init__` and `evaluate_generation` returning a list of `Stat` objects (see the sketch after this list).
4. Each `Stat` should correspond to a distinct aggregate measurement over the generated examples.
Some may have one metric (e.g. accuracy), while others may quantify multiple aspects
(e.g. multiple distance metrics).
5. For each `value` generated for a `Stat`, add it to `yourstat` using `yourstat.add(value)`.
Usually, there will only be one value for each `Stat`, but multiple can be used, e.g. to show variance.
6. Add your metric to `__init__.py`.
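
Below is a minimal sketch of a task-specific metric. The import paths and the argument names of `evaluate_generation` are assumptions; copy the actual signature from `Metric` in `metric.py` or from an existing metric in `basic_metrics.py`.

```python
# Illustrative task-specific metric -- not actual project code. The
# evaluate_generation signature shown here is an assumption.
from typing import List

from benchmark.metrics.metric import Metric
from benchmark.metrics.metric_name import MetricName
from benchmark.metrics.statistic import Stat


class YourTaskMetric(Metric):
    """Measures how often completions contain a target keyword."""

    def __init__(self, keyword: str):
        self.keyword = keyword

    def evaluate_generation(self, adapter_spec, request_state, metric_service, eval_cache_path) -> List[Stat]:
        stat = Stat(MetricName("keyword_match"))
        for completion in request_state.result.completions:
            # One value per completion; Stat aggregates into min/mean/max/sum.
            stat.add(1 if self.keyword in completion.text else 0)
        return [stat]
```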

## Data augmentations

To apply data augmentation, create a `DataAugmenterSpec` with a list of
`PerturbationSpec`s and pass it into `RunSpec`. The following is an
example:

```python
data_augmenter_spec = DataAugmenterSpec(
    perturbation_specs=[
        PerturbationSpec(
            class_name="benchmark.augmentations.perturbation.ExtraSpacePerturbation",
            args={"num_spaces": 5},
        )
    ],
    should_perturb_references=False,
    should_augment_train_instances=False,
    should_include_original_train=False,
    should_augment_eval_instances=True,
    should_include_original_eval=True,
)
run_spec = RunSpec(
    ...
    data_augmenter_spec=data_augmenter_spec
)
```

In the example above, the `DataPreprocessor` will augment the set of evaluation instances by perturbing
the original set of instances with the `ExtraSpacePerturbation`, where spaces in the text are
replaced with `num_spaces` number of spaces.

We currently only support applying a single perturbation to an instance; chaining
multiple perturbations and applying them to a single instance is not supported.

### Adding a new perturbation

To add a new perturbation to the framework, create a new file at `src/benchmark/augmentations` with the name
`<Name of perturbation>_perturbation.py` e.g., `typo_perturbation.py`. Inside the file, create a new class
(name it `<Name of the perturbation>Perturbation` e.g., `TypoPerturbation`)
that extends the abstract class `Perturbation` and implement the `perturb` method which
takes in text and outputs the perturbed text.
Add your new perturbation to `src/benchmark/__init__.py`.
Add a test for the new perturbation in `test_perturbation.py`.
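
As an illustration, a `TypoPerturbation` along these lines might look like the sketch below. The `name` attribute and the exact `perturb` signature are assumptions; mirror an existing perturbation such as `ExtraSpacePerturbation`.

```python
# Illustrative perturbation that swaps adjacent characters to simulate typos.
# The Perturbation interface details here are assumptions.
import random

from benchmark.augmentations.perturbation import Perturbation


class TypoPerturbation(Perturbation):
    name: str = "typo"

    def __init__(self, prob: float = 0.1):
        self.prob = prob

    def perturb(self, text: str) -> str:
        rng = random.Random(0)  # fixed seed so the perturbation is reproducible
        chars = list(text)
        i = 0
        while i < len(chars) - 1:
            if rng.random() < self.prob:
                # Swap this character with its neighbor.
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
                i += 2
            else:
                i += 1
        return "".join(chars)
```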

## Supporting new Hugging Face tokenizers

1. Give the tokenizer a name. Use the same name that's used in Hugging Face (e.g., "EleutherAI/gpt-j-6B").
2. In `HuggingFaceTokenizers`, we load and cache tokenizers in memory. Add logic to handle
the tokenizer in the `load_tokenizer` method.
3. Add a test in `test_huggingface_tokenizer.py` to make sure we can load the tokenizer from Hugging Face.
4. Add a new class `<Name of tokenizer>WindowService` in file `<Name of tokenizer>_window_service.py`.
Follow what we did for `GPTJWindowService` (see the sketch after this list).
5. Import the new `WindowService` and map the model(s) to it in `WindowServiceFactory`.
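
For reference, a new window service usually just pins down a few tokenizer-specific values. The sketch below is illustrative only: the `LocalWindowService` base class, its import path, and the property names are assumptions, so copy the actual structure from `GPTJWindowService`.

```python
# Illustrative window service for a hypothetical tokenizer
# "your-org/your-tokenizer". Base class, import path, and property names are
# assumptions to check against GPTJWindowService.
from benchmark.window_services.local_window_service import LocalWindowService


class YourTokenizerWindowService(LocalWindowService):
    @property
    def tokenizer_name(self) -> str:
        return "your-org/your-tokenizer"

    @property
    def max_sequence_length(self) -> int:
        return 2048

    @property
    def max_request_length(self) -> int:
        return self.max_sequence_length + 1

    @property
    def end_of_text_token(self) -> str:
        return "<|endoftext|>"

    @property
    def prefix_token(self) -> str:
        return self.end_of_text_token
```
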
120 changes: 120 additions & 0 deletions docs/deployment.md
@@ -0,0 +1,120 @@
# Deploying to production (for maintainers)

The production version of the proxy is running on `crfm-models.stanford.edu`;
you need to request permission to get SSH access.

## One-time setup

This has already been done, but just for the record:

    laptop:$ ssh crfm-models.stanford.edu
    crfm-models:$ cd /home
    crfm-models:$ git clone git@github.com:stanford-crfm/benchmarking
    crfm-models:$ cd benchmarking
    crfm-models:$ mkdir prod_env
    crfm-models:$ echo '{"api_key": "crfm"}' > prod_env/accounts.jsonl
    laptop:$ rsync -arvz prod_env/credentials.conf crfm-models.stanford.edu:/home/benchmarking/prod_env

## Perspective API

We use Google's [Perspective API](https://www.perspectiveapi.com) to calculate the toxicity of completions.
To send requests to PerspectiveAPI, we need to generate an API key from GCP. Follow the
[Get Started guide](https://developers.perspectiveapi.com/s/docs-get-started)
to request the service and the [Enable the API guide](https://developers.perspectiveapi.com/s/docs-enable-the-api)
to generate the API key. Once you have a valid API key, add an entry to `credentials.conf`:

```
perspectiveApiKey: <Generated API key>
```

By default, Perspective API allows only 1 query per second. Fill out this
[form](https://developers.perspectiveapi.com/s/request-quota-increase) to increase the request quota.

The [current API key](https://console.cloud.google.com/apis/api/commentanalyzer.googleapis.com/overview?authuser=1&project=hai-gcp-models)
we are using in production was created with the `hai-gcp-models` account and allows 200 queries per second.
**The API key expires on 7/30/2022.**

## SSL

The SSL certificate, CSR, and private key for crfm-models.stanford.edu are stored at `/home/ssl`.
**The current SSL certificate expires on 12/30/2022.**

To renew the SSL certificate, follow these steps:

1. Fill out this [form](https://certificate.stanford.edu/cert-request):
    1. Log on with your SUNet ID. You must be an admin in order to submit a request.
    1. For `Server Name`, put `crfm-models.stanford.edu`.
    1. For `Server type`, select `OTHER`.
    1. For `Contact group/mailman address`, enter your Stanford email address.
    1. Under `Copy and paste your CSR`, paste the content of `/home/ssl/public.csr`.
    1. Leave the optional fields blank and click `Submit`.
    1. You should receive your certificate by email within 2 business days.
2. Once you receive the SSL cert, concatenate the contents of `X509 Certificate only, Base64 encoded`
with the contents of `X509 Intermediates/root only Reverse, Base64 encoded`
and place it at path `/home/ssl/crfm-models.crt`. `crfm-models.crt` should look something like this:

```text
-----BEGIN CERTIFICATE-----
(Your Primary SSL certificate: .crt)
-----END CERTIFICATE-----
-----BEGIN CERTIFICATE-----
(Your Intermediate certificate: reversed.crt)
-----END CERTIFICATE-----
```
3. Restart the server.
4. Open the [website](https://crfm-models.stanford.edu) in a browser and verify the connection is secure.

### Misplaced private key or CSR

If, for whatever reason, the private key or CSR is misplaced, generate new ones by running:

`sudo openssl req -new -nodes -newkey rsa:2048 -keyout private.key -out public.csr`

and fill out the form:

```text
Country Name (2 letter code) [AU]:US
State or Province Name (full name) [Some-State]:California
Locality Name (eg, city) []:Stanford
Organization Name (eg, company) [Internet Widgits Pty Ltd]:Stanford University
Organizational Unit Name (eg, section) []:CRFM
Common Name (e.g. server FQDN or YOUR name) []:crfm-models.stanford.edu
Email Address []:
Please enter the following 'extra' attributes
to be sent with your certificate request
A challenge password []:
An optional company name []:
```

Then, follow the steps above to request a new SSL certificate.

## Deployment

Every time we need to deploy, do the following.

Update the code:

    laptop:$ ssh crfm-models.stanford.edu
    crfm-models:$ cd /home/benchmarking
    crfm-models:$ git pull
    crfm-models:$ ./pre-commit.sh

If everything looks okay:

    ssh crfm-models.stanford.edu

    # Switch into the screen session
    crfm-models:$ screen -r deploy

    # Hit ctrl-c to kill the existing process
    # Restart the server
    sudo venv/bin/proxy-server -p 443 --ssl-key-file /home/ssl/private.key --ssl-cert-file /home/ssl/crfm-models.crt --workers 16 &> server.log

    # Detach from the screen session: ctrl-a d

The recommended number of Gunicorn workers is twice the number of cores;
crfm-models.stanford.edu has 8 cores (verified with `nproc`), so we use 8 * 2 = 16 workers.

Double check that the [website](https://crfm-models.stanford.edu) still works.
The server logs can be streamed by running: `tail -f /home/benchmarking/server.log`.