Commit: resolve merge conflict

teetone committed Sep 4, 2022
2 parents ef55f7f + eda78b8 commit 6dfc129
Showing 130 changed files with 14,947 additions and 3,921 deletions.
3 changes: 2 additions & 1 deletion .github/workflows/test.yml
@@ -46,7 +46,8 @@ jobs:
       # Installs dependencies and performs static code checks
       - run: ./pre-commit.sh
       - name: Run tests
-        run: venv/bin/pytest
+        # Skip ICE tokenizer tests. GHA is having trouble downloading ice_text.model.
+        run: venv/bin/pytest --ignore src/benchmark/window_services/test_ice_window_service.py --ignore src/proxy/clients/test_ice_tokenizer_client.py
         env:
           TEST: ${{ matrix.test }}
           VERSION: ${{ github.head_ref || 'main' }}
1 change: 1 addition & 0 deletions .gitignore
@@ -10,6 +10,7 @@ benchmark_output
proxy_api_key.txt
microsoft_client.lock
*.log
*.out
nltk_data/

# Percy's stuff
434 changes: 17 additions & 417 deletions README.md

Large diffs are not rendered by default.

1 change: 0 additions & 1 deletion ai21_random_window.py

This file was deleted.

103 changes: 103 additions & 0 deletions docs/benchmark.md
@@ -0,0 +1,103 @@
# Running the benchmark

Examples of running the benchmark:

    venv/bin/benchmark-run
    venv/bin/benchmark-run -r mmlu:subject=philosophy --suite SUITE_NAME
    venv/bin/benchmark-run -r synthetic_reasoning_natural:difficulty=easy --suite SUITE_NAME
    venv/bin/benchmark-run -r twitter_aae:demographic=aa --suite SUITE_NAME
    venv/bin/benchmark-run -r copyright:datatag=pilot --suite SUITE_NAME
    venv/bin/benchmark-run -r disinformation:capability=reiteration --suite SUITE_NAME
    venv/bin/benchmark-run -r wikifact:k=2,subject=P31 --suite SUITE_NAME
    venv/bin/benchmark-run -r code:dataset=APPS --suite SUITE_NAME
    venv/bin/benchmark-run -r the_pile:subset=OpenSubtitles --suite SUITE_NAME
    venv/bin/benchmark-run -r wikifact:subject=P31 --suite SUITE_NAME
    venv/bin/benchmark-run -r raft:subset=ade_corpus_v2 --suite SUITE_NAME
    venv/bin/benchmark-run -r natural_qa:mode=closedbook --suite SUITE_NAME
    venv/bin/benchmark-run -r natural_qa:mode=openbook-longans --suite SUITE_NAME
    venv/bin/benchmark-run -r quac --suite SUITE_NAME
    venv/bin/benchmark-run -r wikitext_103 --suite SUITE_NAME
    venv/bin/benchmark-run -r blimp:phenomenon=irregular_forms --suite SUITE_NAME
    venv/bin/benchmark-run -r narrative_qa --suite SUITE_NAME
    venv/bin/benchmark-run -r news_qa --suite SUITE_NAME
    venv/bin/benchmark-run -r imdb --suite SUITE_NAME
    venv/bin/benchmark-run -r twitter_aae:demographic=aa --suite SUITE_NAME

You can also run the benchmark using a local proxy, in which case you have to
first start a local server (see instructions above for more details).

## To estimate token usage

To estimate token usage without making any requests, append the `--dry-run` option:

    venv/bin/benchmark-run -r <RunSpec to estimate token usage> --dry-run

For example, running `venv/bin/benchmark-run -r real_toxicity_prompts --dry-run` outputs:

```text
Stats {
MetricName(name='estimated_num_tokens_cost', k=None, split=None, sub_split=None, perturbation=None)[min=505.000, mean=514.957, max=536.000, sum=514957.000 (1000)]
}
```

where `sum` indicates the estimated total number of tokens used for the specific `RunSpec`.

For the OpenAI models, we use a
[GPT-2 Tokenizer](https://github.com/stanford-crfm/benchmarking/blob/master/src/proxy/tokenizer/openai_token_counter.py#L12)
to estimate the token usage. The tokenizer will be downloaded and cached when running a dry run.
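
For a quick, standalone sanity check of token counts (independent of the benchmark's own token counter), you can tokenize a prompt directly with the Hugging Face `transformers` GPT-2 tokenizer:

```python
# Rough token-count estimate with the Hugging Face GPT-2 tokenizer.
# This is only a sanity check; the dry run above is the authoritative path.
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
prompt = "The quick brown fox jumps over the lazy dog."
print(len(tokenizer.encode(prompt)))  # number of GPT-2 tokens in the prompt
```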

## Final benchmarking (Infrastructure team only)

1. `ssh sc`.
1. Go to the source code directory: `cd /u/scr/nlp/crfm/benchmarking/benchmarking`.
We have 700 GB of disk space total on `/u/scr/nlp/crfm`.
1. Pull the latest changes: `git pull`.
1. Activate the Conda environment: `conda activate crfm_benchmarking`
1. Run `./pre-commit.sh` if there are new dependencies to install.
1. Run `bash scripts/run-all-stanford.sh --suite <Suite name>` e.g.,
`bash scripts/run-all-stanford.sh --suite v1`.
1. After the run for all the models has finished, run the remaining commands the script outputs.

## Offline evaluation for `TogetherClient` models

### Exporting requests

1. `ssh sc`.
1. Create a screen session: `screen -S together`.
1. Use a john machine to run the suite:
`nlprun --job-name queriestogether --priority high -a crfm_benchmarking -c 8 -g 0 --memory 64g -w /u/scr/nlp/crfm/benchmarking/benchmarking`.
1. Do a dry run to generate `RequestState`s for all the Together models:
`bash scripts/generate-together-requests.sh --max-eval-instances 1000 --priority 2 --local`.
1. Detach from the screen session: `ctrl-a d`.
1. Check on the dry run by streaming the logs: `tail -f dryrun_<Name of together model>.log`.
1. The dry run results will be written to `benchmark_output/runs/together`.
1. Once the dry run is done, run
`python3 scripts/together/together_export_requests.py benchmark_output/runs/together prod_env/cache/together.sqlite --output-path requests.jsonl`.
This command will generate a `requests.jsonl` that contains requests that are not in the cache (`prod_env/cache/together.sqlite`).
1. Upload `requests.jsonl` to CodaLab:
    1. Log on to CodaLab: `cl work main::0xbd9f3df457854889bda8ac114efa8061`.
    1. Upload by running `cl upload requests.jsonl`.
1. Share the link to the CodaLab bundle with our collaborators.

### Importing results

1. `ssh scdt`
1. `cd /u/scr/nlp/crfm/benchmarking/benchmarking`
1. Download the results from CodaLab: `cl download <UUID of the results bundle>`.
1. Run: `python3 scripts/together/together_import_results.py <Path to results jsonl file> prod_env/cache/together.sqlite`.
This will update the cache with requests and their results.

## To visualize results at crfm-models.stanford.edu

1. Run `venv/bin/benchmark-present --output-path src/proxy/static/benchmark_output`.
1. Visit the [benchmarking status page](https://crfm-models.stanford.edu/static/benchmarking.html).

### To verify that the Scenario construction and generation of prompts are reproducible

1. `ssh scdt`.
1. `cd /u/scr/nlp/crfm/benchmarking/benchmarking`.
1. Create a screen session: `screen -S reproducible`.
1. `conda activate crfm_benchmarking`.
1. Run `python3 scripts/verify_reproducibility.py --models-to-run openai/davinci openai/code-cushman-001 together/gpt-neox-20b
--conf-path src/benchmark/presentation/run_specs.conf --max-eval-instances 1000 --priority 2 &> reproducible.log`.
1. Check the result at `reproducible.log`.
153 changes: 153 additions & 0 deletions docs/code.md
@@ -0,0 +1,153 @@
# Code structure

Here's a birds-eye view of how the benchmarking process interacts with the main
classes (see `benchmark`):

- A `Scenario` (given by a `ScenarioSpec`) specifies a task and a data
distribution. It specifies a set of `Instance`s, where each `Instance` has
an input (e.g., question) and a set of `Reference` outputs (e.g., multiple
choice answers).

- A `DataPreprocessor` takes in a `Scenario` and produces a list of `Instance`s.
Each `Instance` is given a unique ID. The set of `Instance`s is augmented
according to `DataAugmenterSpec`.

- An `Adapter` (given by an `AdapterSpec`) takes a list of `Instance`s and
adapts it to a set of `Request`s to the API (e.g., the model, temperature,
number of in-context training examples). Formally, the output
is a `ScenarioState` containing a set of `RequestState`s, where each
`RequestState` consists of a `Request` and any metadata used to track the
role of this `Request` (e.g., the relevant `Instance` and `Reference`).

- An `Executor` (given by an `ExecutionSpec`) executes each `Request` in the
`RequestState` to produce a `RequestResult` for each one; everything is
encapsulated in a `ScenarioState`.

- A `Metric` (given by a `MetricSpec`) takes a `ScenarioState` containing
`RequestResult`s and produces a set of `Stat`s (e.g., accuracy, accuracy@5,
toxicity, bias, etc.).

- A `Runner` is the top-level controller that runs the above steps and is
driven by a set of `RunSpec`s.

There are three types of classes:

- Specifications (e.g., `AdapterSpec`, `ExecutionSpec`, `RunSpec`):
specified manually by the user. Note that `Scenario` and `Metric` are
subclassed, so they are constructed by `ObjectSpec`, which specifies the
subclass name and a free-form dictionary of arguments.
- States (e.g., `Instance`, `ScenarioState`, `Request`, `RequestResult`): these
are automatically generated and can be serialized.
- Controllers (e.g., `Scenario`, `Adapter`, `Executor`, `Metric`, `Runner`):
these have the bulk of the code and should not be serialized.

## Adding new scenarios

In order to implement new scenarios:

1. Create a new Python scenario file in the `scenarios` folder.
2. Within the scenario file, create a `Scenario` class, e.g. `YourScenario`.
3. `YourScenario` should implement `get_instances`, a method returning a
list of `Instance` objects. Each `Instance` must have a list of (potentially just one)
`Reference` answers: a correct answer may be indicated with a `CORRECT_TAG` in
a `Reference` instance's `tags` argument. In addition, you
must specify the `split` of the `Instance` as one of the `TRAIN_SPLIT`,
`VALID_SPLIT`, or `TEST_SPLIT` constants defined in `scenario.py` (see the sketch after this list).
4. Note that you need not enumerate every possible correct answer (nor must
there even necessarily be a correct answer).
5. Make sure to document your scenario well with a clear docstring.
6. In addition, specify its `name`, `description`, and `tags`, and define a class
`__init__` method, even if it simply does `pass`.
7. Define a function `get_specname_spec` in `run_specs.py` to retrieve a `ScenarioSpec`
for your scenario using a class name corresponding to the Python path of
the class (e.g. `benchmark.scenarios.your_scenario.YourScenario`) and any
arguments which must be passed as a dictionary of `args`.
8. Have the `get_specname_spec` function retrieve an `AdapterSpec` for your
scenario specifying the type of language model generation which must be
performed for the task.
9. Identify the appropriate metric for your task in one of the `*_metrics.py` files.
If the metric you'd like to use does not exist, follow the directions in [Adding new metrics](#adding-new-metrics).
Many will be in `basic_metrics.py`.
10. Have a `get_metric_spec` function retrieve one or more `MetricSpec`
objects for your task, specifying the classname with the Python path of
the object, with the same arguments as the `ScenarioSpec` constructor.
11. Have the `get_specname_spec` function return a `RunSpec` object, with a
`name` corresponding to the scenario name and any patterns to match in
curly braces, a `scenario_spec`, an `adapter_spec`, `metric_specs`,
and `groups`.
12. Add the scenario to `__init__.py`.
13. Attempt to run your task with
`venv/bin/benchmark-run -r yourscenarioname:arg=value`, where
`yourscenarioname` matches the `name` specified in `YourScenario`.
14. Add the spec to dictionary `CANONICAL_RUN_SPEC_FUNCS` in `run_specs.py`.
15. Update `src/proxy/static/contamination.yaml` with models that we trained on your scenario (i.e. contaminated).
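
For steps 2–6, a minimal sketch of what `YourScenario` might look like is shown below. The import path (`benchmark.scenarios.scenario`) and the exact `Instance`/`Reference` constructor arguments are assumptions based on the descriptions above, so check `scenario.py` for the real signatures.

```python
# Minimal, illustrative scenario sketch -- not actual project code. Class
# attributes and constructor arguments are assumptions to verify against
# scenario.py.
from typing import List

from benchmark.scenarios.scenario import (
    CORRECT_TAG,
    TEST_SPLIT,
    Instance,
    Reference,
    Scenario,
)


class YourScenario(Scenario):
    """Toy scenario: answer capital-city questions."""

    name = "your_scenario"
    description = "Answer simple capital-city questions."
    tags = ["question_answering"]

    def __init__(self):
        pass

    def get_instances(self) -> List[Instance]:
        # Each Instance has an input, one or more References (the correct one
        # tagged with CORRECT_TAG), and a split.
        return [
            Instance(
                input="What is the capital of France?",
                references=[Reference(output="Paris", tags=[CORRECT_TAG])],
                split=TEST_SPLIT,
            )
        ]
```

Steps 7–11 then wrap this class in a `ScenarioSpec`, an `AdapterSpec`, and one or more `MetricSpec`s inside the `RunSpec` returned by your `get_specname_spec` function in `run_specs.py`.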


## Adding new metrics

To add a new metric:
1. If the metric is task-specific, create a new `yourtask_metrics.py` file.
Otherwise, if the metric is generic and likely to be widely used, add it
to `basic_metrics.py`.
2. If you are creating a task-specific metric, create a `YourTaskMetric`
which inherits from `Metric` in `metric.py`.
3. Define methods `__init__` and `evaluate_generation` returning a list of `Stat` objects (see the sketch after this list).
4. Each `Stat` should correspond to a distinct aggregate measurement over the generated examples.
Some may have one metric (e.g. accuracy), while others may quantify multiple aspects
(e.g. multiple distance metrics).
5. For each `value` generated for a `Stat`, add it to `yourstat` using `yourstat.add(value)`.
Usually, there will only be one value for each `Stat`, but multiple can be used, e.g. to show variance.
6. Add your metric to `__init__.py`.
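
Below is a minimal sketch of a task-specific metric. The import paths and the argument names of `evaluate_generation` are assumptions; copy the actual signature from `Metric` in `metric.py` or from an existing metric in `basic_metrics.py`.

```python
# Illustrative task-specific metric -- not actual project code. The
# evaluate_generation signature shown here is an assumption.
from typing import List

from benchmark.metrics.metric import Metric
from benchmark.metrics.metric_name import MetricName
from benchmark.metrics.statistic import Stat


class YourTaskMetric(Metric):
    """Measures how often completions contain a target keyword."""

    def __init__(self, keyword: str):
        self.keyword = keyword

    def evaluate_generation(self, adapter_spec, request_state, metric_service, eval_cache_path) -> List[Stat]:
        stat = Stat(MetricName("keyword_match"))
        for completion in request_state.result.completions:
            # One value per completion; Stat aggregates into min/mean/max/sum.
            stat.add(1 if self.keyword in completion.text else 0)
        return [stat]
```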

## Data augmentations

To apply data augmentation, create a `DataAugmenterSpec` with a list of
`PerturbationSpec`s and pass it into `RunSpec`. The following is an
example:

```python
data_augmenter_spec = DataAugmenterSpec(
    perturbation_specs=[
        PerturbationSpec(
            class_name="benchmark.augmentations.perturbation.ExtraSpacePerturbation",
            args={"num_spaces": 5},
        )
    ],
    should_perturb_references=False,
    should_augment_train_instances=False,
    should_include_original_train=False,
    should_augment_eval_instances=True,
    should_include_original_eval=True,
)
run_spec = RunSpec(
    ...
    data_augmenter_spec=data_augmenter_spec
)
```

In the example above, the `DataPreprocessor` will augment the set of evaluation instances by perturbing
the original set of instances with the `ExtraSpacePerturbation`, where spaces in the text are
replaced with `num_spaces` number of spaces.

We currently only support applying a single perturbation to an instance; chaining
multiple perturbations and applying them to a single instance is not supported.

### Adding a new perturbation

To add a new perturbation to the framework, create a new file at `src/benchmark/augmentations` with the name
`<Name of perturbation>_perturbation.py` e.g., `typo_perturbation.py`. Inside the file, create a new class
(name it `<Name of the perturbation>Perturbation` e.g., `TypoPerturbation`)
that extends the abstract class `Perturbation` and implement the `perturb` method which
takes in text and outputs the perturbed text.
Add your new perturbation to `src/benchmark/__init__.py`.
Add a test for the new perturbation in `test_perturbation.py`.
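
As an illustration, a `TypoPerturbation` along these lines might look like the sketch below. The `name` attribute and the exact `perturb` signature are assumptions; mirror an existing perturbation such as `ExtraSpacePerturbation`.

```python
# Illustrative perturbation that swaps adjacent characters to simulate typos.
# The Perturbation interface details here are assumptions.
import random

from benchmark.augmentations.perturbation import Perturbation


class TypoPerturbation(Perturbation):
    name: str = "typo"

    def __init__(self, prob: float = 0.1):
        self.prob = prob

    def perturb(self, text: str) -> str:
        rng = random.Random(0)  # fixed seed so the perturbation is reproducible
        chars = list(text)
        i = 0
        while i < len(chars) - 1:
            if rng.random() < self.prob:
                # Swap this character with its neighbor.
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
                i += 2
            else:
                i += 1
        return "".join(chars)
```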

## Supporting new Hugging Face tokenizers

1. Give the tokenizer a name. Use the same name that's used in Hugging Face (e.g., "EleutherAI/gpt-j-6B").
2. In `HuggingFaceTokenizers`, we load and cache tokenizers in memory. Add logic to handle
the tokenizer in the `load_tokenizer` method.
3. Add a test in `test_huggingface_tokenizer.py` to make sure we can load the tokenizer from Hugging Face.
4. Add a new class `<Name of tokenizer>WindowService` in file `<Name of tokenizer>_window_service.py`.
Follow what we did for `GPTJWindowService` (see the sketch after this list).
5. Import the new `WindowService` and map the model(s) to it in `WindowServiceFactory`.
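
For reference, a new window service usually just pins down a few tokenizer-specific values. The sketch below is illustrative only: the `LocalWindowService` base class, its import path, and the property names are assumptions, so copy the actual structure from `GPTJWindowService`.

```python
# Illustrative window service for a hypothetical tokenizer
# "your-org/your-tokenizer". Base class, import path, and property names are
# assumptions to check against GPTJWindowService.
from benchmark.window_services.local_window_service import LocalWindowService


class YourTokenizerWindowService(LocalWindowService):
    @property
    def tokenizer_name(self) -> str:
        return "your-org/your-tokenizer"

    @property
    def max_sequence_length(self) -> int:
        return 2048

    @property
    def max_request_length(self) -> int:
        return self.max_sequence_length + 1

    @property
    def end_of_text_token(self) -> str:
        return "<|endoftext|>"

    @property
    def prefix_token(self) -> str:
        return self.end_of_text_token
```
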
120 changes: 120 additions & 0 deletions docs/deployment.md
@@ -0,0 +1,120 @@
# Deploying to production (for maintainers)

The production version of the proxy is running on `crfm-models.stanford.edu`;
you need to request permission to get SSH access.

## One-time setup

This has already been done, but just for the record:

    laptop:$ ssh crfm-models.stanford.edu
    crfm-models:$ cd /home
    crfm-models:$ git clone git@github.com:stanford-crfm/benchmarking
    crfm-models:$ cd benchmarking
    crfm-models:$ mkdir prod_env
    crfm-models:$ echo '{"api_key": "crfm"}' > prod_env/accounts.jsonl
    laptop:$ rsync -arvz prod_env/credentials.conf crfm-models.stanford.edu:/home/benchmarking/prod_env

## Perspective API

We use Google's [Perspective API](https://www.perspectiveapi.com) to calculate the toxicity of completions.
To send requests to PerspectiveAPI, we need to generate an API key from GCP. Follow the
[Get Started guide](https://developers.perspectiveapi.com/s/docs-get-started)
to request the service and the [Enable the API guide](https://developers.perspectiveapi.com/s/docs-enable-the-api)
to generate the API key. Once you have a valid API key, add an entry to `credentials.conf`:

```
perspectiveApiKey: <Generated API key>
```

By default, Perspective API allows only 1 query per second. Fill out this
[form](https://developers.perspectiveapi.com/s/request-quota-increase) to increase the request quota.

The [current API key](https://console.cloud.google.com/apis/api/commentanalyzer.googleapis.com/overview?authuser=1&project=hai-gcp-models)
we are using in production was created with the `hai-gcp-models` account and allows 200 queries per second.
**The API key expires on 7/30/2022.**

## SSL

The SSL certificate, CSR, and private key for crfm-models.stanford.edu are stored at `/home/ssl`.
**The current SSL certificate expires on 12/30/2022.**

To renew the SSL certificate, follow these steps:

1. Fill out this [form](https://certificate.stanford.edu/cert-request):
    1. Log on with your SUNet ID. You must be an admin in order to submit a request.
    1. For `Server Name`, put `crfm-models.stanford.edu`.
    1. For `Server type`, select `OTHER`.
    1. For `Contact group/mailman address`, enter your Stanford email address.
    1. Under `Copy and paste your CSR`, paste the content of `/home/ssl/public.csr`.
    1. Leave the optional fields blank and click `Submit`.
    1. You should receive your certificate by email within 2 business days.
2. Once you receive the SSL cert, concatenate the contents of `X509 Certificate only, Base64 encoded`
with the contents of `X509 Intermediates/root only Reverse, Base64 encoded`
and place it at path `/home/ssl/crfm-models.crt`. `crfm-models.crt` should look something like this:

```text
-----BEGIN CERTIFICATE-----
(Your Primary SSL certificate: .crt)
-----END CERTIFICATE-----
-----BEGIN CERTIFICATE-----
(Your Intermediate certificate: reversed.crt)
-----END CERTIFICATE-----
```
3. Restart the server.
4. Open the [website](https://crfm-models.stanford.edu) in a browser and verify the connection is secure.

### Misplaced private key or CSR

If, for whatever reason, the private key or CSR is misplaced, generate new ones by running:

`sudo openssl req -new -nodes -newkey rsa:2048 -keyout private.key -out public.csr`

and fill out the form:

```text
Country Name (2 letter code) [AU]:US
State or Province Name (full name) [Some-State]:California
Locality Name (eg, city) []:Stanford
Organization Name (eg, company) [Internet Widgits Pty Ltd]:Stanford University
Organizational Unit Name (eg, section) []:CRFM
Common Name (e.g. server FQDN or YOUR name) []:crfm-models.stanford.edu
Email Address []:
Please enter the following 'extra' attributes
to be sent with your certificate request
A challenge password []:
An optional company name []:
```

Then, follow the steps above to request a new SSL certificate.

## Deployment

Every time we need to deploy, do the following.

Update the code:

    laptop:$ ssh crfm-models.stanford.edu
    crfm-models:$ cd /home/benchmarking
    crfm-models:$ git pull
    crfm-models:$ ./pre-commit.sh

If everything looks okay:

    ssh crfm-models.stanford.edu

    # Switch into the screen session
    crfm-models:$ screen -r deploy

    # Hit ctrl-c to kill the existing process
    # Restart the server
    sudo venv/bin/proxy-server -p 443 --ssl-key-file /home/ssl/private.key --ssl-cert-file /home/ssl/crfm-models.crt --workers 16 &> server.log

    # Detach from the screen session: ctrl-a d

The recommended number of Gunicorn workers is twice the number of cores;
crfm-models.stanford.edu has 8 cores (verified with `nproc`), so we use 8 * 2 = 16 workers.

Double check that the [website](https://crfm-models.stanford.edu) still works.
The server logs can be streamed by running: `tail -f /home/benchmarking/server.log`.