We use tango and catwalk to build the evaluation pipeline; the catwalk code lives in its own repository.
The evaluation pipeline runs as a cross product of the models to be evaluated and the task sets to evaluate them on.
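As a rough illustration (not the pipeline's actual code), each model in the experiment config is paired with each task set; the model paths and task set names below are made up:

```python
# Illustrative sketch only: the real pipeline presumably builds tango steps
# for each (model, task set) pair.
from itertools import product

models = ["gs://my-bucket/model-a", "s3://my-bucket/model-b"]  # hypothetical model paths
task_sets = ["gen_tasks", "rc20_tasks"]                        # task set names under experiments/task_sets

for model, task_set in product(models, task_sets):
    print(f"evaluate {model!r} on {task_set!r}")
```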
- Ensure that the model paths are present in a `gs://` or `s3://` location.
- Copy `evaluation/experiments/test_config.jsonnet` to `evaluation/experiment_YYYY_MM_DD.jsonnet`.
- Add models and choose the relevant task sets from `experiments/task_sets`.
Set the required environment variables:

```bash
export GITHUB_TOKEN="<your token>"          # Needed for beaker to clone the repo.
export GOOGLE_TOKEN="<google credentials>"  # If you are using a GS workspace (or simply run `gcloud auth login`).
```
To write results to a google sheet:

- Share the google sheet with [email protected].
- Create an API json key and download it from the Google Cloud console.
- Add a beaker secret:

```python
from tango.integrations.beaker.common import get_client

beaker = get_client("<beaker_workspace>")

# Store the service-account credentials as a beaker secret so that
# beaker jobs can write results to the google sheet.
with open("credentials_file.json") as f:
    beaker.secret.write("GDRIVE_SERVICE_ACCOUNT_JSON", f.read())
```

When running locally, you can instead export the credentials directly:

```bash
export GDRIVE_SERVICE_ACCOUNT_JSON=$(cat credentials_file.json)
```
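To sanity-check that the secret was stored, you can read it back with the same client; this is a sketch that assumes the installed beaker-py version exposes a `secret.read` method:

```python
from tango.integrations.beaker.common import get_client

beaker = get_client("<beaker_workspace>")

# Read the secret back; `secret.read` is assumed to return the stored string.
stored = beaker.secret.read("GDRIVE_SERVICE_ACCOUNT_JSON")
print(f"secret holds {len(stored)} characters")
```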
To run the pipeline locally:

```bash
tango run evaluation/experiments/test_config.jsonnet -w your-local-workspace --include-package evaluation.steps
```
To run the pipeline on beaker:

- Update `evaluation/tango-in-beaker.yml` (the fields that should be updated are marked).
- Run:

```bash
tango --settings evaluation/tango-in-beaker.yml run evaluation/experiments/test_config.jsonnet
```
If you specify `gsheet` in your config, results will be appended to the google sheet.
All intermediate and final results will also be saved to the specified workspace, and can be accessed as follows:
```python
from tango import Workspace

workspace = Workspace.from_url("gs://your-workspace-url")

# Fetch the output of a step by its step name.
result = workspace.step_result("combine-all-outputs")
```
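For quick inspection, one option is to load the combined output into pandas; this sketch assumes the step output is a list of per-(model, task) records, which may not match the actual structure:

```python
import pandas as pd

# Assumption: `result` is a list of dicts (one row per model/task pair).
# If the step returns a different structure, adapt accordingly.
df = pd.DataFrame(result)
print(df.head())
```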
A task set is of the form:

```jsonnet
{
    name: "<Name of the task set>",
    tasks: [
        {
            task_name: "<One of the tasks present in `TASKS_LM` or `TASKS`>",
            task_kwargs: "<task-specific kwargs (See eval_suite for examples)>",
            prediction_kwargs: "<kwargs on how to evaluate the model on this task>"
        }
    ]
}
```
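For concreteness, here is a hypothetical filled-in task set written as a plain Python structure that mirrors the jsonnet form above; the task name and kwargs are made up, and real definitions live in the `.libsonnet` files under `experiments/task_sets`:

```python
# Hypothetical example mirroring the jsonnet form above; not a real task set.
example_task_set = {
    "name": "my_small_rc_set",
    "tasks": [
        {
            "task_name": "boolq",                   # must be a key in TASKS or TASKS_LM
            "task_kwargs": {},                      # task-specific overrides, if any
            "prediction_kwargs": {"num_shots": 0},  # hypothetical prediction setting
        }
    ],
}
```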
- Add new task sets under `evaluation/experiments/task_sets` (current full sets: `gen_tasks.libsonnet`, `eval_suite_ppl_val_v2_small.libsonnet`, `rc20_tasks.libsonnet`, `summary_tasks.libsonnet`).
- The list of potential tasks can be seen by running:

```bash
python evaluation/see_available_tasks.py
```
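Alternatively, you can inspect catwalk's task registry directly. This is a sketch assuming `TASKS` is importable from `catwalk.tasks`; the location of `TASKS_LM` may differ:

```python
# List the task names that catwalk knows about.
from catwalk.tasks import TASKS

for name in sorted(TASKS):
    print(name)
```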
To add a new perplexity dataset:

- Add the new set under our current ppl data at `/net/nfs.cirrascale/allennlp/akshitab/eval_data`.
- Add the name of the folder to `experiments/task_sets/eval_suite_ppl_val_v2_small.libsonnet`.
- See `gen_tasks.libsonnet` for a simple example.
(TODO: catwalk needs better documentation on adding new tasks).