We use tango and catwalk to build the evaluation pipeline; the catwalk code lives in its own repository.
The evaluation pipeline runs as a cross product of the models to be evaluated and the task sets to evaluate them on.
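As a rough illustration (not the pipeline's actual code), each model in the experiment config is paired with each task set; the model paths and task set names below are made up:

```python
# Illustrative sketch only: the real pipeline presumably builds tango steps
# for each (model, task set) pair.
from itertools import product

models = ["gs://my-bucket/model-a", "s3://my-bucket/model-b"]  # hypothetical model paths
task_sets = ["gen_tasks", "rc20_tasks"]                        # task set names under experiments/task_sets

for model, task_set in product(models, task_sets):
    print(f"evaluate {model!r} on {task_set!r}")
```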
- Ensure that the model paths are present in a `gs://` or `s3://` location.
- Copy `evaluation/experiments/test_config.jsonnet` to `evaluation/experiment_YYYY_MM_DD.jsonnet`.
- Add models and choose the relevant task sets from `experiments/task_sets`.
Set the required environment variables:

```bash
export GITHUB_TOKEN="<your token>"          # Needed for beaker to clone the repo.
export GOOGLE_TOKEN="<google credentials>"  # If you are using a GS workspace (or simply run `gcloud auth login`).
```
To write results to a google sheet:

- Share the google sheet with [email protected].
- Create an API json key and download it from the Google Cloud console.
- Add a beaker secret:

```python
from tango.integrations.beaker.common import get_client

beaker = get_client("<beaker_workspace>")

# Store the service-account credentials as a beaker secret so that
# beaker jobs can write results to the google sheet.
with open("credentials_file.json") as f:
    beaker.secret.write("GDRIVE_SERVICE_ACCOUNT_JSON", f.read())
```

When running locally, you can instead export the credentials directly:

```bash
export GDRIVE_SERVICE_ACCOUNT_JSON=$(cat credentials_file.json)
```
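To sanity-check that the secret was stored, you can read it back with the same client; this is a sketch that assumes the installed beaker-py version exposes a `secret.read` method:

```python
from tango.integrations.beaker.common import get_client

beaker = get_client("<beaker_workspace>")

# Read the secret back; `secret.read` is assumed to return the stored string.
stored = beaker.secret.read("GDRIVE_SERVICE_ACCOUNT_JSON")
print(f"secret holds {len(stored)} characters")
```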
To run the pipeline locally:

```bash
tango run evaluation/experiments/test_config.jsonnet -w your-local-workspace --include-package evaluation.steps
```
To run the pipeline on beaker:

- Update `evaluation/tango-in-beaker.yml` (the fields that should be updated are marked).
- Run:

```bash
tango --settings evaluation/tango-in-beaker.yml run evaluation/experiments/test_config.jsonnet
```
If you specify `gsheet` in your config, results will be appended to the google sheet.
All intermediate and final results will also be saved to the specified workspace, and can be accessed as follows:
```python
from tango import Workspace

workspace = Workspace.from_url("gs://your-workspace-url")

# Fetch the output of a step by its step name.
result = workspace.step_result("combine-all-outputs")
```
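For quick inspection, one option is to load the combined output into pandas; this sketch assumes the step output is a list of per-(model, task) records, which may not match the actual structure:

```python
import pandas as pd

# Assumption: `result` is a list of dicts (one row per model/task pair).
# If the step returns a different structure, adapt accordingly.
df = pd.DataFrame(result)
print(df.head())
```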
A task set is of the form:

```jsonnet
{
    name: "<Name of the task set>",
    tasks: [
        {
            task_name: "<One of the tasks present in `TASKS_LM` or `TASKS`>",
            task_kwargs: "<task-specific kwargs (See eval_suite for examples)>",
            prediction_kwargs: "<kwargs on how to evaluate the model on this task>"
        }
    ]
}
```
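For concreteness, here is a hypothetical filled-in task set written as a plain Python structure that mirrors the jsonnet form above; the task name and kwargs are made up, and real definitions live in the `.libsonnet` files under `experiments/task_sets`:

```python
# Hypothetical example mirroring the jsonnet form above; not a real task set.
example_task_set = {
    "name": "my_small_rc_set",
    "tasks": [
        {
            "task_name": "boolq",                   # must be a key in TASKS or TASKS_LM
            "task_kwargs": {},                      # task-specific overrides, if any
            "prediction_kwargs": {"num_shots": 0},  # hypothetical prediction setting
        }
    ],
}
```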
- Add new task sets under `evaluation/experiments/task_sets` (current full sets: `gen_tasks.libsonnet`, `eval_suite_ppl_val_v2_small.libsonnet`, `rc20_tasks.libsonnet`, `summary_tasks.libsonnet`).
- The list of potential tasks can be seen by running:

```bash
python evaluation/see_available_tasks.py
```
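Alternatively, you can inspect catwalk's task registry directly. This is a sketch assuming `TASKS` is importable from `catwalk.tasks`; the location of `TASKS_LM` may differ:

```python
# List the task names that catwalk knows about.
from catwalk.tasks import TASKS

for name in sorted(TASKS):
    print(name)
```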
To add a new perplexity dataset:

- Add the new set under our current ppl data at `/net/nfs.cirrascale/allennlp/akshitab/eval_data`.
- Add the name of the folder to `experiments/task_sets/eval_suite_ppl_val_v2_small.libsonnet`.
- See `gen_tasks.libsonnet` for a simple example.
(TODO: catwalk needs better documentation on adding new tasks).