v0.2.3 - 2023-07-25
- Added BigCode StarCoder (#1506)
- Added OPT 1.3B and 6.7B (#1468)
- Added OpenAI gpt-3.5-turbo-0613 (#1667), gpt-3.5-turbo-16k-0613, gpt-4-0613, gpt-4-32k-0613 (#1468), gpt-4-0314, gpt-4-32k-0314 (#1457)
- Added OpenAI text-embedding-ada-002 (#1711)
- Added Writer Palmyra (#1669, #1491)
- Added Anthropic Claude (#1484)
- Added Databricks Koala on Together (#1701)
- Added Stability AI StableLM and Together RedPajama on Together
- Added legal summarization scenarios (#1454)
- Fixed corner cases in window service truncation (#1449)
- Pinned file order for the ICE and APPS (code) scenarios (#1352)
- Fixed random seed for entity matching scenario (#1475)
- Added Spider text-to-SQL (#1385)
- Added Vicuna scenario (#1641), Koala scenario (#1642), open_assistant scenario (#1622), and Anthropic-HH-RLHF scenario (#1643) for instruction-following
- Added verifiability judgement scenario (#1518)
- Fixed bug in multi-choice exact match calculation when scores are tied (#1494)
- Added script for estimating the cost of a run suite (#1480)
- Added support for human critique evaluation using Surge AI (#1330), Scale AI (#1609), and Amazon Mechanical Turk (#1539)
- Added support for LLM critique evaluation (#1627)
- Decreased running time of `helm-summarize` (#1716)
- Added `SlurmRunner` for distributing `helm-run` jobs over Slurm (#1550)
- Migrated to the `setuptools.build_meta` backend (#1535)
- Stopped non-retriable errors (e.g. content filter errors) from being retried (#1533)
- Added logging for stack trace and exception message when retries occur (#1555)
- Added file locking for `ensure_file_downloaded()` (#1692)
- Added evaluation results for AI21 Jurassic-2 and Writer Palmyra
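The file-locking change above (#1692) prevents concurrent processes from clobbering each other while downloading the same file. A minimal sketch of the general pattern, assuming a Unix `fcntl` advisory lock; function names here are illustrative, not HELM's actual API:

```python
import fcntl
import os

def with_file_lock(path: str, fn):
    """Run fn() while holding an exclusive advisory lock on path + '.lock'."""
    lock_path = path + ".lock"
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            return fn()
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)

def ensure_file_downloaded(path: str, download) -> None:
    """Download to path at most once, even across concurrent processes."""
    def task():
        # Re-check under the lock: another process may have finished first.
        if not os.path.exists(path):
            tmp_path = path + ".tmp"
            with open(tmp_path, "w") as f:
                f.write(download())
            os.replace(tmp_path, path)  # atomic rename into place
    with_file_lock(path, task)
```

Writing to a temporary file and atomically renaming it means readers never observe a partially written download, even if a process dies mid-write.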
v0.2.2 - 2023-03-30
- Added Cohere Command (#1321)
- Added Flan-T5 (#1398)
- Added H3 (#1398)
- Added GPT-NeoXT-Chat-Base-20B (#1407)
- Added OpenAI gpt-3.5-turbo-0301 (#1401)
- Added AI21 Jurassic-2 models (#1409)
- Improved the LEXTREME and LexGLUE legal scenarios (#1429)
- Added OpinionsQA scenario (#1424)
- Added multilabel classification metrics (#1408)
- Fixed `--exit-on-error` not working and added `--skip-completed-runs` (#1400)
- Disabled tqdm in non-interactive mode (#1351)
- Added plotting (#1403, #1411)
- Added Hugging Face Model Hub integration (#1103)
- Added evaluation results for Cohere Command and Aleph Alpha Luminous
v0.2.1 - 2023-02-24
- Added BigCode SantaCoder (#1312)
- Added LEXTREME and LexGLUE legal scenarios (#1216)
- Added WMT14 machine translation scenario (#1329)
- Added biomedical scenarios: COVID Dialogue, MeQSum, MedDialog, MedMCQA, MedParagraphSimplification, MedQA, PubMedQA (#1332)
- Added `--run-specs` flag to `helm-run` (#1302)
- Reduced running time of `helm-summarize` (#1269)
- Added classification metrics (#1368)
- Updated released JSON assets to conform to current JSON schema
v0.2.0 - 2023-01-11
- Added Aleph Alpha's Luminous models (#1215)
- Added AI21's J1-Grande v2 beta model (#1177)
- Added OpenAI's ChatGPT model (#1231)
- Added OpenAI's text-davinci-003 model (#1200)
- Added filtering by subject and level for MATHScenario (#1137)
- Reduced frontend JSON file sizes (#1185)
- Added table sorting in frontend (#832)
- Fixed frontend bugs for certain adapter methods (#1236, #1237)
- Fixed frontend bugs for runs with multiple trials (#1211)
- Improved sampling of in-context examples (#1172)
- Internal refactor (#1280)
- Added average win-rate computation for model-v-scenario tables (#1240)
- Added additional calibration metrics as a "Targeted evaluation" (#1247)
- Added documentation to Read the Docs (#1159, #1164)
- Breaking schema change: `input` of `Instance` and `output` of `Reference` are now objects (#1280)
v0.1.0 - 2022-11-17
- Initial release