From 257925e12307ffbff246e51e647f2fb7b5226455 Mon Sep 17 00:00:00 2001
From: Rohit <10227643+gitkwr@users.noreply.github.com>
Date: Thu, 18 Apr 2024 12:20:04 -0400
Subject: [PATCH] Update eval_details.md

Adding missing hyperlinks
---
 eval_details.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/eval_details.md b/eval_details.md
index e62665f..63d298a 100644
--- a/eval_details.md
+++ b/eval_details.md
@@ -5,9 +5,9 @@ This document contains additional context on the settings and parameters for how
 - We are reporting macro averages for MMLU benchmarks. The micro average numbers for MMLU are: 65.4 and 67.4 for the 8B pre-trained and instruct-aligned models, 78.9 and 82.0 for the 70B pre-trained and instruct-aligned models
 - For the instruct-aligned MMLU we ask the model to generate the best choice character
 #### AGI English
-- We use the default few-shot and prompt settings as specified here. The score is averaged over the english subtasks.
+- We use the default few-shot and prompt settings as specified [here](https://github.com/ruixiangcui/AGIEval). The score is averaged over the english subtasks.
 #### CommonSenseQA
-- We use the same 7-shot chain-of-thought prompt as in Wei et al. (2022).
+- We use the same 7-shot chain-of-thought prompt as in [Wei et al. (2022)](https://arxiv.org/pdf/2201.11903.pdf).
 #### Winogrande
 - We use a choice based setup for evaluation where we fill in the missing blank with the two possible choices and then compute log-likelihood over the suffix. We use 5 shots for evaluation.
 #### BIG-Bench Hard
@@ -29,9 +29,9 @@ This document contains additional context on the settings and parameters for how
 #### HumanEval
 - Same setting as Llama 1 and Llama 2 (pass@1).
 #### GSM8K
-- We use the same 8-shot chain-of-thought prompt as in Wei et al. (2022) (maj@1).
+- We use the same 8-shot chain-of-thought prompt as in [Wei et al. (2022)](https://arxiv.org/pdf/2201.11903.pdf) (maj@1).
 #### MATH
-- We use the 4-shot problem available in Lewkowycz et al. (2022) (maj@1).
+- We use the 4-shot problem available in [Lewkowycz et al. (2022)](https://arxiv.org/pdf/2206.14858.pdf) (maj@1).
 ### Human evaluation notes
 This evaluation set contains 1,800 prompts that cover 12 key use cases: asking for advice, brainstorming, classification, closed question answering, coding, creative writing, extraction, inhabiting a character/persona, open question answering, reasoning, rewriting, and summarization.
 |Category|Count|