Adding MMLU and Winogrande human-translated into 11 African languages #3237
Looks good overall, thanks! I have some minor suggestions; let me know if you'd like to make those changes.
"""Run spec functions for three clinical sections of MMLU human-translated into 11 African languages

Available subjects: "clinical_knowledge", "college_medicine", "virology"
Available langs: "af", "zu", "xh", "am", "bm", "ig", "nso", "sn", "st", "tn", "ts" (see lang_map below for language code mapping to language name, or here for ISO code reference: https://huggingface.co/languages)
nit: link to https://www.loc.gov/standards/iso639-2/php/code_list.php or https://iso639-3.sil.org/code_tables/639/data as a better reference for ISO 639 codes.
    'af': 'Afrikaans',
    'zu': 'Zulu',
    'xh': 'Xhosa',
    'am': 'Amharic',
    'bm': 'Bambara',
    'ig': 'Igbo',
    'nso': 'Sepedi',
    'sn': 'Shona',
    'st': 'Sesotho',
    'tn': 'Setswana',
    'ts': 'Tsonga',
side note: the format here is "use the two-letter code if available, otherwise use the three-letter code". This seems fine to me (since three-letter codes are rarely used), but note that another alternative would be to just use the three-letter codes for all languages.
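For illustration, the convention described in this side note can be sketched as follows. The dictionary entries are copied from the diff above; the constant name `LANG_MAP` and the helper `resolve_lang` are hypothetical and not part of the PR.

```python
from typing import Dict

# Entries copied from the PR's lang_map. The convention: use the two-letter
# code where one exists, otherwise fall back to the three-letter code
# (here only "nso" is a three-letter code).
LANG_MAP: Dict[str, str] = {
    "af": "Afrikaans",
    "zu": "Zulu",
    "xh": "Xhosa",
    "am": "Amharic",
    "bm": "Bambara",
    "ig": "Igbo",
    "nso": "Sepedi",
    "sn": "Shona",
    "st": "Sesotho",
    "tn": "Setswana",
    "ts": "Tsonga",
}


def resolve_lang(code: str) -> str:
    """Hypothetical helper: map a language code to its name, failing loudly."""
    try:
        return LANG_MAP[code]
    except KeyError:
        raise ValueError(
            f"Unknown language code {code!r}; expected one of {sorted(LANG_MAP)}"
        ) from None
```

Switching to three-letter codes for every language, the alternative mentioned above, would only change the keys of this mapping.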
@@ -0,0 +1,64 @@
"""Run spec functions for Winogrande human-translated into 11 African languages

Available langs: "af", "zu", "xh", "am", "bm", "ig", "nso", "sn", "st", "tn", "ts" (see lang_map below for language code mapping to language name, or here for ISO code reference: https://huggingface.co/languages)
nit: link to https://www.loc.gov/standards/iso639-2/php/code_list.php or https://iso639-3.sil.org/code_tables/639/data as a better reference for ISO 639 codes.

def download_mmlu_clinical_afr(self, path: str):
    ensure_file_downloaded(
        source_url="https://github.com/InstituteforDiseaseModeling/Bridging-the-Gap-Low-Resource-African-Languages/raw/refs/heads/main/data/evaluation_benchmarks_afr_release.zip",
nit: pin the URL to a specific commit hash so that the data does not change when the repository is updated: https://raw.githubusercontent.com/InstituteforDiseaseModeling/Bridging-the-Gap-Low-Resource-African-Languages/9af6ce2f5df8171a64d58ced2032761396bfb2ad/data/evaluation_benchmarks_afr_release.zip
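A minimal sketch of this suggestion, assuming the raw.githubusercontent.com URL layout shown in the comment. The repository, commit hash, and file path are taken from the comment; the helper `pinned_raw_url` and the constant names are hypothetical.

```python
REPO = "InstituteforDiseaseModeling/Bridging-the-Gap-Low-Resource-African-Languages"
COMMIT = "9af6ce2f5df8171a64d58ced2032761396bfb2ad"  # hash from the review comment
DATA_PATH = "data/evaluation_benchmarks_afr_release.zip"


def pinned_raw_url(repo: str, commit: str, path: str) -> str:
    """Hypothetical helper: build a raw-content URL fixed to one commit.

    Unlike a URL that goes through refs/heads/main, this one keeps pointing
    at the same file contents after the branch moves on.
    """
    return f"https://raw.githubusercontent.com/{repo}/{commit}/{path}"


SOURCE_URL = pinned_raw_url(REPO, COMMIT, DATA_PATH)
```

The resulting string would replace the `source_url` argument of `ensure_file_downloaded` in the diff.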

def download_winogrande_afr(self, path: str):
    ensure_file_downloaded(
        source_url="https://github.com/InstituteforDiseaseModeling/Bridging-the-Gap-Low-Resource-African-Languages/raw/refs/heads/main/data/evaluation_benchmarks_afr_release.zip",
nit: pin the URL to a specific commit hash so that the data does not change when the repository is updated: https://raw.githubusercontent.com/InstituteforDiseaseModeling/Bridging-the-Gap-Low-Resource-African-Languages/9af6ce2f5df8171a64d58ced2032761396bfb2ad/data/evaluation_benchmarks_afr_release.zip
description = "Winogrande (S) translated into 11 African low-resource languages"
tags = ["knowledge", "multiple_choice", "low_resource_languages"]

def __init__(self, lang: str = "af"):
nit: don't set default values here
description = "Massive Multitask Language Understanding (MMLU) translated into 11 African low-resource languages"
tags = ["knowledge", "multiple_choice", "low_resource_languages"]

def __init__(self, subject: str = "clinical_knowledge", lang: str = "af"):
nit: don't set default values here
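One way to read this suggestion is sketched below: the constructor takes both arguments without defaults, so every run has to state its subject and language explicitly. The class `MMLUClinicalAfrSketch` is a hypothetical stand-in that does not inherit from any HELM base class; the subject and language values come from the docstring in the diff, and the validation step is an added assumption rather than something the review asks for.

```python
from typing import FrozenSet

# Values listed in the PR's module docstring.
SUBJECTS: FrozenSet[str] = frozenset(
    {"clinical_knowledge", "college_medicine", "virology"}
)
LANGS: FrozenSet[str] = frozenset(
    {"af", "zu", "xh", "am", "bm", "ig", "nso", "sn", "st", "tn", "ts"}
)


class MMLUClinicalAfrSketch:
    """Hypothetical stand-in for the PR's scenario class (illustration only)."""

    def __init__(self, subject: str, lang: str):
        # No default values: omitting either argument raises TypeError
        # instead of silently running "clinical_knowledge" in "af".
        if subject not in SUBJECTS:
            raise ValueError(f"Unknown subject {subject!r}")
        if lang not in LANGS:
            raise ValueError(f"Unknown language code {lang!r}")
        self.subject = subject
        self.lang = lang
```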
unpack_type='unzip'
)

def process_csv(self, csv_path: str, split: str, pseudo_split: str) -> List[Instance]:
nit: rename split and pseudo_split to helm_split and source_split / csv_split for clarity
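As a rough sketch of what the renamed signature could look like, written as a standalone function rather than a method. Everything beyond the parameter names is an assumption: `InstanceSketch` is a hypothetical stand-in for HELM's `Instance`, and the row layout (question, four choices, answer letter, no header row) follows the original MMLU CSV files and may not match the translated data.

```python
import csv
from dataclasses import dataclass
from typing import List


@dataclass
class InstanceSketch:
    """Hypothetical stand-in for HELM's Instance; fields are illustrative."""

    question: str
    choices: List[str]
    answer: str
    helm_split: str    # split label used on the HELM side
    source_split: str  # split name used by the source CSV files


def process_csv(csv_path: str, helm_split: str, source_split: str) -> List[InstanceSketch]:
    """Read one CSV file and tag each row with both split names."""
    instances: List[InstanceSketch] = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            # Assumed layout: question, A, B, C, D, answer letter.
            if len(row) != 6:
                raise ValueError(
                    f"{csv_path} ({source_split}): expected 6 columns, got {len(row)}"
                )
            question, *choices, answer = row
            instances.append(
                InstanceSketch(question, choices, answer, helm_split, source_split)
            )
    return instances
```

With the two names spelled out, a call such as `process_csv(path, helm_split="train", source_split="dev")` shows at a glance which value belongs to which side.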
Could you also run the linter: pip install black==24.3.0 mypy==1.5.1 flake8==5.0.4 then run |
Description
This pull request introduces the first low-resource-language translations to be added to the HELM benchmark: three medical subjects of Massive Multitask Language Understanding (MMLU) and the Winogrande Small dataset, each human-translated into 11 African languages. The project aims to bridge the gap for low-resource African languages and to provide an evaluation resource for researchers and developers working in this area.
Repository
https://github.com/InstituteforDiseaseModeling/Bridging-the-Gap-Low-Resource-African-Languages
Dataset
https://huggingface.co/datasets/Institute-Disease-Modeling/mmlu-winogrande-afr
Paper
https://arxiv.org/pdf/2412.12417
Highlights