
Bridging the Gap: Enhancing LLM Performance for Low-Resource African Languages with New Benchmarks, Fine-Tuning, and Cultural Adjustments

Authors: Tuka Alhanai [email protected], Adam Kasumovic [email protected], Mohammad Ghassemi [email protected], Aven Zitzelberger [email protected], Jessica Lundin [email protected], Guillaume Chabot-Couture [email protected]

This repository contains the benchmarks, results, and all the code required to reproduce the results, tables, and figures presented in our paper.

More specifically, this repository contains:

  1. Translated Winogrande Benchmarks: Human and machine translations of Winogrande into 8 African languages: Shona, Igbo, Bambara, Amharic, Sepedi, Sesotho, Setswana, and Tsonga (as well as preexisting translations into Afrikaans, Zulu, and Xhosa).
  2. Translated MMLU-Clinical Benchmarks: Human and machine translations of the clinical sections of MMLU ("college medicine", "clinical knowledge", and "virology") into 8 African languages: Shona, Igbo, Bambara, Amharic, Sepedi, Sesotho, Setswana, and Tsonga. Preexisting translations into Afrikaans, Zulu, and Xhosa cover the "clinical knowledge" and "college medicine" sections; this release adds our translations of the "virology" section into those three languages as well.
  3. Human Annotation of Winogrande: Human annotations of the Winogrande dataset assessing translation quality and appropriateness in 11 African languages: Shona, Igbo, Bambara, Amharic, Sepedi, Sesotho, Setswana, Tsonga, Afrikaans, Zulu, and Xhosa.
  4. Scripts to Reproduce Results: Code used to regenerate the results, tables, and figures presented in the paper.

Note that Meta's Belebele benchmark, in English as well as Shona, Igbo, Bambara, Amharic, Sepedi, Sesotho, Setswana, Tsonga, Afrikaans, Zulu, and Xhosa, is also included, since it was used as an African-language evaluation benchmark in our experiments.

A PDF version of a slideshow presentation summarizing our work is also provided in this repository.

Our translated datasets can also be accessed on HuggingFace at the following link: https://huggingface.co/datasets/Institute-Disease-Modeling/mmlu-winogrande-afr
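
For convenience, the datasets can also be loaded programmatically with the Hugging Face datasets library. The snippet below is a minimal sketch: the repository ID comes from the link above, but the configuration name is hypothetical, so consult the dataset card for the actual configurations and splits.

from datasets import load_dataset

# Load one of the translated benchmarks from Hugging Face. The config name
# "winogrande_s_sn" is hypothetical -- check the dataset card for the real
# options.
dataset = load_dataset("Institute-Disease-Modeling/mmlu-winogrande-afr", "winogrande_s_sn")
print(dataset)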

Contents

data/: Contains data files used by scripts in the scripts/ folder. This folder includes the translated evaluation benchmarks as well.

scripts/: Contains Python scripts to generate the results, tables, and figures presented in the paper (where they can be re-created from the provided data). Instructions on how to run the scripts can be found inside the folder.

results/: Contains the output from the Python files in scripts/, and has an identical folder tree structure to scripts/.

utils/: Contains utility functions used by the scripts.

Each of these folders contains an additional README with more detailed information.

Benchmarks Download

The evaluation benchmarks (MMLU-Clinical [called mmlu_cm_ck_vir/], Winogrande [called winogrande_s/], and Belebele [called belebele/]) in English and 11 African languages are available as a ZIP file.

The ZIP file also includes machine translations of the benchmarks (see folders suffixed with _mt), as well as their backtranslations (see folders suffixed with _bt).

Within each benchmark directory, files are suffixed with ISO Language Codes (or no suffix for the original English) to denote the language the file is in (or was translated from for backtranslated versions). These language codes are used throughout the project to denote languages.
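
As an illustration of these conventions, the sketch below resolves a benchmark file for a given language. The ISO codes are the standard ones for these languages and the ".csv" extension is a placeholder; verify both against the actual file names in the ZIP before relying on them.

# Standard ISO 639 codes for the project's languages (an assumption --
# confirm against the actual file suffixes in the ZIP).
ISO_CODES = {
    "Shona": "sn", "Igbo": "ig", "Bambara": "bm", "Amharic": "am",
    "Sepedi": "nso", "Sesotho": "st", "Setswana": "tn", "Tsonga": "ts",
    "Afrikaans": "af", "Zulu": "zu", "Xhosa": "xh",
}

def benchmark_file(benchmark_dir, stem, language=None):
    """Build a name like 'winogrande_s/dev_sn.csv'; no suffix means English."""
    suffix = "_" + ISO_CODES[language] if language else ""
    return f"{benchmark_dir}/{stem}{suffix}.csv"  # ".csv" is a placeholder extension

print(benchmark_file("winogrande_s", "dev", "Shona"))  # winogrande_s/dev_sn.csv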

Installation

Note that if cloning this repository, be sure to run git lfs install first (installing Git LFS with apt-get update; apt-get install git-lfs if it isn't already present). This is necessary to actually download the full data files needed to run many of the scripts. The repository can then be cloned with git clone https://github.com/InstituteforDiseaseModeling/Bridging-the-Gap-Low-Resource-African-Languages.git.

1. Set up the Python environment:

Feel free to use your IDE to automatically do this.

python3 -m venv venv  # Set up virtual environment
source venv/bin/activate  # Activate virtual environment
pip install --upgrade pip  # Update pip
pip install -r requirements.txt  # Install packages
conda install -c conda-forge firefox geckodriver  # Run this command to get Bokeh working for creating one of the figures

2. Set up config.py file (gitignored by default) containing secrets:

Note that this step is only required if you want to run scripts that incur monetary costs (e.g., calling OpenAI's Batch API, which is not free).

cp config_template.py config.py

You will need to edit the newly created config.py, filling in the string for your OpenAI API key:

gpt_api_key = "your-OpenAI-GPT-API-key-here"

Your OpenAI API key can be created or found in your OpenAI account settings: https://platform.openai.com/api-keys
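
As a sketch of how a script might consume this secret, the example below reads the key from config.py and issues a single chat completion using the official openai Python package (v1+). The model name and prompt are placeholders, and the repository's actual scripts may call the API differently (e.g., via the Batch API).

from openai import OpenAI

import config  # the gitignored file created above

# The model and prompt below are placeholders, not the repo's actual settings.
client = OpenAI(api_key=config.gpt_api_key)
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Translate 'Good morning' into Shona."}],
)
print(response.choices[0].message.content)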

If you would like to run every results-reproducing script that does not incur a monetary cost, simply run the following:

./run_everything.sh

The scripts run in parallel but are not compute-heavy. The results can be viewed in results/. If you are using version control, only the figures should show up as "modified", due to how PDFs/SVGs are saved; the regenerated figures will be visually identical to their predecessors.

Disclaimer

The code in this repository was developed by IDM, the Bill & Melinda Gates Foundation, and Ghamut Corporation to further research in Large Language Models (LLMs) for low-resource African languages by allowing them to be evaluated on question-answering and commonsense reasoning tasks, like those commonly available in English. We’ve made it publicly available under the MIT License to provide others with a better understanding of our research and an opportunity to build upon it for their own work. We make no representations that the code works as intended or that we will provide support, address issues that are found, or accept pull requests. You are welcome to create your own fork and modify the code to suit your own modeling needs as contemplated under the MIT License.

Acknowledgments

This repository includes data derived from the following datasets, each subject to its respective license (copied from the corresponding GitHub repository):

  1. MMLU Dataset
    • GitHub Repository: https://github.com/hendrycks/test
    • License: LICENSE-MMLU
    • For more licensing details, see the license terms specified in the file.
    • Citation (see below):
      @article{hendryckstest2021,
        title={Measuring Massive Multitask Language Understanding},
        author={Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt},
        journal={Proceedings of the International Conference on Learning Representations (ICLR)},
        year={2021}
      }
      
      @article{hendrycks2021ethics,
        title={Aligning AI With Shared Human Values},
        author={Dan Hendrycks and Collin Burns and Steven Basart and Andrew Critch and Jerry Li and Dawn Song and Jacob Steinhardt},
        journal={Proceedings of the International Conference on Learning Representations (ICLR)},
        year={2021}
      }
      
  2. Winogrande Dataset
    • GitHub Repository: https://github.com/allenai/winogrande
    • License: LICENSE-Winogrande
    • For more licensing details, see the license terms specified in the file.
    • Citation (see below):
       @article{sakaguchi2019winogrande,
         title={WinoGrande: An Adversarial Winograd Schema Challenge at Scale},
         author={Sakaguchi, Keisuke and Bras, Ronan Le and Bhagavatula, Chandra and Choi, Yejin},
         journal={arXiv preprint arXiv:1907.10641},
         year={2019}
       }
      
  3. Belebele Dataset
    • GitHub Repository: https://github.com/facebookresearch/belebele
    • Licenses: LICENSE-Belebele-CC-BY-NC4.0 and LICENSE-Belebele-CC-BY-SA4.0
    • For more licensing details, see the license terms specified in the files.
    • Citation (see below):
      @inproceedings{bandarkar-etal-2024-belebele,
        title = "The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants",
        author = "Bandarkar, Lucas  and
          Liang, Davis  and
          Muller, Benjamin  and
          Artetxe, Mikel  and
          Shukla, Satya Narayan  and
          Husa, Donald  and
          Goyal, Naman  and
          Krishnan, Abhinandan  and
          Zettlemoyer, Luke  and
          Khabsa, Madian",
        booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
        month = aug,
        year = "2024",
        address = "Bangkok, Thailand and virtual meeting",
        publisher = "Association for Computational Linguistics",
        url = "https://aclanthology.org/2024.acl-long.44",
        pages = "749--775",
      }
      

Please note that the licenses for the included datasets are separate from and may impose additional restrictions beyond the repository's main license.

Citation

If you find this repository useful, please consider citing it:

@article{alhanai2024bridging,
  title={Bridging the Gap: Enhancing LLM Performance for Low-Resource African Languages with New Benchmarks, Fine-Tuning, and Cultural Adjustments},
  author={Tuka Alhanai and Adam Kasumovic and Mohammad Ghassemi and Aven Zitzelberger and Jessica Lundin and Guillaume Chabot-Couture},
  year={2024}
}
