Open-Sourcing as promised.

milangritta authored Jul 23, 2024
1 parent 0cc7512 commit 12a4da6
Showing 16 changed files with 3,367 additions and 0 deletions.
9 changes: 9 additions & 0 deletions NLP/HumanRankEval/LICENSE.md
@@ -0,0 +1,9 @@
MIT License

Copyright (c) 2023, Huawei Technologies Co., Ltd

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
119 changes: 119 additions & 0 deletions NLP/HumanRankEval/README.md
@@ -0,0 +1,119 @@
## HumanRankEval: Automatic Evaluation of Alignment with Human Preferences

#### This repository is based on the [EleutherAI LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness), big thanks!

This project provides a framework for evaluating generative language models (Seq2seq models are also supported via AutoHF) on HumanRankEval (HRE).
If you find it helpful, please cite the **HumanRankEval** [paper](LINK_TO_BE_ADDED).

- Supported Topics: **python, java, unix, cpp, html, english, physics, latex, soft_eng, stats, cs_db, languages_sciences, apple_android, math**
- Supported Models: **AutoHF (single- and multi-GPU runs implemented, see below)**
- Supported DeepSpeed Inference: **Tensor Parallel, Kernel Injection and/or DS ZeRO3** (see the sketch below)
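
These options map onto DeepSpeed's inference and ZeRO engines. The sketch below is only a hedged illustration of how tensor parallelism and kernel injection are typically enabled through DeepSpeed's `init_inference` API; it is not this repository's exact code path, and the checkpoint name is a placeholder.

```python
import os

import deepspeed
import torch
from transformers import AutoModelForCausalLM

model_name = "facebook/opt-350m"  # placeholder checkpoint, not prescribed by this repo
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# Shard the model across GPUs (tensor parallel) and, where supported, swap in
# DeepSpeed's fused inference kernels (kernel injection).
engine = deepspeed.init_inference(
    model,
    mp_size=int(os.getenv("WORLD_SIZE", "1")),  # tensor-parallel degree
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)
model = engine.module  # then score/generate with it as a regular HuggingFace model
```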

### Installation (PyTorch)

Create an environment with conda or virtualenv and then run the following command:

```bash
pip install -r requirements.txt
```

### Installation (MindSpore)

You **additionally** need to install [MindSpore](https://www.mindspore.cn/install/en) and [MindNLP](https://github.com/mindspore-lab/mindnlp).
We provide an example in ```lm_eval.models.mindspore``` for OPT (facebook) models that can be extended to additional LLMs.

### Dataset

The HRE dataset is hosted on [HuggingFace Datasets](https://huggingface.co/datasets/huawei-noah/human_rank_eval).
It is downloaded and loaded automatically with: ```load_dataset("huawei-noah/human_rank_eval")```
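
For a quick, standalone look at the data (independent of the evaluation harness), here is a minimal sketch using the `datasets` library; no column names are assumed, they are simply printed:

```python
from datasets import load_dataset

# Fetches the HRE data from the HuggingFace Hub on first use and caches it locally.
data = load_dataset("huawei-noah/human_rank_eval")

# List the available topic splits with their sizes and schemas.
for name, split in data.items():
    print(name, len(split), split.column_names)
```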

### Running HumanRankEval

Set **MODEL_DIR=/your/path/to/models/** and **DATA_PATH=/your/path/to/HumanRankEvalData/**.

> 💡 Check out ```evaluate.sh``` for full details 💡
>
The following command runs Pythia-410M on HRE on a single GPU (device 2; see **evaluate.sh**):
```bash
deepspeed --include localhost:2 main.py \
--model auto_hf \
--tasks human_rank_eval_* \
--model_args pretrained=${MODEL_DIR}Pythia-410M \
--batch_size 8 \
--data_path ${DATA_PATH}
```

The output should look like this:

| Task | Metric |Value |
|----------------------------------|-------------|-----:|
|human_rank_eval_apple_android |pearson_corr |0.0860|
|human_rank_eval_cpp |pearson_corr |0.1351|
|human_rank_eval_cs_db |pearson_corr |0.0646|
|human_rank_eval_english |pearson_corr |0.1193|
|human_rank_eval_html |pearson_corr |0.1055|
|human_rank_eval_java |pearson_corr |0.1044|
|human_rank_eval_languages_sciences|pearson_corr |0.1201|
|human_rank_eval_latex |pearson_corr |0.1648|
|human_rank_eval_math |pearson_corr |0.1405|
|human_rank_eval_physics |pearson_corr |0.1118|
|human_rank_eval_python |pearson_corr |0.0778|
|human_rank_eval_soft_eng |pearson_corr |0.0769|
|human_rank_eval_stats |pearson_corr |0.1100|
|human_rank_eval_unix |pearson_corr |0.0967|
|=== HumanRankEval Score === |Micro Average|0.1081|
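
The final line is the overall HumanRankEval score. As a hedged illustration only (not the repository's exact implementation, and with made-up topic sizes), assuming the micro average weights each topic's Pearson correlation by its number of questions rather than averaging the per-topic scores directly:

```python
# Hypothetical per-topic results: (pearson_corr, number_of_questions).
# The question counts are invented for illustration; they are not the real HRE topic sizes.
results = {
    "python": (0.0778, 500),
    "math": (0.1405, 300),
    "latex": (0.1648, 200),
}

total = sum(n for _, n in results.values())
micro = sum(score * n for score, n in results.values()) / total     # question-weighted
macro = sum(score for score, _ in results.values()) / len(results)  # unweighted topic mean
print(f"micro average: {micro:.4f}  macro average: {macro:.4f}")
```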

The following command runs Vicuna-7B on HRE on all GPUs with tensor parallelism (the default).
```bash
deepspeed --num_gpus ${NUM_GPUs} main.py \
--model auto_hf \
--tasks human_rank_eval_* \
--model_args pretrained=${MODEL_DIR}Vicuna-7B \
--data_path ${DATA_PATH} \
--batch_size 4 \
--world_size ${NUM_GPUs}
```
The output should look like this:

| Task | Metric |Value |
|----------------------------------|-------------|-----:|
|human_rank_eval_apple_android |pearson_corr |0.1310|
|human_rank_eval_cpp |pearson_corr |0.1657|
|human_rank_eval_cs_db |pearson_corr |0.1043|
|human_rank_eval_english |pearson_corr |0.1468|
|human_rank_eval_html |pearson_corr |0.1430|
|human_rank_eval_java |pearson_corr |0.1670|
|human_rank_eval_languages_sciences|pearson_corr |0.1571|
|human_rank_eval_latex |pearson_corr |0.1743|
|human_rank_eval_math |pearson_corr |0.1257|
|human_rank_eval_physics |pearson_corr |0.1114|
|human_rank_eval_python |pearson_corr |0.1402|
|human_rank_eval_soft_eng |pearson_corr |0.0962|
|human_rank_eval_stats |pearson_corr |0.1629|
|human_rank_eval_unix |pearson_corr |0.1289|
|=== HumanRankEval Score === |Micro Average|0.1396|

Evaluating a MindSpore model on a single topic can be done as follows:

```bash
python main.py --model mindspore \
--tasks human_rank_eval_math \
--data_path ${DATA_PATH} \
--model_args pretrained=opt-350m \
--batch_size 4
```

You should see the following output:

| Task | Metric |Value|
|---------------------------|-------------|----:|
|human_rank_eval_math |pearson_corr |0.078|
|=== HumanRankEval Score ===|Micro Average|0.078|

## License

This project is released under the MIT License. Please see the [License](./LICENSE) file for more information.

Disclaimer: This open-source project is not an official Huawei product, and Huawei is not expected to provide support for it.
32 changes: 32 additions & 0 deletions NLP/HumanRankEval/THIRD_PARTY_OPEN_SOURCE_SOFTWARE_NOTICE.md
@@ -0,0 +1,32 @@
Please note we provide an open source software notice for the third party open source software
along with this software and/or this software component contributed by Huawei (in the following just “this SOFTWARE”).
The open source software licenses are granted by the respective right holders.

Warranty Disclaimer
THE OPEN SOURCE SOFTWARE IN THIS SOFTWARE IS DISTRIBUTED IN THE HOPE THAT IT WILL BE USEFUL,
BUT WITHOUT ANY WARRANTY, WITHOUT EVEN THE IMPLIED WARRANTY OF MERCHANTABILITY OR FITNESS
FOR A PARTICULAR PURPOSE. SEE THE APPLICABLE LICENSES FOR MORE DETAILS.

Copyright Notice and License Texts
Software: Language Model Evaluation Harness (https://github.com/EleutherAI/lm-evaluation-harness)
Copyright (c) 2020 EleutherAI

MIT License

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
43 changes: 43 additions & 0 deletions NLP/HumanRankEval/evaluate.sh
@@ -0,0 +1,43 @@
#!/usr/bin/env bash

# Copyright (C) 2023. Huawei Technologies Co., Ltd. All rights reserved.
#
# Licensed under MIT License (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://opensource.org/license/mit
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================

NUM_GPUs=4
DATA_PATH="/path/to/HumanRankEvalData/"
MODEL_DIR="/path/to/models/"

#---------------------------------------------------------------

deepspeed --num_gpus ${NUM_GPUs} main.py \
--model auto_hf \
--tasks human_rank_eval_* \
--model_args pretrained=${MODEL_DIR}Vicuna-7B \
--data_path ${DATA_PATH} \
--batch_size 4 \
--world_size ${NUM_GPUs}

#deepspeed --include localhost:2 main.py \
# --model auto_hf \
# --tasks human_rank_eval_* \
# --model_args pretrained=${MODEL_DIR}Pythia-410M \
# --batch_size 8 \
# --data_path ${DATA_PATH}

#python main.py --model mindspore \
# --tasks human_rank_eval_math \
# --data_path ${DATA_PATH} \
# --model_args pretrained=opt-350m \
# --batch_size 4
Empty file.