forked from THUDM/CodeGeeX
-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Release cross-platform source code and weights
- Loading branch information
1 parent
4253120
commit 07b8d7f
Showing
56 changed files
with
10,410 additions
and
196 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,37 +1,71 @@ | ||
<img src="resources/logo/codegeex_logo.png"> | ||
|
||
<p align="center"> | ||
🏠 <a href="https://models.aminer.cn/codegeex" target="_blank">Homepage</a> | 📖 <a href="http://keg.cs.tsinghua.edu.cn/codegeex/" target="_blank">Blog</a> | 🪧 <a href="https://models.aminer.cn/codegeex/playground" target="_blank">DEMO</a> | 🛠 <a href="https://marketplace.visualstudio.com/items?itemName=aminer.codegeex" target="_blank">VS Code Extension</a> | 📃 Paper(Coming soon!) | 🌐 <a href="README_zh.md" target="_blank">中文</a> | ||
🏠 <a href="https://models.aminer.cn/codegeex" target="_blank">Homepage</a> | 📖 <a href="http://keg.cs.tsinghua.edu.cn/codegeex/" target="_blank">Blog</a> | 🪧 <a href="https://models.aminer.cn/codegeex/playground" target="_blank">DEMO</a> | 🤖 <a href="https://models.aminer.cn/codegeex/download/request" target="_blank">Download Model</a> | 📃 Paper(Coming soon!) | | ||
</p> | ||
<p align="center"> | ||
🛠 <a href="https://marketplace.visualstudio.com/items?itemName=aminer.codegeex" target="_blank">VS Code Extension</a> | 🌐 <a href="README_zh.md" target="_blank">中文</a> | ||
</p> | ||
|
||
 | ||
 | ||
 | ||
 | ||
 | ||
|
||
# CodeGeeX: A Multilingual Code Generative Model | ||
We introduce CodeGeeX, a large-scale multilingual code generative model with 13 billion parameters, pre-trained on a large code corpus of more than 20 programming languages. As of **June 22**, 2022, CodeGeeX has been trained on more than 850 billion tokens on a cluster of 1,536 [Ascend 910 AI Processors](https://e.huawei.com/en/products/servers/ascend). CodeGeeX has several unique features: | ||
# CodeGeeX: A Multilingual Code Generation Model | ||
|
||
We introduce CodeGeeX, a large-scale multilingual code generation model with 13 billion parameters, pre-trained on a large code corpus of more than 20 programming languages. As of **June 22**, 2022, CodeGeeX has been trained on more than 850 billion tokens on a cluster of 1,536 [Ascend 910 AI Processors](https://e.huawei.com/en/products/servers/ascend). CodeGeeX has several unique features: | ||
* **Multilingual Code Generation**: CodeGeeX has good performance for generating executable programs in several mainstream programming languages, including Python, C++, Java, JavaScript, Go, etc. [DEMO](https://models.aminer.cn/codegeex) | ||
* **Crosslingual Code Translation**: CodeGeeX supports the translation of code snippets between different languages. Simply by one click, CodeGeeX can transform a program into any expected language with a high accuracy. [DEMO](https://models.aminer.cn/codegeex/codeTranslator) | ||
* **Customizable Programming Assistant**: CodeGeeX is available in the VS Code extension marketplace **for free**. It supports code completion, explanation, summarization and more, which empower users with a better coding experience. [VS Code Extension](https://marketplace.visualstudio.com/items?itemName=aminer.codegeex) | ||
* **Open-Source and Cross-Platform**: All codes and model weights will be made publicly available for research purposes. We have also been working on the adaptation to other GPU platforms, which will be ready soon. | ||
* **Open-Source and Cross-Platform**: All codes and model weights are publicly available for research purposes. CodeGeeX supports both Ascend and NVIDIA platforms. It supports inference in a single Ascend 910, NVIDIA V100 or A100. [Apply Model Weights](https://models.aminer.cn/codegeex/download/request) | ||
|
||
**HumanEval-X for Realistic Multilingual Benchmarking.** To help standardize the evaluation of multilingual code generation and translation, we develop and release the **HumanEval-X** Benchmark. HumanEval-X is a new multilingual benchmark that contains **820 human-crafted** coding problems in **5** programming languages (Python, C++, Java, JavaScript, and Go), each of these problems is associated with tests and solutions. [Usage](codegeex/benchmark/README.md) | ||
**HumanEval-X for Realistic Multilingual Benchmarking.** To help standardize the evaluation of multilingual code generation and translation, we develop and release the **HumanEval-X** Benchmark. HumanEval-X is a new multilingual benchmark that contains **820 human-crafted** coding problems in **5** programming languages (Python, C++, Java, JavaScript, and Go), each of these problems is associated with tests and solutions. [Usage](codegeex/benchmark/README.md) [🤗 Available in HuggingFace](https://huggingface.co/datasets/THUDM/humaneval-x) | ||
|
||
<img src="resources/en/hx_boxplot.png"> | ||
|
||
<p align="center"><i>CodeGeeX achieves the highest average performance compared with other open-sourced multilingual baselines.</i> </p> | ||
|
||
## News | ||
|
||
* **2022-09-30**: We release the cross-platform source code and models weghts for both Ascend and NVIDIA platforms. | ||
|
||
## Getting Started | ||
|
||
CodeGeeX is initially implemented in Mindspore and trained Ascend 910 AI Processors. We provide a torch-compatible version based on [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) to facilitate usage on GPU platforms. | ||
### Installation | ||
|
||
Download and install ``codegeex`` via: | ||
Python 3.7+ / CUDA 11+ / PyTorch 1.10+ / DeepSpeed 0.6+ are required. Install ``codegeex`` package via: | ||
```bash | ||
git clone [email protected]:THUDM/CodeGeeX.git | ||
cd CodeGeeX | ||
pip install -e . | ||
``` | ||
|
||
### Model Weights | ||
|
||
Apply and download model weights through this [link](https://models.aminer.cn/codegeex/download/request). You'll receive by mail ```urls.txt``` that contains temporary download links. We recommend you to use [aria2](https://aria2.github.io/) to download it via the following command (Please make sure you have enough disk space to download the checkpoint (~26GB)): | ||
```bash | ||
aria2c -x 16 -s 16 -j 4 --continue=true -i urls.txt | ||
``` | ||
Run the following command to get the full model weights: | ||
```bash | ||
cat codegeex_13b.tar.gz.part.* > codegeex_13b.tar | ||
tar xvf codegeex_13b.tar.gz | ||
``` | ||
|
||
### Inference on GPUs | ||
|
||
CodeGeeX is initially implemented in Mindspore and trained on Ascend 910 AI Processors. In addition to the support on the Ascend platform, we have been also working on adapting the model to other GPU platforms, which will be ready soon. | ||
Have a try on generating the first program with CodeGeeX. First, specify the path of the model weights in ``configs/codegeex_13b.sh``. Second, write the prompt (natural language description or code snippet) into a file, e.g., ``tests/test_prompt.txt``, then run the following script: | ||
```bash | ||
bash ./scripts/test_inference.sh <GPU_ID> ./tests/test_prompt.txt | ||
``` | ||
|
||
### VS Code Extension Guidance | ||
|
||
Based on CodeGeeX, we also develop a free VS Code extention, search "codegeex" in Marketplace or install it [here](https://marketplace.visualstudio.com/items?itemName=aminer.codegeex). Detailed instructions can be found in | ||
[CodeGeeX Extension Guidance](vscode-extension/README.md). | ||
|
||
## CodeGeeX: Architecture, Code Corpus, and Implementation | ||
|
||
|
@@ -46,7 +80,7 @@ CodeGeeX is initially implemented in Mindspore and trained on Ascend 910 AI Proc | |
**Training**: We implement CodeGeeX in [Mindspore 1.7](https://www.mindspore.cn/) and train it on 1,536 Ascend 910 AI Processor (32GB). The model weights are under FP16 format, except that we use FP32 for layer-norm and softmax for higher precision and stability. The entire model consumes about 27GB of memory. To increase the training efficiency, we adopt an 8-way model parallel training together with 192-way data parallel training, with ZeRO-2 optimizer enabled. The micro-batch size is 16 and the global batch size reaches 3,072. Moreover, we adopt techniques to further boost the training efficiency including the element-wise operator fusion, fast gelu activation, matrix multiplication dimension optimization, etc. The entire training process takes nearly two months, spanning from April 18 to June 22, 2022, during which 850B tokens were passed for training, i.e., 5+ epochs. | ||
|
||
## HumanEval-X: A new benchmark for Multilingual Program Synthesis | ||
To better evaluate the multilingual ability of code generative models, we propose a new benchmark HumanEval-X. While previous works evaluate multilingual program synthesis under semantic similarity (e.g., [CodeBLEU](https://arxiv.org/abs/2009.10297)) which is often misleading, HumanEval-X evaluates the functional correctness of the generated programs. HumanEval-X consists of 820 high-quality human-crafted data samples (each with test cases) in Python, C++, Java, JavaScript, and Go, and can be used for various tasks. | ||
To better evaluate the multilingual ability of code generation models, we propose a new benchmark HumanEval-X. While previous works evaluate multilingual program synthesis under semantic similarity (e.g., [CodeBLEU](https://arxiv.org/abs/2009.10297)) which is often misleading, HumanEval-X evaluates the functional correctness of the generated programs. HumanEval-X consists of 820 high-quality human-crafted data samples (each with test cases) in Python, C++, Java, JavaScript, and Go, and can be used for various tasks. | ||
|
||
<img src="resources/en/hx_tasks.png"> | ||
|
||
|
@@ -60,7 +94,7 @@ In HumanEval-X, every sample in each language contains declaration, docstring, a | |
<p align="center"><i><b>Left</b>: the detailed pass@k (k=1,10,100) performance on code generation task for five languages in HumanEval-X. <b>Right</b>: the average performance of all languages of each model. CodeGeeX achieves the highest average performance compared with InCoder-6.7B, CodeGen-Multi-6B and CodeGen-Multi-16B.</i></p> | ||
|
||
|
||
We compare CodeGeeX with two other open-sourced code generative models, [InCoder](https://github.com/dpfried/incoder) (from Meta) and [CodeGen](https://github.com/salesforce/CodeGen) (from Salesforce). Specifically, InCoder-6.7B, CodeGen-Multi-6B and CodeGen-Multi-16B are considered. CodeGeeX significantly outperforms models with smaller scales (by 7.5%~16.3%) and is competitive with CodeGen-Multi-16B with a larger scale (average performance 54.76% vs. 54.39%). CodeGeeX achieves the best average performance across languages. We further investigate the effect of distributing sampling budgets to different languages. Using a simple heuristic that distributes budgets weighted by the training data distribution, CodeGeeX achieves a higher pass rate than any single language (indicated by the <span style='color:red'>red dotted circle</span>). | ||
We compare CodeGeeX with two other open-sourced code generation models, [InCoder](https://github.com/dpfried/incoder) (from Meta) and [CodeGen](https://github.com/salesforce/CodeGen) (from Salesforce). Specifically, InCoder-6.7B, CodeGen-Multi-6B and CodeGen-Multi-16B are considered. CodeGeeX significantly outperforms models with smaller scales (by 7.5%~16.3%) and is competitive with CodeGen-Multi-16B with a larger scale (average performance 54.76% vs. 54.39%). CodeGeeX achieves the best average performance across languages. | ||
|
||
### Crosslingual Code Translation | ||
|
||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,37 +1,70 @@ | ||
<img src="resources/logo/codegeex_logo.png"> | ||
|
||
<p align="center"> | ||
🏠 <a href="https://models.aminer.cn/codegeex/zh-CN" target="_blank">主页</a> | 📖 <a href="http://keg.cs.tsinghua.edu.cn/codegeex/index_zh.html" target="_blank">博客</a> | 🪧 <a href="https://models.aminer.cn/codegeex/zh-CN/playground" target="_blank">示例</a> | 🛠 <a href="https://marketplace.visualstudio.com/items?itemName=aminer.codegeex" target="_blank">VS Code插件</a> | 📃 论文(即将推出!)| 🌐 <a href="https://github.com/THUDM/CodeGeeX/blob/main/README.md" target="_blank">English</a> | ||
🏠 <a href="https://models.aminer.cn/codegeex/zh-CN" target="_blank">主页</a> | 📖 <a href="http://keg.cs.tsinghua.edu.cn/codegeex/index_zh.html" target="_blank">博客</a> | 🪧 <a href="https://models.aminer.cn/codegeex/zh-CN/playground" target="_blank">示例</a> | 🤖 <a href="https://models.aminer.cn/codegeex/download/request" target="_blank">模型下载</a> | 📃 论文(即将推出!) | ||
</p> | ||
<p align="center"> | ||
🛠 <a href="https://marketplace.visualstudio.com/items?itemName=aminer.codegeex" target="_blank">VS Code插件</a> | 📒 <a href="https://github.com/THUDM/CodeGeeX/blob/main/api/README_zh.md" target="_blank">API申请</a> | ⌨️ <a href="https://models.aminer.cn/codegeex/static/xdaivscodegeex.b65f1404.png" target="_blank">加入开发者群</a> | 🌐 <a href="https://github.com/THUDM/CodeGeeX/blob/main/README.md" target="_blank">English</a> | ||
</p> | ||
|
||
 | ||
 | ||
 | ||
 | ||
 | ||
|
||
# CodeGeeX: 多语言代码生成模型 | ||
|
||
CodeGeeX是一个具有130亿参数的多编程语言代码生成预训练模型。CodeGeeX采用华为MindSpore框架实现,在鹏城实验室“鹏城云脑II”中的192个节点(共1536个国产[昇腾910 AI处理器](https://e.huawei.com/cn/products/servers/ascend))上训练而成。截至2022年6月22日,CodeGeeX历时两个月在20多种编程语言的代码语料库(>8500亿Token)上预训练得到。CodeGeeX有以下特点: | ||
* **高精度代码生成**:支持生成Python、C++、Java、JavaScript和Go等多种主流编程语言的代码,在HumanEval-X代码生成任务上取得47%~60%求解率,较其他开源基线模型有更佳的平均性能。[代码生成示例](https://models.aminer.cn/codegeex/zh-CN) | ||
* **跨语言代码翻译**:支持代码片段在不同编程语言间进行自动翻译转换,翻译结果正确率高,在HumanEval-X代码翻译任务上超越了其它基线模型。[代码翻译示例](https://models.aminer.cn/codegeex/zh-CN/codeTranslator) | ||
* **自动编程插件**:CodeGeeX插件现已上架VSCode插件市场(完全免费),用户可以通过其强大的少样本生成能力,自定义代码生成风格和能力,更好辅助代码编写。[插件下载](https://marketplace.visualstudio.com/items?itemName=aminer.codegeex) | ||
* **模型跨平台开源**: 所有代码和模型权重将会开源,用作研究用途。我们正在适配除昇腾外的其它平台,并将在短期内开源。 | ||
* **模型跨平台开源**: 所有代码和模型权重开源开放,用作研究用途。CodeGeeX同时支持昇腾和英伟达平台,可在单张昇腾910或英伟达V100/A100上实现推理。[申请模型权重](https://models.aminer.cn/codegeex/download/request) | ||
|
||
**全新多编程语言评测基准HumanEval-X**:HumanEval-X是第一个支持功能正确性评测的多语言、多任务的基准,包含820个人工编写的高质量代码生成题目、测试用例与参考答案,覆盖5种编程语言(Python、C++、Java、JavaScript、Go),支持代码生成与代码翻译能力的评测。[如何使用](codegeex/benchmark/README_zh.md) | ||
|
||
<img src="resources/zh/hx_boxplot_zh.png"> | ||
|
||
<p align="center"><i>在HumanEval-X代码生成任务上,与其它开源基线模型相比,CodeGeeX取得了最佳的平均性能。</i> </p> | ||
|
||
## 新闻 | ||
|
||
* **2022-09-30**: 我们开源了跨平台代码和模型权重,同时支持昇腾和英伟达平台。 | ||
## 使用指南 | ||
|
||
CodeGeeX最初使用Mindspore框架实现,并在昇腾910AI芯片上进行训练。为适配更多平台,我们将其转换到[Megatron-LM](https://github.com/NVIDIA/Megatron-LM)框架,支持Pytorch+GPU环境。 | ||
### 安装 | ||
|
||
通过以下命令安装 ``codegeex``: | ||
需要Python 3.7+ / CUDA 11+ / PyTorch 1.10+ / DeepSpeed 0.6+,通过以下命令安装 ``codegeex``: | ||
```bash | ||
git clone [email protected]:THUDM/CodeGeeX.git | ||
cd CodeGeeX | ||
pip install -e . | ||
``` | ||
|
||
### 在GPU上进行推理 | ||
### 模型权重 | ||
|
||
通过[该链接](https://models.aminer.cn/codegeex/download/request)申请权重,您将收到一个包含临时下载链接文件```urls.txt```的邮件。推荐使用[aria2](https://aria2.github.io/)通过以下命令快速下载(请保证有足够的硬盘空间存放权重(~26GB)): | ||
```bash | ||
aria2c -x 16 -s 16 -j 4 --continue=true -i urls.txt | ||
``` | ||
使用以下命令合并得到完整的权重: | ||
```bash | ||
cat codegeex_13b.tar.gz.part.* > codegeex_13b.tar | ||
tar xvf codegeex_13b.tar.gz | ||
``` | ||
|
||
### 用GPU进行推理 | ||
|
||
尝试使用CodeGeeX模型生成第一个程序吧!首先,在配置文件``configs/codegeex_13b.sh``中写明存放权重的路径。其次,将提示(可以是任意描述或代码片段)写入文件``tests/test_prompt.txt``,运行以下脚本即可开始推理(需指定GPU序号): | ||
```bash | ||
bash ./scripts/test_inference.sh <GPU_ID> ./tests/test_prompt.txt | ||
``` | ||
|
||
### VS Code插件使用指南 | ||
|
||
基于CodeGeeX,我们开发了一款免费的VS Code插件,在应用市场搜索“codegeex”或通过[该链接](https://marketplace.visualstudio.com/items?itemName=aminer.codegeex)安装。详细的使用指南在[CodeGeeX插件使用指南](vscode-extension/README_zh.md). | ||
|
||
CodeGeeX最初使用Mindspore框架实现,并在昇腾910AI芯片上进行训练。为了支持更多的平台,我们正在迁移代码和模型,并将在近期内开源。 | ||
|
||
## CodeGeeX: 多语言代码生成模型 | ||
|
||
|
@@ -65,7 +98,7 @@ HumanEval-X中每个语言的样本,包含了声明、描述和解答,它们 | |
<p align="center"><i><b>左侧</b>: HumanEval-X中五种语言具体的pass@k(k=1,10,100)性能。<b>右侧</b>: 模型在所有语言上的平均性能。CodeGeeX的平均表现优于InCoder-6.7B和CodeGen-Multi-6B/16B。</i></p> | ||
|
||
|
||
我们将CodeGeeX与另外两个开源代码生成模型进行比较,分别为Meta的[InCoder](https://github.com/dpfried/incoder)与Salesforce的[CodeGen](https://github.com/salesforce/CodeGen),选取InCoder-6.7B、CodeGen-Multi-6B 与 CodeGen-Multi-16B。CodeGeeX能获得最佳的平均性能,显著超越了参数量更小的模型(7.5%~16.3%的提升),与参数量更大的模型CodeGen-Multi-16B表现相当(平均性能 54.76% vs. 54.39%)。除此之外,我们还探索了将生成次数分配给不同语言的效果,在按照训练数据比例分配生成次数时,CodeGeeX的正确率比其在任一单语言下的正确率都更高(如<span style='color:red'>红色虚线圈</span>所示)。 | ||
我们将CodeGeeX与另外两个开源代码生成模型进行比较,分别为Meta的[InCoder](https://github.com/dpfried/incoder)与Salesforce的[CodeGen](https://github.com/salesforce/CodeGen),选取InCoder-6.7B、CodeGen-Multi-6B 与 CodeGen-Multi-16B。CodeGeeX能获得最佳的平均性能,显著超越了参数量更小的模型(7.5%~16.3%的提升),与参数量更大的模型CodeGen-Multi-16B表现相当(平均性能 54.76% vs. 54.39%)。 | ||
|
||
### 跨语言代码翻译 | ||
|
||
|
@@ -123,3 +156,5 @@ HumanEval-X中每个语言的样本,包含了声明、描述和解答,它们 | |
|
||
[唐杰](http://keg.cs.tsinghua.edu.cn/jietang/)(清华大学知识工程实验室 & 北京智源人工智能研究院) | ||
</details> | ||
|
||
如果遇到问题或有任何建议,欢迎通过邮件与我们联系[[email protected]](mailto:[email protected]). |
Oops, something went wrong.