Commit
update new docs for Paddle hardware adaptation (#934)
* add new docs for hardware adaptation
* add new docs for Paddle hardware adaptation
* Update Maturity_Model_for_Paddle Large Model Tool Chain_Hardware_Adaptation.md
* Delete working_groups/飞桨大模型工具链适配认证标准.md
* Delete working_groups/飞桨芯片适配认证标准.md
* Update Maturity_Model_for_Paddle Large Model Tool Chain_Hardware_Adaptation.md
* Update Maturity_Model_for_Paddle Large Model Tool Chain_Hardware_Adaptation.md
* Update Maturity_Model_for_Paddle_Hardware_Adaptation.md
* formatting

  Signed-off-by: Zhang Jun <[email protected]>
* minor

  Signed-off-by: Zhang Jun <[email protected]>

---------

Signed-off-by: Zhang Jun <[email protected]>
Co-authored-by: Zhang Jun <[email protected]>
Showing 2 changed files with 193 additions and 0 deletions.
80 changes: 80 additions & 0 deletions
..._groups/Maturity_Model_for_Paddle Large Model Tool Chain_Hardware_Adaptation.md
@@ -0,0 +1,80 @@
## Paddle Large Model Toolchain Adaptation Certification Standard

<table border="2" >
<tr >
<td width="10%" rowspan="2">Level</td>
<td colspan="2">Fine-tuning: SFT + LoRA</td>
<td colspan="2">Pretraining</td>
<td rowspan="2">DPO</td>
<td colspan="3">Inference</td>
</tr>
<tr >
<td> Model </td>
<td> Performance requirement </td>
<td> Model </td>
<td> Performance requirement </td>
<td> Model </td>
<td> Supported data types </td>
<td> Performance requirement </td>
</tr>
<tr >
<td>Level I</td>
<td >LLaMA1-13B</td>
<td rowspan="3">None</td>
<td rowspan="3">LLaMA1-13B</td>
<td>tokens/TFLOPS (mean over the first 1000 steps) reaches 20% of A100/A800</td>
<td rowspan="4">To be defined</td>
<td> LLaMA1-13B</td>
<td>FP16/BF16</td>
<td>QPS/TFLOPS with first-token latency not exceeding 1 s reaches 20% of A800</td>
</tr>
<tr >
<td>Level II</td>
<td >Qwen2-14B <br>SD (SFT only)</td>
<td>tokens/TFLOPS (mean over the first 1000 steps) reaches 40% of A100/A800</td>
<td>Qwen2-14B<br>SD</td>
<td> int8 (weight only)</td>
<td>QPS/TFLOPS with first-token latency not exceeding 1 s reaches 40% of A800</td>
</tr>
<tr >
<td>Level III</td>
<td>LLaMA3-70B <br>Qwen2-57B-A14B<br>(SFT only)</td>
<td>tokens/TFLOPS (mean over the first 1000 steps) reaches 60% of A100/A800</td>
<td>LLaMA3-70B<br>GPT-3-175B (performance only)<br>Qwen2-57B-A14B</td>
<td> PTQ int8 (int8 * int8)<br>int4 (weight only)</td>
<td>QPS/TFLOPS with first-token latency not exceeding 1 s reaches 40% of A800</td>
</tr>
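<tr >
<td>Acceptance requirements</td>
<td colspan="2"> Model quality:<br>
After fine-tuning on the specified supervised dataset with the given hyperparameters (both the SFT and LoRA scenarios), generate with deterministic greedy-search decoding and evaluate with the ROUGE metric on the given validation set. Compared with the baseline accelerator, the quality metric must be on par with the GPU result (within ±1%), and the human evaluation result must be on par with the GPU result.</td>
<td colspan="2">Training accuracy:<br>
Launch the pretraining task on a corpus on the order of 100 GB with the specified hyperparameters (learning rate, batch size, total steps, maximum sequence length, and so on).<br>
•Early-stage accuracy check: train from the given initial model with training randomness removed, average the training loss over every 20 steps across the first 1000 steps, and compare with the GPU training result; the relative error must be on par;<br>
•Late-stage accuracy check: evaluate the loss of the converged model on the specified validation set; the absolute error against the GPU must be below 1e-2;<br>
Model quality:<br>
•Evaluate the converged model on the specified dataset; accuracy must be on par with the GPU result (within ±1%);<br>
Training performance:<br>
•Tokens processed per TFLOPS (mean over the first 1000 steps), measured against the benchmark provided by Baidu as the baseline (per-level requirements are listed above);<br>
Stability:<br>
•No crash occurs between launching pretraining and completing the training task;<br>
•If the task takes too long to complete, at least 14 consecutive days of multi-node training without a crash must be guaranteed;<br>
•If a crash does occur during training, the loss must decrease steadily without spikes after a hot restart from the given checkpoint;</td>
<td colspan="3">Model quality:<br>
•Evaluate the model on the specified dataset; accuracy must be on par with the GPU result (within ±1%), and the human evaluation result must be on par with the GPU result;<br>
Inference performance:<br>
•QPS per TFLOPS (first-token latency not exceeding 1 s), measured against the benchmark provided by Baidu as the baseline (per-level requirements are listed above).</td>
</tr>
</table>
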
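The two pretraining accuracy checks in the acceptance row can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and `rel_tol` is an assumed tolerance, since the standard says "on par" without giving a number for the early-stage check. Only the 20-step window, the 1000-step range, and the 1e-2 absolute bound come from the table.

```python
from statistics import fmean


def window_means(losses, window=20):
    """Average consecutive, non-overlapping windows of `window` steps."""
    return [
        fmean(losses[i:i + window])
        for i in range(0, len(losses) - window + 1, window)
    ]


def early_loss_matches(device_losses, gpu_losses, steps=1000, window=20, rel_tol=0.01):
    """Early-stage check: compare 20-step mean losses over the first `steps` steps.

    rel_tol is a hypothetical tolerance, not a value from the standard.
    """
    dev = window_means(device_losses[:steps], window)
    gpu = window_means(gpu_losses[:steps], window)
    if not dev or len(dev) != len(gpu):
        raise ValueError("both runs must provide the same, non-zero number of windows")
    return all(abs(d - g) <= rel_tol * abs(g) for d, g in zip(dev, gpu))


def final_loss_matches(device_loss, gpu_loss, abs_tol=1e-2):
    """Late-stage check: absolute error of the converged validation loss below 1e-2."""
    return abs(device_loss - gpu_loss) < abs_tol
```

Both runs are assumed to start from the same initial model with randomness removed, as the table requires; otherwise the window-by-window comparison is not meaningful.
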
## Notes
- [1] Each level adds model requirements on top of the level below it; pretraining certification requires meeting the fine-tuning certification requirements of the same level.
- [2] Recommended open-source models for LLM adaptation:

| Model | Code location |
|:------|:-------:|
| GPT-3 | https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/config/gpt-3 |
| LLaMA | https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/config/llama |
| Qwen | https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/config/qwen |
- [3] Recommended open-source model for text-to-image adaptation: SD https://github.com/PaddlePaddle/PaddleMIX/blob/develop/ppdiffusers/README.md
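
The per-level pretraining performance thresholds in the table (20%, 40%, 60% of the A100/A800 baseline) compare throughput normalized by compute. A minimal sketch of that comparison is shown below. It assumes that "tokens/TFLOPS" means measured tokens per second divided by the chip's peak TFLOPS, which the standard does not spell out; the function names and all numbers in the usage lines are hypothetical.

```python
# Minimum ratio to the A100/A800 baseline for each pretraining level (from the table).
PRETRAIN_THRESHOLDS = {"I": 0.20, "II": 0.40, "III": 0.60}


def tokens_per_tflops(tokens_per_second, peak_tflops):
    """Throughput normalized by peak compute (assumed definition)."""
    if peak_tflops <= 0:
        raise ValueError("peak_tflops must be positive")
    return tokens_per_second / peak_tflops


def meets_pretrain_level(device_tps, device_peak_tflops,
                         baseline_tps, baseline_peak_tflops, level):
    """Return True if the device reaches the level's share of the baseline."""
    device = tokens_per_tflops(device_tps, device_peak_tflops)
    baseline = tokens_per_tflops(baseline_tps, baseline_peak_tflops)
    return device / baseline >= PRETRAIN_THRESHOLDS[level]


# Illustrative numbers only: throughputs are means over the first 1000 steps.
print(meets_pretrain_level(1000.0, 200.0, 3120.0, 312.0, "II"))
print(meets_pretrain_level(1000.0, 200.0, 3120.0, 312.0, "III"))
```

Normalizing by peak TFLOPS lets chips of different sizes be compared on efficiency rather than raw speed, which appears to be the intent of the metric.
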
113 changes: 113 additions & 0 deletions
working_groups/Maturity_Model_for_Paddle_Hardware_Adaptation.md
@@ -0,0 +1,113 @@
## Paddle Chip Adaptation Certification Standard

<table border="2" >
<tr >
<td rowspan="2">Hardware type</td>
<td rowspan="2">Adaptation item</td>
<td colspan="4">Adaptation requirements</td>
</tr>
<tr >
<td> Level I </td>
<td> Level II </td>
<td> Level III</td>
<td> Mature commercial level </td>
</tr>
<tr >
<td width="10%" rowspan="6">AI training chip</td>
<td>Number of model domains covered [1]</td>
<td>2</td>
<td>3</td>
<td>4</td>
<td>4</td>
</tr>
<tr >
<td>Number of models [2] [9]</td>
<td>2</td>
<td>15</td>
<td>30</td>
<td>30</td>
</tr>
<tr >
<td>Operator types</td>
<td>60</td>
<td>250</td>
<td>350</td>
<td>350</td>
</tr>
<tr >
<td>Distributed training</td>
<td>Single node, single card</td>
<td>Single node, multiple cards</td>
<td>Multiple nodes, multiple cards</td>
<td>Multiple nodes, multiple cards</td>
</tr>
<tr >
<td>Large models</td>
<td>No requirement</td>
<td>Inference Level I</td>
<td>Fine-tuning Level I</td>
<td>Pretraining Level III</td>
</tr>
<tr >
<td>CI setup</td>
<td>No requirement</td>
<td>No requirement</td>
<td>Covers build + unit tests</td>
<td>Covers build + unit tests</td>
</tr>
<tr >
<td width="10%" rowspan="3">AI inference chip (data center)</td>
<td>Number of model domains covered</td>
<td>2</td>
<td>3</td>
<td>4</td>
<td>[7]</td>
</tr>
<tr >
<td>Number of models [10] </td>
<td>2[5] /10[6] </td>
<td>15[5] /50[6] </td>
<td>50[5] /100[6] </td>
<td>[7]</td>
</tr>
<tr >
<td>Operator types</td>
<td>30[5] /35[6]</td>
<td>75</td>
<td>175[5] /120[6] </td>
<td>[7]</td>
</tr>
<tr >
<td width="10%" rowspan="3">AI inference chip (mobile/edge computing)</td>
<td>Number of model domains covered</td>
<td>1</td>
<td>2</td>
<td>3</td>
<td>[7]</td>
</tr>
<tr >
<td>Number of models [10]</td>
<td>3 </td>
<td>20 (may be reduced to 10 if quantized models are supported)</td>
<td>50 (may be reduced to 30 if quantized models are supported) </td>
<td>[7]</td>
</tr>
<tr >
<td>Operator types</td>
<td>20</td>
<td>40</td>
<td>75</td>
<td>[7]</td>
</tr>
</table>

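As an illustration of how the numeric rows for AI training chips combine, the sketch below returns the highest of Levels I to III whose domain, model, and operator counts are all met. The class and function names are hypothetical. The sketch covers only the three count-based rows; the distributed training, large model, and CI rows, as well as the mature commercial level, still have to be checked separately.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingChipResult:
    domains: int    # number of model domains covered
    models: int     # number of models adapted
    operators: int  # number of operator types supported


# (level, min domains, min models, min operator types), highest level first,
# taken from the AI training chip rows of the table.
TRAINING_LEVELS = [
    ("III", 4, 30, 350),
    ("II", 3, 15, 250),
    ("I", 2, 2, 60),
]


def highest_numeric_level(result):
    """Return the highest level whose three count thresholds are all met, or None."""
    for level, min_domains, min_models, min_operators in TRAINING_LEVELS:
        if (result.domains >= min_domains
                and result.models >= min_models
                and result.operators >= min_operators):
            return level
    return None
```

For example, a chip covering 3 domains with 20 models and 260 operator types would map to Level II here, because it misses the Level III thresholds on all three counts.
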
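## Notes
- [1] Model domains include: vision (classification, detection, segmentation), OCR, NLP, and time series.
- [2] The Paddle open-source model library includes a large number of classic models and Paddle-specific models, each belonging to a domain. Address: https://github.com/PaddlePaddle/PaddleX/blob/release/3.0-beta/docs/tutorials/models/support_model_list.md
- [3] End-to-end training and inference accuracy alignment on the full dataset.
- [4] Basic training and inference functional verification.
- [5] Adapted via Paddle Inference.
- [6] Adapted via Paddle Lite/ONNX/TVM.
- [7] No adaptation standard exists at this level for this class of chip yet.
- [8] Full operator list for Paddle hardware adaptation: https://github.com/onecatcn/my-demo-code/blob/develop/PaddlePaddle/ops/gpu_ops_2023-03-20.csv
- [9] Training accuracy requirement: the error must be within ±0.3% under FP32 training, or within ±3% under AMP mixed-precision training; meeting either one is sufficient.
- [10] Inference accuracy requirement: consistent with GPU/CPU accuracy (quantized models on mobile/edge chips are expected to incur some specific loss; the hardware vendor provides an accuracy-loss statement, and Paddle developers judge whether it is reasonable).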
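
The "either one is sufficient" rule in note [9] can be sketched as below. This assumes the error is the relative deviation of a final model metric from the GPU baseline, which the note does not state explicitly; the function names and the `(device_metric, gpu_metric)` pair format are hypothetical.

```python
FP32_TOLERANCE = 0.003  # ±0.3% under FP32 training
AMP_TOLERANCE = 0.03    # ±3% under AMP mixed-precision training


def relative_error(device_metric, gpu_metric):
    """Relative deviation from the GPU baseline (assumed error definition)."""
    return abs(device_metric - gpu_metric) / abs(gpu_metric)


def training_precision_ok(fp32=None, amp=None):
    """Each argument is an optional (device_metric, gpu_metric) pair.

    Passing either the FP32 check or the AMP check is enough.
    """
    fp32_ok = fp32 is not None and relative_error(*fp32) < FP32_TOLERANCE
    amp_ok = amp is not None and relative_error(*amp) < AMP_TOLERANCE
    return fp32_ok or amp_ok
```

A model that misses the FP32 bound can therefore still count toward the model total if its AMP run stays within ±3% of the GPU result.
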