Download the source code
git clone https://github.com/NVIDIA/TensorRT-LLM -b v0.11.0
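Building the wheel from source typically also needs the repository's third-party submodules (for example cutlass); if the later build step fails because of missing sources, initialize them inside the cloned directory first (standard git step, not specific to this example):
git -C TensorRT-LLM submodule update --init --recursive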
Copy the modified files into the corresponding directories
cd TensorRT-LLM
cp ./w4afp8_fp8_example/build_and_run.sh examples/llama/
cp ./w4afp8_fp8_example/combine_fp8_into_w4a8_awq.py examples/llama/
cp ./w4afp8_fp8_example/examples/quantization/quantize.py examples/quantization/
cp ./w4afp8_fp8_example/tensorrt_llm/builder.py tensorrt_llm/builder.py
cp ./w4afp8_fp8_example/tensorrt_llm/models/modeling_utils.py tensorrt_llm/models/modeling_utils.py
cp ./w4afp8_fp8_example/tensorrt_llm/quantization/quantize.py tensorrt_llm/quantization/quantize.py
cp ./w4afp8_fp8_example/tensorrt_llm/quantization/quantize_by_modelopt.py tensorrt_llm/quantization/quantize_by_modelopt.py
Build and install
python3 ./scripts/build_wheel.py --trt_root /usr/local/tensorrt --cuda_architectures "90" -c
pip install ./build/tensorrt_llm*.whl
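As a quick sanity check that the locally built wheel is the one being imported (this assumes the package exposes __version__, which recent TensorRT-LLM releases do):
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"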
Run the test
cd examples/llama/
We can consider setting the following five GEMMs to FP8 precision. Each GEMM and its corresponding fp8_modules_list value are listed below (comma-separated, no spaces):
qkv gemm: "q_proj,k_proj,v_proj"
attention o_proj: "o_proj"
mlp up_proj: "up"
mlp gate: "gate"
mlp down_proj: "down"
We can also add a "layers.xxx" prefix to restrict FP8 to specific GEMMs in specific layers. For example, to set only the gate GEMM of layers 22 and 23 to FP8, use fp8_modules_list = "*layers.22.gate,*layers.23.gate".
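The entries in fp8_modules_list behave like shell-style wildcard patterns matched against module names. The snippet below is only an illustrative sketch using Python's fnmatch with made-up module names; the real matching logic lives in the modified quantization code and may differ in detail:

import fnmatch

# Hypothetical module names, for illustration only.
names = [
    "transformer.layers.22.attention.q_proj",
    "transformer.layers.22.gate",
    "transformer.layers.23.gate",
    "transformer.layers.24.gate",
]

fp8_modules_list = "*layers.22.gate,*layers.23.gate"
patterns = fp8_modules_list.split(",")  # comma-separated, no spaces

# A GEMM runs in FP8 if any pattern matches its name; otherwise it stays W4A8-AWQ.
for name in names:
    use_fp8 = any(fnmatch.fnmatch(name, p) for p in patterns)
    print(name, "-> fp8" if use_fp8 else "-> w4a8")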
Run the examples
bash -x build_and_run.sh $LLAMA_PATH "*q_proj*,*k_proj*,*v_proj*"
bash -x build_and_run.sh $LLAMA_PATH "*q_proj*,*k_proj*,*v_proj*,*up*"
bash -x build_and_run.sh $LLAMA_PATH "*q_proj*,*k_proj*,*v_proj*,*down*"
bash -x build_and_run.sh $LLAMA_PATH "*q_proj*,*k_proj*,*v_proj*,*gate*"
bash -x build_and_run.sh $LLAMA_PATH "*q_proj*,*k_proj*,*v_proj*,*o_proj*"
bash -x build_and_run.sh $LLAMA_PATH "*layers.1.*"