04/?? deploy <llama.onnx, quant table> to aarch64
04/19 remove GPTQ zero point guidance
04/18 export mixed-precision quant table from GPTQ-for-LLaMa
04/11 add 13GB onnx-fp16 models
04/11 add memory pool, support 2GB RAM laptop ⭐
04/10 reduce onnx model size to 26GB
04/10 support temperature, topk and logits warp
04/07 add onnxruntime demo
04/05 init project
- Release llama 7B onnx models
- With a 400-line onnxruntime alpaca demo
- neither `torch` nor `transformers` required
- support memory pool, works on a 2GB laptop/PC (very slow 🐢)
Why do this?
- Visualization. `graphviz` crashes on the llama model; an LLM visualization tool must support nesting or operator folding
- Quantization. LLMs often repeat themselves, just like a fractal. For llama quantization, loading part of the decoder backbone would be enough (about 400MB), so it could be quantized partially
- Embedded devices. Small boards hit IO errors when `dd`-ing a big single file
- Distributed systems. Inference of an LLM on many hybrid (FPGA/NPU/GPGPU) devices would be simpler
- onnx tools. Device manufacturers already support onnx well, so there is no reason to neglect it
Download onnx models here:
Precision | Size | URL |
---|---|---|
fp32 | 26GB | huggingface |
fp16 | 13GB | huggingface or 硬件模型库 (hardware model zoo) |
Here is the graph showing how to call them:
Try onnxruntime demo
No `torch` is required, and the precision has been checked.
$ python3 -m pip install -r requirements.txt
$ python3 demo-single.py ${FP16_ONNX_DIR} "bonjour"
..
# If you only have 4GB memory, use `--poolsize`
$ python3 demo-single.py ${FP16_ONNX_DIR} "bonjour" --poolsize 4
..
Bonjour.
# Try more options
$ python3 demo-single.py --help
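Because the demo only depends on onnxruntime, you can also poke at one of the exported graphs directly from Python. Below is a minimal sketch, assuming a hypothetical part file name; check the real input names, shapes and types with `get_inputs()` before feeding real data.

```python
# Minimal sketch (not the demo itself): drive one exported onnx graph with
# onnxruntime only, no torch. "decoder-part-0.onnx" is a hypothetical file name.
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("decoder-part-0.onnx", providers=["CPUExecutionProvider"])

# Look up the real input names, shapes and types instead of guessing them.
for t in sess.get_inputs():
    print(t.name, t.shape, t.type)

# Build dummy feeds matching the reported types; dynamic dims become 1.
dtype_map = {"tensor(float16)": np.float16, "tensor(float)": np.float32, "tensor(int64)": np.int64}
feeds = {
    t.name: np.zeros([d if isinstance(d, int) else 1 for d in t.shape],
                     dtype=dtype_map.get(t.type, np.float32))
    for t in sess.get_inputs()
}
outputs = sess.run(None, feeds)
print([o.shape for o in outputs])
```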
STEP1 Convert to HF format
These models were converted from the alpaca huggingface checkpoint.
- If you are using LLaMa or llama.cpp, convert it to HF format first. Here are the steps:

  # install transformers master
  $ git clone https://github.com/huggingface/transformers
  $ cd transformers && python3 setup.py install
  ..
  $ python3 src/transformers/models/llama/convert_llama_weights_to_hf.py --input_dir ${LLaMa_PATH} --model_size 7B --output_dir ${HF_PATH}
- If you are using alpaca-lora, use this script to merge the LoRA weights (a hedged sketch follows this list)
- If you are using alpaca, go to STEP2
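If you would rather merge the LoRA weights in Python, here is a hedged sketch with `peft`; the paths are placeholders and the referenced script remains the authoritative route.

```python
# Hedged sketch of folding LoRA weights into the base model with peft,
# as an alternative to the referenced merge script. Paths are placeholders.
import torch
from peft import PeftModel
from transformers import LlamaForCausalLM, LlamaTokenizer

base = LlamaForCausalLM.from_pretrained("${HF_LLAMA_PATH}", torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, "${LORA_ADAPTER_PATH}")
model = model.merge_and_unload()               # bake the LoRA deltas into the base weights
model.save_pretrained("${HF_PATH}")            # plain HF checkpoint, usable by STEP2
LlamaTokenizer.from_pretrained("${HF_LLAMA_PATH}").save_pretrained("${HF_PATH}")
```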
STEP2 torch.onnx.export
Check out transformers to this hacking branch and run a single inference:
$ python3 tools/export-onnx.py ${PATH_ALPACA_7B}
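For orientation, here is a heavily simplified sketch of what the export boils down to with `torch.onnx.export`. The real script splits the model into several onnx parts and handles the KV cache; the output file name, sequence length and opset below are illustrative assumptions.

```python
# Heavily simplified sketch of the export step, not the actual tools/export-onnx.py:
# trace one forward pass of the HF checkpoint and dump a single onnx graph.
import torch
from transformers import LlamaForCausalLM

class LogitsOnly(torch.nn.Module):
    """Wrap the HF model so the traced graph returns only the logits tensor."""
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, input_ids):
        return self.model(input_ids=input_ids, use_cache=False, return_dict=False)[0]

model = LogitsOnly(LlamaForCausalLM.from_pretrained("${PATH_ALPACA_7B}")).eval()
dummy_ids = torch.ones(1, 32, dtype=torch.int64)      # placeholder token ids

torch.onnx.export(
    model,
    (dummy_ids,),
    "llama-7b-fp32.onnx",                             # hypothetical output name
    input_names=["input_ids"],
    output_names=["logits"],
    dynamic_axes={"input_ids": {1: "seq_len"}, "logits": {1: "seq_len"}},
    opset_version=16,
)
```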
STEP3 Convert to fp16
Use `onnxconverter-common.float16`:
$ cd tools
$ python3 -m pip install -r requirements.txt
$ python3 convert-fp32-to-fp16.py ${FP32_PATH} ${FP16_PATH}
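A sketch of what the conversion amounts to, assuming the stock `onnxconverter-common` API; the file names are placeholders, and the real script may do extra bookkeeping for the external weight files of the 26GB fp32 model.

```python
# Sketch of the fp32 -> fp16 cast with onnxconverter-common; paths are placeholders.
import onnx
from onnxconverter_common import float16

model = onnx.load("${FP32_PATH}/llama-7b-fp32.onnx")
model_fp16 = float16.convert_float_to_float16(model)
onnx.save_model(
    model_fp16,
    "${FP16_PATH}/llama-7b-fp16.onnx",
    save_as_external_data=True,        # weights over 2GB must live outside the protobuf
    all_tensors_to_one_file=True,
)
```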
Mixed-precision kernel optimization is on the way. Here is part of the guidance.
- No `logits_processor` or `BeamSearch` is implemented, so the results may not be great
- I have compared the output values of `onnxruntime-cpu` and `torch-cuda`; the maximum error is 0.002, not bad
- The current state is equivalent to these configurations:
  - temperature=0.1
  - total_tokens=2000
  - top_p=1.0
  - top_k=40
  - repetition_penalty=1.0
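Those settings imply a simple logits warp: divide by the temperature, keep the top-k logits, and sample; `top_p=1.0` and `repetition_penalty=1.0` are effectively no-ops. Here is an illustrative numpy sketch, not the repo's exact sampling code.

```python
# Illustrative logits warp: temperature scaling followed by top-k sampling.
import numpy as np

def warp_and_sample(logits: np.ndarray, temperature: float = 0.1, top_k: int = 40) -> int:
    logits = logits / temperature                    # sharpen the distribution
    kth_largest = np.sort(logits)[-top_k]            # threshold for the top_k logits
    logits = np.where(logits < kth_largest, -np.inf, logits)
    probs = np.exp(logits - logits.max())            # numerically stable softmax
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))

# Example: pick the next token id from random logits over a 32000-token vocabulary.
next_id = warp_and_sample(np.random.randn(32000).astype(np.float32))
```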