[Quantization] int8 compute with PostTraining Quantization (W8A8) #2432
Comments
The compiler further lowers the Core ML model. It recognizes patterns like dequant -> op(s) -> quant and fuses them so the enclosed ops run in int8.
The quant / dequant layers are expected. These are inserted in the torch model during the "prepare" stage to simulate quantization effects during training, allowing the model to adapt to reduced precision while still training in fp16. During training we also estimate scales and zero points for the actual quantization. The speedup for W8A8 models comes from int8-int8 compute on the Neural Engine (https://apple.github.io/coremltools/docs-guides/source/opt-quantization-perf.html#performance). The NE compiler automatically detects and fuses dequant -> op(s) -> quant patterns in the model to run int8-int8 compute wherever possible. I believe your current quantized model should run integer-only compute when run on the NE.
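For reference, a minimal sketch of the "prepare" stage described above, using `coremltools.optimize.torch`. The class and method names follow the coremltools docs, but treat the exact signatures and default config as assumptions for your installed version:

```python
import torch
from coremltools.optimize.torch.quantization import (
    LinearQuantizer,
    LinearQuantizerConfig,
)

model = torch.nn.Sequential(torch.nn.Conv2d(1, 16, 3), torch.nn.ReLU())

# prepare() inserts fake-quant (quant/dequant) layers around ops so the
# model can adapt to int8 precision while still training in floating point.
quantizer = LinearQuantizer(model, LinearQuantizerConfig())
prepared_model = quantizer.prepare(example_inputs=(torch.rand(1, 1, 64, 64),))

# ... fine-tune prepared_model, calling quantizer.step() each iteration ...

# finalize() folds the learned scales / zero points into the quantized model.
quantized_model = quantizer.finalize()
```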
It could be this, due to the channel dimension of 1 in the conv. From https://machinelearning.apple.com/research/neural-engine-transformers: "For example, if the last axis is used as a singleton one by the model implementation's data format, it will be padded to 64 bytes, which results in 32 times the memory cost in 16-bit and 64 times the memory cost in 8-bit precision. Such an increase in buffer size will significantly reduce the chance of L2 cache residency and increase the chance of hitting DRAM. This is not desirable from a power and latency perspective."
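A back-of-envelope check of the padding cost quoted above, assuming the Neural Engine aligns the last axis to 64 bytes:

```python
# A singleton last axis stores one element per 64-byte aligned row.
ALIGNMENT = 64  # bytes

for name, elem_bytes in [("fp16", 2), ("int8", 1)]:
    useful = 1 * elem_bytes      # one element on the singleton last axis
    padded = ALIGNMENT           # rounded up to the 64-byte boundary
    print(f"{name}: {padded // useful}x memory")  # fp16 -> 32x, int8 -> 64x
```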
❓Question
I'm trying to understand the runtime behavior of W8A8 quantized networks on Core ML. I have written up a very simple model as follows:
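(The original model definition was not captured; the following is a hypothetical stand-in for illustration. The single-channel conv matches the reply above that points at a channel dimension of 1.)

```python
import torch
import torch.nn as nn

class SimpleModel(nn.Module):
    """A hypothetical "very simple model": one conv with in_channels=1."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(in_channels=1, out_channels=16,
                              kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.conv(x))

example_input = torch.rand(1, 1, 64, 64)
traced_model = torch.jit.trace(SimpleModel().eval(), example_input)
```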
I convert this model to W8A8 format using post-training quantization and export it to an .mlpackage via the following code:
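(The conversion snippet was also not captured; below is a minimal sketch of a W8A8 post-training flow based on the coremltools 8 docs. The activation-quantization API is experimental, so treat the `experimental` module path, the `sample_data` format, and the exact signatures as assumptions for your installed version.)

```python
import coremltools as ct
import coremltools.optimize as cto
from coremltools.optimize.coreml import experimental as cto_experimental

# Convert the traced torch model to an ML Program.
mlmodel = ct.convert(
    traced_model,
    inputs=[ct.TensorType(name="x", shape=example_input.shape)],
    minimum_deployment_target=ct.target.iOS17,
)

# A8: calibrate and quantize activations using sample inputs.
act_config = cto.coreml.OptimizationConfig(
    global_config=cto_experimental.OpActivationLinearQuantizerConfig(
        mode="linear_symmetric"
    )
)
mlmodel_a8 = cto_experimental.linear_quantize_activations(
    mlmodel, act_config, sample_data=[{"x": example_input.numpy()}]
)

# W8: quantize weights to int8.
weight_config = cto.coreml.OptimizationConfig(
    global_config=cto.coreml.OpLinearQuantizerConfig(
        mode="linear_symmetric", dtype="int8"
    )
)
mlmodel_w8a8 = cto.coreml.linear_quantize_weights(mlmodel_a8, weight_config)
mlmodel_w8a8.save("simple_model_w8a8.mlpackage")
```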
From the graph of the exported Core ML model:
My questions are: