A uniform "fake" quantization method supports an arbitrary number of bits (>=2) which is used to represent weights and activations. The method performs differentiable sampling of the continuous signal (for example, activations or weights) during forward pass, simulating inference with integer arithmetic.
Quantization is parametrized by the clamping range and the number of quantization levels. `input_low` and `input_high` represent the quantization range, and the scaled value is rounded to the nearest integer (denoted $\left\lfloor\cdot\right\rceil$ below).
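A sketch of the sampling formula in terms of these parameters (this is the standard uniform fake-quantization form; the exact expression used by the framework may be arranged differently):

$$output = \frac{\left\lfloor \left(clamp(input;\ input\_low,\ input\_high) - input\_low\right)\cdot s\right\rceil}{s} + input\_low$$

$$clamp(input;\ input\_low,\ input\_high) = \min\left(\max(input,\ input\_low),\ input\_high\right)$$

$$s = \frac{levels - 1}{input\_high - input\_low}$$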
During training, we optimize the `scale` parameter that represents the range `[input_low, input_high]` of the original signal using gradient descent.
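One common way to express this range through a single trainable `scale` (a sketch consistent with the description above; the implementation's exact parametrization may differ):

$$input\_low = scale \cdot \frac{level\_low}{level\_high}, \qquad input\_high = scale$$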
In the formula above, `level_low` and `level_high` represent the range of the discrete signal.
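For illustration, the usual signed and unsigned integer grids give ranges along the following lines (treat the exact values as an assumption rather than the framework's definitions):

- for weights (signed, symmetric around zero): $level\_low = -2^{bits-1}+1$, $level\_high = 2^{bits-1}-1$;
- for unsigned activations: $level\_low = 0$, $level\_high = 2^{bits}-1$;
- for signed activations: $level\_low = -2^{bits-1}$, $level\_high = 2^{bits-1}-1$.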
For all the cases listed above, the common quantization formula is simplified after substitution of `input_low`, `input_high` and `levels`.
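A sketch of the resulting simplified form, obtained by direct substitution (the framework's expression may be arranged differently):

$$output = \frac{\left\lfloor clamp\left(input \cdot \frac{level\_high}{scale},\ level\_low,\ level\_high\right)\right\rceil}{level\_high} \cdot scale$$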
Use the `num_init_samples` parameter from the `initializer` group to initialize the values of `scale` and to determine which activations should be signed or unsigned from the statistics collected over the given number of samples.
During training, we optimize the `input_low` and `input_range` parameters using gradient descent.
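The upper bound of the range is derived from these two trainable parameters; a sketch of the relationship implied by the parameter names (an assumption consistent with the description):

$$input\_high = input\_low + input\_range$$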
For better accuracy, the floating-point zero should lie within the quantization range and be exactly mapped to a quantization level ("quant"), i.e. without rounding. Therefore, a zero-alignment scheme is applied to the ranges of weights and activations before quantization.
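A sketch of the idea (the exact adjustment used by the implementation may differ): first extend the range so that it contains zero,

$$input\_low' = \min(input\_low,\ 0), \qquad input\_high' = \max(input\_high,\ 0)$$

and then nudge the bounds so that zero lands exactly on one of the integer levels, i.e. so that the zero point

$$ZP = \left\lfloor \frac{-input\_low' \cdot (levels - 1)}{input\_high' - input\_low'}\right\rceil$$

corresponds to floating-point zero without any rounding error.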
You can use the `num_init_samples` parameter from the `initializer` group to initialize the values of `input_low` and `input_range` from the statistics collected over the given number of samples.
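A minimal sketch of such an initializer section (field names follow the parameter reference later in this document):

{
    "initializer": {
        "range": {
            "num_init_samples": 256,
            "type": "minmax"
        }
    }
}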
NNCF can quantize models to achieve the best results on a given Intel hardware type when executed using the OpenVINO runtime. To achieve this, the quantizer setup should be performed with the following considerations in mind:
- every operation that can accept quantized inputs on a given HW (i.e. can be executed using quantized input values) should have its inputs quantized in NNCF
- the quantized inputs should be quantized with a configuration that is supported on a given HW for a given operation (e.g. per-tensor vs per-channel quantization, or 8 bits vs. 4 bits)
- for operations that are agnostic to quantization, the execution should handle quantized tensors rather than full-precision tensors.
- certain operation sequences will be runtime-optimized to execute in a single kernel call ("fused"), and additional quantizer insertion/quantization simulation within such operation sequences will be detrimental to overall performance
These requirements are fulfilled by the quantizer propagation algorithm. The algorithm first searches the internal NNCF representation of the model's control flow graph for predefined "fusable" patterns and applies the fusing to this internal graph representation. Next, each operation in the graph that can be associated with an input-quantizable operation on the given target hardware is assigned a single quantizer for each of its quantizable activation inputs, with a set of possible quantizer configurations (those feasible on the target HW) attached to it. The quantizers are then "propagated" against the data flow in the model's control flow graph as far as possible, potentially merging with other quantizers. Once all quantizers have reached a standstill in their propagation, each has a final (possibly reduced) set of possible quantizer configurations, from which a single one is chosen either manually or by a precision initialization algorithm (which accepts the potential quantizer locations and the associated sets of potential quantizer configurations). The resulting configuration is then applied as the final quantizer setup.
Note that this algorithm applies to activation quantization only - the weight quantizers do not require propagation. However, the possible configurations of weight quantizers themselves are also sourced from the HW config file definitions.
The HW to target for a given quantization algorithm run can be specified in the NNCF config using the global "target_device" option. The default corresponds to CPU-friendly quantization.
"TRIAL"
corresponds to a configuration that uses the general quantizer propagation algorithm, but does not use any HW-specific information about quantizability of given operation types or possible quantizer configs for associated inputs or operation weights.
Instead it uses a default, basic 8-bit symmetric per-tensor quantization configuration for each quantizer, and quantizes inputs of a certain default operation set, which at the moment is defined internally in NNCF.
The quantization configuration in the "target_device": "TRIAL" case may be overridden using the regular "activations" and "weights" sections in the quantization compression algorithm sub-config, see below.
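For illustration, a minimal sketch of such an override (the field placement follows the parameter reference later in this document; adjust the values to your needs):

{
    "target_device": "TRIAL",
    "compression": {
        "algorithm": "quantization",
        "weights": { "mode": "symmetric", "bits": 8, "per_channel": true },
        "activations": { "mode": "asymmetric", "bits": 8 }
    }
}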
For all target HW types, parts of the model graph can be marked as non-quantizable by using the "ignored_scopes" field - inputs and weights of matching nodes in the NNCF internal graph representation will not be quantized, and the downstream quantizers will not propagate upwards through such nodes.
In our implementation, we use a slightly transformed formula. It is equivalent, up to the order of floating-point operations, to both the simplified symmetric formula and the asymmetric one. The small difference is the addition of a small positive number `eps` to prevent division by zero, and taking the absolute value of the range, since it might become negative during the backward pass.
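A sketch of such a transformed formula under these assumptions (the exact expression in the code may differ):

$$input\_range_{safe} = \left|input\_high - input\_low\right| + eps, \qquad s = \frac{levels - 1}{input\_range_{safe}}$$

$$output = \frac{\left\lfloor clamp\left(input - input\_low,\ 0,\ input\_range_{safe}\right)\cdot s\right\rceil}{s} + input\_low$$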
Quantization to lower precisions (e.g. 6, 4, 2 bits) is an efficient way to accelerate inference of neural networks. Although NNCF supports quantization with an arbitrary number of bits to represent weight and activation values, choosing an ultra-low bitwidth could noticeably affect the model's accuracy. A good trade-off between accuracy and performance is achieved by assigning different precisions to different layers. NNCF utilizes the HAWQ-v2 method to automatically choose the optimal mixed-precision configuration by taking into account the sensitivity of each layer, i.e. how much lower-bit quantization of each layer decreases the accuracy of the model. The most sensitive layers are kept at higher precision. The sensitivity of the i-th layer is calculated by multiplying the average Hessian trace with the L2 norm of the quantization perturbation.
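A sketch of this sensitivity measure in the spirit of HAWQ-v2 (the notation is assumed here: $\overline{Tr}(H_i)$ is the average Hessian trace of the i-th layer, $W_i$ are its weights and $Q(\cdot)$ is the quantization function):

$$\Omega_i = \overline{Tr}(H_i) \cdot \left\| Q(W_i) - W_i \right\|_2^2$$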
The sum of the sensitivities over all layers forms a metric which serves as a proxy for the accuracy of the compressed model: the lower the metric, the more accurate the corresponding mixed-precision model should be on the validation dataset.
To find the optimal trade-off between accuracy and performance of the mixed-precision model, we also compute a compression ratio - the ratio between the bit complexity of a fully INT8 model and that of the mixed-precision lower-bitwidth one. The bit complexity of the model is the sum of the bit complexities of all quantized layers, each defined as the product of the layer's FLOPs and its quantization bitwidth. The optimal configuration is found by calculating the sensitivity metric and the compression ratio for all possible bitwidth settings and selecting the one with the minimal metric value among all configurations with a compression ratio below the specified threshold.
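With this definition of bit complexity, the ratio can be written as follows (a sketch with assumed notation, where $FLOPs_i$ is the complexity of the i-th quantized layer and $b_i$ its assigned bitwidth):

$$ratio = \frac{\sum_i FLOPs_i \cdot 8}{\sum_i FLOPs_i \cdot b_i}$$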
By default, the compression ratio is 1.5, which should be enough to compress the model with no more than a 1% accuracy drop. If that is not the case, a lower ratio can be set via the `compression_ratio` parameter in the `precision` section of the configuration file.
To avoid the exponential search procedure, we apply the following restriction: layers with a small average Hessian trace value are quantized to lower bitwidth and vice versa.
The Hessian trace is estimated with the randomized Hutchinson algorithm. Given a Rademacher-distributed random vector v, the trace of a symmetric matrix H equals the expectation of a quadratic form.
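Concretely, for a random vector $v$ with $\mathbb{E}\left[v v^T\right] = I$ (which holds for Rademacher vectors):

$$Tr(H) = Tr\left(H\,\mathbb{E}\left[v v^T\right]\right) = \mathbb{E}\left[v^T H v\right]$$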
The randomized algorithm approximates this expectation by Monte Carlo: sampling v from its distribution, evaluating the quadratic form, and averaging.
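With $m$ sampled vectors $v_1, \dots, v_m$ the estimate becomes:

$$Tr(H) \approx \frac{1}{m}\sum_{k=1}^{m} v_k^T H v_k$$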
Evaluation of the quadratic term happens by computing $Hv$ - the result of multiplying the Hessian matrix by a given random vector $v$ - without explicitly forming the Hessian operator. For the gradient $g_i$ of the loss with respect to the i-th parameter block $W_i$ and for a random vector $v$ that is independent of $W_i$, we have the equation:

$$\frac{\partial\left(g_i^T v\right)}{\partial W_i} = \frac{\partial g_i^T}{\partial W_i}\, v = H_i v$$

where $H_i$ is the Hessian matrix of the loss with respect to $W_i$. Hence $H_i v$ can be computed by two backpropagation passes: the first with respect to the loss, and the second with respect to the product of the gradients and the random vector.
The aforementioned procedure sets bitwidths for weight quantizers only. Bitwidths for activation quantizers are assigned in the next step in one of two modes: strict or liberal. In the strict mode, all quantizers between modules with quantizable inputs share the same bitwidth. The liberal mode allows different precisions within such a group. In both cases, bitwidths are assigned based on the rules of the hardware config. If multiple variants are possible, the minimal compatible bitwidth is chosen. By default, the liberal mode is used, as it does not reject a large number of possible bitwidth settings.
The `bitwidth_assignment_mode` parameter can be used to switch to the strict mode.
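A sketch of such an override (the placement of the parameter inside the `precision` initializer section and the exact value string are assumptions; check them against your NNCF version):

"initializer": {
    "precision": {
        "type": "hawq",
        "bits": [4, 8],
        "bitwidth_assignment_mode": "strict"
    }
}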
For automatic mixed-precision selection, it is recommended to use the following configuration file template:
"optimizer": {
"base_lr": 3.1e-4,
"schedule_type": "plateau",
"type": "Adam",
"scheduler_params": {
"threshold": 0.1,
"cooldown": 3
},
"weight_decay": 1e-05
},
"compression": {
"algorithm": "quantization",
"initializer": {
"precision": {
"type": "hawq",
"bits": [4,8]
"compression_ratio": 1.5,
}
}
}
Note that the optimizer parameters are model-specific; this template contains optimal ones for ResNet-like models.
Here's an example of using the template in the full configuration file.
This template uses the plateau scheduler. Though it usually takes many epochs of tuning to achieve good model accuracy, this is the most reliable way. Staged quantization is an alternative approach and can be more than two times faster, but it may require tweaking of hyper-parameters for each model. Please refer to the configuration files whose names end with *_staged for an example of this method.
The manual mode of mixed-precision quantization is also available: the bitwidth can be set explicitly per layer through the `bitwidth_per_scope` parameter.
NOTE: Precision initialization overrides the bits settings specified in the `weights` and `activations` sections of the configuration file.
After the compression-related changes in the model have been committed, the statistics of the batchnorm layers
(per-channel rolling means and variances of activation tensors) can be updated by passing several batches of data
through the model before fine-tuning starts. This allows correcting the compression-induced bias in the model and reducing the corresponding accuracy drop even before training. This option is common to the quantization, magnitude sparsity and filter pruning algorithms. It can be enabled by setting a non-zero value of `num_bn_adaptation_samples` in the `batchnorm_adaptation` section of the `initializer` configuration (see the example below).
Quantization configuration file parameters:
{
"algorithm": "quantization",
"initializer": {
"range": {
"num_init_samples": 256, // Number of samples from the training dataset to consume as sample model inputs for purposes of setting initial minimum and maximum quantization ranges
"type": "minmax" // Type of the initializer - determines which statistics gathered during initialization will be used to initialize the quantization ranges
},
"precision": {
"type": "hawq", // Type of precision initialization - either "manual" or "hawq". With "manual", precisions are defined explicitly via "bitwidth_per_scope". With "hawq", these are determined automatically using the HAWQ algorithm.
"bits": [4, 8], // A list of bitwidth to choose from when performing precision initialization. Overrides bitwidth constraints specified in `weight` and `activation` sections",
"num_data_points": 100, // Number of data points to iteratively estimate Hessian trace, 100 by default.
"iter_number": 200, // Maximum number of iterations of Hutchinson algorithm to estimate Hessian trace, 200 by default
"tolerance": 1e-4, // Minimum relative tolerance for stopping the Hutchinson algorithm. It's calculated between mean average trace from previous iteration and current one. 1e-4 by default
"compression_ratio": 1.5, // The desired ratio between bits complexity of fully INT8 model and mixed-precision lower-bit one.
"bitwidth_per_scope": [ // Manual settings for the quantizer bitwidths. Scopes are used to identify the weight quantizers. The same number of bits is assigned to adjacent activation quantizers. By default bitwidth is taken from global quantization parameters from `weights` and `activations` sections above
[
4,
"MobileNetV2/Sequential[features]/InvertedResidual[8]/Sequential[conv]/NNCFConv2d[0]/ModuleDict[pre_ops]/UpdateWeight[0]/AsymmetricQuantizer[op]"
], // A tuple of a bitwidth and a scope
[
4,
"ModuleDict/AsymmetricQuantizer[MobileNetV2/Sequential[features]/InvertedResidual[15]/Sequential[conv]/ReLU6[5]/hardtanh_0]"
]
]
},
"batchnorm_adaptation": {
"num_bn_adaptation_samples": 2048, // Number of samples from the training dataset to pass through the model at initialization in order to update batchnorm statistics of the original model. The actual number of samples will be a closest multiple of the batch size.
"num_bn_forget_samples": 1024, // Number of samples from the training dataset to pass through the model at initialization in order to erase batchnorm statistics of the original model (using large momentum value for rolling mean updates). The actual number of samples will be a closest multiple of the batch size.
}
},
"weights": { // Constraints to be applied to model weights quantization only.
"mode": "symmetric", // Mode of quantization
"bits": 8, // Bitwidth to quantize to. It is intended to manually specify bitwidth for all weights. Can be overridden by the `bits` parameter from the `precision` initializer section. An error happens if it doesn't match a bitwidth constraints for module weight specified in the hardware configuration.
"signed": true, // Whether to use signed or unsigned input/output values for quantization. If specified as unsigned and the input values during initialization have differing signs, will reset to performing signed quantization instead.
"per_channel": false, // Whether to quantize inputs per channel (i.e. per 0-th dimension for weight quantization,and per 1-st dimension for activation quantization)
// A list of model control flow graph node scopes to be ignored for this operation - functions as a 'denylist'. Optional.
"ignored_scopes": []
// A list of model control flow graph node scopes to be considered for this operation - functions as an 'allowlist'. Optional.
// "target_scopes": []
},
"activations": { // Constraints to be applied to model activations quantization only.
"mode": "symmetric", // Mode of quantization
"bits": 4, // Bitwidth to quantize to. It is intended to manually specify bitwidth for all activations. Can be overridden by the `bits` parameter from the `precision` initializer section. An error happens if it doesn't match a bitwidth constraints for module inputs specified in the hardware configuration.
"signed": true, // Whether to use signed or unsigned input/output values for quantization. If specified as unsigned and the input values during initialization have differing signs, will reset to performing signed quantization instead.
"per_channel": false, // Whether to quantize inputs per channel (i.e. per 0-th dimension for weight quantization,and per 1-st dimension for activation quantization)
// A list of model control flow graph node scopes to be ignored for this operation - functions as a 'denylist'. Optional.
"ignored_scopes": []
// A list of model control flow graph node scopes to be considered for this operation - functions as an 'allowlist'. Optional.
// "target_scopes": []
// Specifies points in the model which will share the same quantizer module for activations. This is helpful in case one and the same quantizer scale is required for inputs to the same operation. Each sub-array defines a group of activation quantizer insertion points that have to share a single actual quantization module; each entry in the sub-array should correspond to exactly one node in the NNCF graph, and the groups should not overlap. The final quantizer for each sub-array will be associated with the first element of this sub-array.
"linked_quantizer_scopes": []
},
"quantize_inputs": true, // Whether the model inputs should be immediately quantized prior to any other model operations."
"quantizable_subgraph_patterns": [ // Each sub-list in this list will correspond to a sequence of operations in the model control flow graph that will have a quantizer appended at the end of the sequence
[
"cat",
"batch_norm"
],
[
"h_swish"
]
],
"scope_overrides": { // This option is used to specify overriding quantization constraints for specific scope, e.g. in case you need to quantize a single operation differently than the rest of the model.
"{re}.*InvertedResidual.*": {
"mode": "symmetric", // Mode of quantization
"bits": 4, // Bitwidth to quantize to.
"signed": true, // Whether to use signed or unsigned input/output values for quantization. If specified as unsigned and the input values during initialization have differing signs, will reset to performing signed quantization instead.
"per_channel": false // Whether to quantize inputs per channel (i.e. per 0-th dimension for weight quantization,and per 1-st dimension for activation quantization)
}
},
// A list of model control flow graph node scopes to be ignored for this operation - functions as a 'denylist'. Optional.
"ignored_scopes": [],
// A list of model control flow graph node scopes to be considered for this operation - functions as an 'allowlist'. Optional.
// "target_scopes": [],
// Determines how should the additional quantization operations be exported into the ONNX format. Set this to false for export to OpenVINO-supported FakeQuantize ONNX, or to true for export to ONNX standard QuantizeLinear-DequantizeLinear node pairs (8-bit quantization only in the latter case). Default: false
"export_to_onnx_standard_ops": false,
}
Per-layer range initialization parameters:
Per-layer range initialization can be enabled by specifying the "range" field of the "initializer" section as a list of dictionaries in the following format:
{
"range": [
{
"type": "min_max", // Type of the initializer - determines which statistics gathered during initialization will be used to initialize the quantization ranges for all modules specified by `"target_scopes"` or `"ignored_scopes"`.
"num_init_samples": 256, // Number of samples from the training dataset to consume as sample model inputs for purposes of setting initial minimum and maximum quantization ranges
"target_scopes": [], // A list of model control flow graph node scopes to be considered for this operation - functions as a 'allowlist'. Optional.
"ignored_scopes": [], // A list of model control flow graph node scopes to be ignored for this operation - functions as a 'denylist'. Optional.
"target_quantizer_group": "weights" // Type of quantizer group to which this initialization of ranges will be applied. Optional. (By default this initialization of ranges will be applied to weights and activations quantizers)
},
...
]
}
Initialization of ranges defined in this way must specify an unambiguous initialization rule for each module.