The IREE Dispatch Profiler is a Python-based tool designed to achieve two primary objectives: functional verification and performance profiling for individual dispatches, such as matrix multiplication, batch matrix multiplication, and convolutions. This tool ensures that performance optimizations maintain functionality and provides a convenient way to quantitatively measure performance. Additionally, the tool offers dispatch generation and compilation capabilities. In summary, the IREE dispatch profiler accomplishes the following:
- Auto-generation of MLIR dispatches (e.g., matmul, batch_matmul, convolutions, fused dispatches).
- Compilation of generated MLIR dispatches into binaries (vmfb).
- Functional verification against Python-based reference implementations.
- Performance profiling and reporting.
The tool uses the following terminology:
- Operation: An operation structure captures the functional description of an operation. For example, a matmul operation includes the data type, layout, and problem shape of the matrix multiplication.
- Tuning Configuration: Tuning configurations are attributes applied to the IREE compilation flow that can alter the performance of the compiled dispatch without affecting its functionality.
- Dispatch: A dispatch is a combination of an operation and its corresponding tuning configuration.
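As a loose illustration of how these three terms fit together, a dispatch can be thought of as an (operation, tuning configuration) pair whose name concatenates the two. The sketch below is only a mental model; the class and field names are hypothetical and not the tool's actual data structures:

```python
from dataclasses import dataclass

# Hypothetical sketch: the real tool defines its own operation/configuration
# classes; these names and fields are illustrative assumptions only.
@dataclass
class MatmulOperation:
    m: int
    n: int
    k: int
    lhs: str      # e.g. "f16t"; encodes data type and layout per the generated names
    rhs: str
    result: str

    def name(self) -> str:
        return f"matmul_{self.m}x{self.n}x{self.k}_{self.lhs}_{self.rhs}_{self.result}"

@dataclass
class TuningConfiguration:
    name: str     # e.g. "tile_config_128x128_32x5_tensorcore_mmasync"

@dataclass
class Dispatch:
    operation: MatmulOperation
    configuration: TuningConfiguration

    def name(self) -> str:
        # Dispatch name = operation name + tuning configuration name.
        return f"{self.operation.name()}_{self.configuration.name}"

op = MatmulOperation(3456, 1024, 2048, "f16t", "f16t", "f16t")
cfg = TuningConfiguration("tile_config_128x128_32x5_tensorcore_mmasync")
print(Dispatch(op, cfg).name())
# matmul_3456x1024x2048_f16t_f16t_f16t_tile_config_128x128_32x5_tensorcore_mmasync
```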
The IREE dispatch profiler provides `generator.py`, which can be used to generate dispatches. A sample run is shown below:
```shell
$ python3 dispatch_profiler/generator.py --generated-dir </path/to/create/`generated`/dir>
[Generating]: ./generated/linalg/matmul/matmul_128x128x256_f16t_f16t_f16t/matmul_128x128x256_f16t_f16t_f16t.mlir
Emitting tuning configuration : tile_config_128x128_64x4_tensorcore_mmasync
Emitting tuning configuration : tile_config_128x128_32x5_tensorcore_mmasync
Emitting tuning configuration : tile_config_128x64_32x5_tensorcore_mmasync
Emitting tuning configuration : tile_config_64x64_64x5_tensorcore_mmasync
Emitting tuning configuration : tile_config_64x64_32x10_tensorcore_mmasync
...
```
This creates a `generated` folder containing dispatches organized in folders as `mlir_dialect/operation_name/`. Each operation folder includes an `.mlir` file with all the dispatches for that operation.
The `generator.py` script generates dispatches for all implemented operations and data types using a predefined list of problem shapes. You can also provide specific matrix multiplication shapes of interest. Examples are provided below.
```shell
$ python3 ../iree/experimental/dispatch_profiler/generator.py --generated-dir </path/to/create/`generated`/dir> --problem-m=768 --problem-n=512 --problem-k=1024
...
[Generating]: ./generated/linalg/matmul/matmul_768x512x1024_f16t_f16t_f16t/matmul_768x512x1024_f16t_f16t_f16t.mlir
[Generating]: ./generated/linalg/matmul/matmul_768x512x1024_f32t_f32t_f32t/matmul_768x512x1024_f32t_f32t_f32t.mlir
...
```
The command below generates matmuls where M and N each range from 64 to 1024 in steps of 128, while K is fixed at 4096:
```shell
$ python3 ../iree/experimental/dispatch_profiler/generator.py --generated-dir </path/to/create/`generated`/dir> --problem-m=64:1024:128 --problem-n=64:1024:128 --problem-k=4096
...
```
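The `start:end:step` form sweeps a dimension over a range of values. Below is a minimal sketch of that expansion, assuming an exclusive end value; the exact parsing and inclusivity live in `generator.py` and may differ:

```python
def expand_problem_dim(spec: str) -> list[int]:
    """Expand '64:1024:128' into [64, 192, ..., 960] and '4096' into [4096].

    Illustrative only: assumes start:end:step with an exclusive end, which
    may differ from generator.py's exact semantics.
    """
    parts = [int(p) for p in spec.split(":")]
    if len(parts) == 1:
        return parts
    start, end, step = parts
    return list(range(start, end, step))

# All (M, N, K) combinations swept by the command above.
problems = [(m, n, k)
            for m in expand_problem_dim("64:1024:128")
            for n in expand_problem_dim("64:1024:128")
            for k in expand_problem_dim("4096")]
print(len(problems), problems[0], problems[-1])  # 64 (64, 64, 4096) (960, 960, 4096)
```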
The IREE dispatch profiler provides `compile.py`, which triggers `iree-compile` with the appropriate compilation flags. The resulting `.vmfb` files are placed alongside the source in the `mlir_dialect/operation_name/` folder. `compile.py` uses all available CPUs on your machine to compile the different generated MLIR source files in parallel.
```shell
$ python3 ../iree/experimental/dispatch_profiler/compile.py --build-dir </path/to/iree/build/dir> --generated-dir </path/to/create/`generated`/dir>
```
This compiles all the generated MLIR dispatches. One can check an operation's folder under `generated` to find the `.vmfb` files:
```shell
$ ls ./generated/linalg/matmul/matmul_64x64x4096_f16t_f16t_f16t/
iree_compile_cmd_stdout.mlir  matmul_64x64x4096_f16t_f16t_f16t.mlir  matmul_64x64x4096_f16t_f16t_f16t_profile.vmfb  matmul_64x64x4096_f16t_f16t_f16t_verify.vmfb
```
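The parallel compilation described above can be pictured as fanning `iree-compile` invocations out over a process pool, one worker per CPU. The sketch below is an assumption about the approach rather than `compile.py`'s actual implementation; the tool path and compilation flags are placeholders:

```python
import glob
import subprocess
from multiprocessing import Pool, cpu_count

IREE_COMPILE = "/path/to/iree/build/dir/tools/iree-compile"  # assumed tool location

def compile_dispatch(mlir_path: str) -> str:
    """Compile one generated .mlir source into a .vmfb placed next to it."""
    vmfb_path = mlir_path.replace(".mlir", "_profile.vmfb")
    # Placeholder flags: compile.py chooses the real target/backend flags.
    subprocess.run([IREE_COMPILE, mlir_path, "-o", vmfb_path], check=True)
    return vmfb_path

if __name__ == "__main__":
    # e.g. ./generated/linalg/matmul/matmul_64x64x4096_f16t_f16t_f16t/matmul_64x64x4096_f16t_f16t_f16t.mlir
    sources = glob.glob("./generated/*/*/*/matmul_*.mlir")
    with Pool(cpu_count()) as pool:  # one worker per available CPU
        for vmfb in pool.imap_unordered(compile_dispatch, sources):
            print("Compiled", vmfb)
```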
The tool provides the `profiler.py` script, which can be used to trigger both verification and profiling for all the compiled dispatches. Some example profiling command lines are shown below:
```shell
$ python3 profiler.py --build-dir </path/to/iree/build/dir> --generated-dir </path/to/create/`generated`/dir> --dispatches=matmul_3456x1024x2048_f16t_f16t_f16t_tile_config_128x128_32x5_tensorcore_mmasync --verification-enabled=true --profiling-enabled=true
----------------------------------------------------------------
Dispatch      : matmul_3456x1024x2048_f16t_f16t_f16t_tile_config_128x128_32x5_tensorcore_mmasync
Provider      : IREE Codegen
OpKind        : OperationKind.Matmul
Operation     : matmul_3456x1024x2048_f16t_f16t_f16t
Configuration : tile_config_128x128_32x5_tensorcore_mmasync
Arguments     : --batch_count=1 --m=3456 --n=1024 --k=2048 --lhs=f16t --rhs=f16t --result=f16t
                --split_k_mode=N/A --split_k_slices=N/A
Verification  : SUCCESS
Runtime(ms)   : 0.062
GFLOPs        : 233798.62
```
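The reported GFLOPs figure is consistent with the usual 2·M·N·K floating-point operation count for a matmul divided by the measured runtime, which makes for a quick sanity check of the report:

```python
batch_count, m, n, k = 1, 3456, 1024, 2048   # from the Arguments line above
runtime_ms = 0.062

flops = 2 * batch_count * m * n * k          # one multiply + one add per MAC
gflops = flops / (runtime_ms * 1e-3) / 1e9
print(f"{gflops:.2f}")                       # 233798.62, matching the report
```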
Verification, particularly for large matrix multiplications, can be time-consuming when using a CPU-based NumPy reference. When functional correctness is already assured and profiling speed is the priority, disable verification with `--verification-enabled=false`.
```shell
$ python3 profiler.py --build-dir </path/to/iree/build/dir> --generated-dir </path/to/create/`generated`/dir> --dispatches=matmul_3456x1024x2048_f16t_f16t_f16t_tile_config_128x128_32x5_tensorcore_mmasync --verification-enabled=false --profiling-enabled=true
```
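Conceptually, verification compares the dispatch output against a CPU-side NumPy reference, which is what makes it expensive for large shapes. Below is a minimal sketch of such a check; the tolerances and the f32 accumulation baseline are assumptions rather than the tool's exact recipe:

```python
import numpy as np

m, n, k = 3456, 1024, 2048
lhs = np.random.randn(m, k).astype(np.float16)
rhs = np.random.randn(k, n).astype(np.float16)

# CPU reference: accumulate in f32 for a stable baseline. For large shapes this
# matmul alone dominates the verification time.
reference = (lhs.astype(np.float32) @ rhs.astype(np.float32)).astype(np.float16)

dispatch_result = reference  # stand-in for the output read back from the compiled vmfb
ok = np.allclose(dispatch_result.astype(np.float32),
                 reference.astype(np.float32), rtol=1e-2, atol=1e-2)
print("Verification : SUCCESS" if ok else "Verification : FAILED")
```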
The `--dispatches` option accepts a comma-separated list of regex patterns and profiles all tuning configurations generated for an operation that match. The command-line argument is formatted as `--dispatches=<regex>,<regex>`. Additionally, you can export the profiled output to a CSV file for further analysis using `--output=<filepath>`.
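To preview which dispatches a given `--dispatches` value will select, the patterns can be tested against dispatch names with Python's `re` module; the matching shown here (`re.search` per pattern) is an assumption about the profiler's semantics, so treat it as an approximation:

```python
import re

patterns = ("matmul_3456x1024x2048_f16t_f16t_f16t_.*_tensorcore_mmasync,"
            "matmul_3456x1024x2048_f32t_f32t_f32t_tile_config_128x128_16x5_tensorcore_mmasync").split(",")

dispatch_names = [
    "matmul_3456x1024x2048_f16t_f16t_f16t_tile_config_128x128_32x5_tensorcore_mmasync",
    "matmul_3456x1024x2048_f32t_f32t_f32t_tile_config_128x128_16x5_tensorcore_mmasync",
    "matmul_128x128x256_f16t_f16t_f16t_tile_config_64x64_64x5_tensorcore_mmasync",
]

selected = [name for name in dispatch_names
            if any(re.search(p, name) for p in patterns)]
print(selected)  # the two 3456x1024x2048 dispatches; the 128x128x256 one is filtered out
```

The run below exercises exactly this kind of pattern, profiling every tensor-core tuning configuration generated for one matmul shape and writing the report to `data.csv`: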
```shell
$ python3 profiler.py --build-dir </path/to/iree/build/dir> --generated-dir </path/to/create/`generated`/dir> --dispatches=matmul_3456x1024x2048_f16t_f16t_f16t_*_tensorcore_mmasync --verification-enabled=false --profiling-enabled=true --output=data.csv
----------------------------------------------------------------
Dispatch      : matmul_3456x1024x2048_f16t_f16t_f16t_tile_config_128x256_32x3_tensorcore_mmasync
Provider      : IREE Codegen
OpKind        : OperationKind.Matmul
Operation     : matmul_3456x1024x2048_f16t_f16t_f16t
Configuration : tile_config_128x256_32x3_tensorcore_mmasync
Arguments     : --batch_count=1 --m=3456 --n=1024 --k=2048 --lhs=f16t --rhs=f16t --result=f16t
                --split_k_mode=N/A --split_k_slices=N/A
Verification  : Not verified
Runtime(ms)   : 0.062
GFLOPs        : 233798.62
----------------------------------------------------------------
Dispatch      : matmul_3456x1024x2048_f16t_f16t_f16t_tile_config_128x128_64x4_tensorcore_mmasync
Provider      : IREE Codegen
OpKind        : OperationKind.Matmul
Operation     : matmul_3456x1024x2048_f16t_f16t_f16t
Configuration : tile_config_128x128_64x4_tensorcore_mmasync
Arguments     : --batch_count=1 --m=3456 --n=1024 --k=2048 --lhs=f16t --rhs=f16t --result=f16t
                --split_k_mode=N/A --split_k_slices=N/A
Verification  : Not verified
Runtime(ms)   : 0.064
GFLOPs        : 226492.42
----------------------------------------------------------------
...
----------------------------------------------------------------
Dispatch      : matmul_3456x1024x2048_f16t_f16t_f16t_tile_config_64x64_32x10_tensorcore_mmasync
Provider      : IREE Codegen
OpKind        : OperationKind.Matmul
Operation     : matmul_3456x1024x2048_f16t_f16t_f16t
Configuration : tile_config_64x64_32x10_tensorcore_mmasync
Arguments     : --batch_count=1 --m=3456 --n=1024 --k=2048 --lhs=f16t --rhs=f16t --result=f16t
                --split_k_mode=N/A --split_k_slices=N/A
Verification  : Not verified
Runtime(ms)   : 0.103
GFLOPs        : 140733.15
Writing performance report to data.csv
```
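The resulting CSV can then be post-processed with ordinary tooling, for example to rank tuning configurations by throughput. The column names below are assumptions about the report's header, so adjust them to whatever `data.csv` actually contains:

```python
import csv

with open("data.csv") as f:
    rows = list(csv.DictReader(f))

# Assumed column names; inspect the CSV header and adjust accordingly.
rows.sort(key=lambda r: float(r["GFLOPs"]), reverse=True)
for row in rows[:3]:
    print(row["Configuration"], row["Runtime(ms)"], row["GFLOPs"])
```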
Another example showcases the use of `--dispatches` to profile `matmul_3456x1024x2048` targeting F16 and F32 NVIDIA A100 Tensor Cores:
```shell
$ python3 profiler.py --build-dir </path/to/iree/build/dir> --generated-dir </path/to/create/`generated`/dir> --dispatches=matmul_3456x1024x2048_f16t_f16t_f16t_tile_config_128x128_32x5_tensorcore_mmasync,matmul_3456x1024x2048_f32t_f32t_f32t_tile_config_128x128_16x5_tensorcore_mmasync
----------------------------------------------------------------
Dispatch      : matmul_3456x1024x2048_f16t_f16t_f16t_tile_config_128x128_32x5_tensorcore_mmasync
Provider      : IREE Codegen
OpKind        : OperationKind.Matmul
Operation     : matmul_3456x1024x2048_f16t_f16t_f16t
Configuration : tile_config_128x128_32x5_tensorcore_mmasync
Arguments     : --batch_count=1 --m=3456 --n=1024 --k=2048 --lhs=f16t --rhs=f16t --result=f16t
                --split_k_mode=N/A --split_k_slices=N/A
Verification  : SUCCESS
Runtime(ms)   : 0.062
GFLOPs        : 233798.62
----------------------------------------------------------------
Dispatch      : matmul_3456x1024x2048_f32t_f32t_f32t_tile_config_128x128_16x5_tensorcore_mmasync
Provider      : IREE Codegen
OpKind        : OperationKind.Matmul
Operation     : matmul_3456x1024x2048_f32t_f32t_f32t
Configuration : tile_config_128x128_16x5_tensorcore_mmasync
Arguments     : --batch_count=1 --m=3456 --n=1024 --k=2048 --lhs=f32t --rhs=f32t --result=f32t
                --split_k_mode=N/A --split_k_slices=N/A
Verification  : SUCCESS
Runtime(ms)   : 0.122
GFLOPs        : 118815.69
----------------------------------------------------------------
```
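From these two runs, the F16 tensor-core dispatch is roughly twice as fast as its F32 counterpart for the same problem shape; the ratio follows directly from the reported numbers:

```python
f16_runtime_ms, f32_runtime_ms = 0.062, 0.122
f16_gflops, f32_gflops = 233798.62, 118815.69

print(f"F16 speedup over F32 : {f32_runtime_ms / f16_runtime_ms:.2f}x")  # ~1.97x
print(f"Throughput ratio     : {f16_gflops / f32_gflops:.2f}x")          # ~1.97x
```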