Antares is an automatic engine that generates optimized multi-platform kernels for DNN developers (targeting backends like CUDA/ROCm/CPU/DirectX12/Graphcore/OneAPI/..). It is also a framework for hardware developers to extend new backends/hardware quickly and easily. Antares provides an IR that follows "One Language Syntax for All Platforms", and general-purpose device access APIs that hide the differences of not only DNN description but also device mapping.
- Backend Extension
- Effective Auto Tuning
- Einsum-based Antares IR
- Framework JIT Extension (Op Maker Plugin for Pytorch/Tensorflow/Tensorflow2)
- Scenario-1: Quick Start for Developers that Use Antares to Tune Operator/Sub-graph in Foreground Terminal
- Scenario-2: Quick Start for Developers that Use Antares to Extend Operator/Sub-graph in Pytorch/Tensorflow
Antares Pre-dependencies for Different Backends
- Linux-based: cuda, rocm, mcpu, scpu, gc, sycl_intel, sycl_cuda, ocl_amdgpu, ocl_nvidia, ocl_android, ..
- Windows-based: cuda_win64, rocm_win64, hlsl_win64, ..
The current version of Antares supports code generation for the following backends (in orange blocks) and devices (in black blocks):
Auto tuning by Antares not only takes much less tuning time, but also delivers equivalent or better performance for Intra-op/Inter-op execution (compared with TVM Ansor).
- Antares IR is the frontend of both kernel generation and automatic optimization.
- The syntax of Antares IR is slim, yet able to describe most MLP/CNN/RNN/LSTM/Transformer based models like MNIST/ResNet/BERT/GPT/..
E.g. The following computation logic describes a layer of standard BERT transformer:
merged_layer_local[R, B, S1, N1, H1] +=! input_tensor[B, S1, N, H] * qkv_weight[R, N, H, N1, H1];
merged_layer_trans[R, B, N1, S1, H1] = merged_layer_local[R, B, S1, N1, H1] + qkv_bias[R, N1, H1];
attention_scores[B, N1, S1, S2] +=! merged_layer_trans[0, B, N1, S1, H1] * merged_layer_trans[1, B, N1, S2, H1] / const({H}).cast(`float32`);
softmax_1_temp0[B, N1] >=! attention_scores[B, N1, S1, S2];
softmax_1_temp1[B, N1] +=! (attention_scores[B, N1, S1, S2] - softmax_1_temp0[B, N1]).call(`exp`);
attention_probs[B, N1, S1, S2] = (attention_scores[B, N1, S1, S2] - softmax_1_temp0[B, N1]).call(`exp`) / softmax_1_temp1[B, N1];
... ...
layer_norm_2_src[B, S1, N2, H2] = layer_output[B, S1, N2, H2] + attention_output_norm[B, S1, N2, H2];
layer_norm_2_temp0[B, S1] += layer_norm_2_src[B, S1, N2, H2];
layer_norm_2_temp1[B, S1] += layer_norm_2_src[B, S1, N2, H2] * layer_norm_2_src[B, S1, N2, H2];
layer_output_norm[B, S1, N2, H2] = (layer_norm_2_src[B, S1, N2, H2] * {N * H} - layer_norm_2_temp0[B, S1]) * (layer_norm_2_temp0[B, S1] * {N * H} - layer_norm_2_temp1[B, S1] * layer_norm_2_temp1[B, S1]).call(`max`, [1e-8]).call(`rsqrt`);
For more IR usage or examples, please follow documentation here: Antares IR & Examples
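As a rough illustration of how such statements can be read, the pure-Python sketch below spells out two reductions as explicit loops. It assumes that `+=!` sums over the indices that do not appear on the left-hand side (starting from zero) and that `>=!` takes the maximum over them, which is how the softmax lines above appear to use these operators. The function names are hypothetical and not part of Antares.

```python
def sum_reduce_matmul(data, weight):
    """Loop form of: dot_0[N, M] +=! data[N, K] * weight[K, M]

    Assumption: `+=!` zero-initializes the output and sums over K,
    the index that is missing from the left-hand side.
    """
    n_size, k_size, m_size = len(data), len(weight), len(weight[0])
    out = [[0.0] * m_size for _ in range(n_size)]
    for n in range(n_size):
        for m in range(m_size):
            for k in range(k_size):
                out[n][m] += data[n][k] * weight[k][m]
    return out


def max_reduce_rows(scores):
    """Loop form of: row_max[N] >=! scores[N, K]

    Assumption: `>=!` keeps the maximum over the reduced index K.
    """
    result = []
    for row in scores:
        best = row[0]
        for value in row[1:]:
            if value > best:
                best = value
        result.append(best)
    return result


print(sum_reduce_matmul([[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]))
print(max_reduce_rows([[1.0, 5.0, 2.0], [7.0, 0.0, 3.0]]))
```

These loops are only a reference for the semantics; the kernels that Antares generates and tunes are not expected to look like this.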
Antares provides a JIT plugin for Pytorch/Tensorflow/Tensorflow2 that helps these frameworks easily extend new operators, e.g.:
# Tensorflow/Tensorflow2 Example:
op = antares.make_op(ir='dot_0[N, M] +=! data[N, K] * weight[K, M]', feed_dict={'data': x, 'weight': y}).emit()
result_1 = sess.run(op)
print('The custom result_1 is:\n%s' % result_1)
result_2 = sess.run(tf.add(op, op))
print('The custom result_2 is:\n%s' % result_2)
# Pytorch Example:
custom_op = CustomOp(ir='dot_0[N, M] +=! data[N, K] * weight[K, M]', feed_dict={'data': x, 'weight': y}).to(device, dtype).emit()
result = custom_op()
print('The custom result is:', result)
For complete programs, please follow examples here: Antares Examples for Pytorch and Antares Examples for TF/TF2
Scenario-1: Quick Start for Developers that Use Antares to Tune Operator/Sub-graph in Foreground Terminal:
- Step-1: Prepare Environment
sudo apt install docker.io
git clone https://github.com/microsoft/antares --branch v0.2.x
cd antares/
# Set the backend type for environment variable `BACKEND` to build the corresponding environment:
echo 'c-cuda' > backend.default
# Build the environment for this backend: (if this step fails, please go to the "Pre-dependencies" section to check which "backend-related dependencies" are missing)
make
All valid backends are listed in directory antares/backends
- Step-2: Tune a Specific Workload in Foreground
# Example-1: Run the following command in bash to tune MatMul (4096, 4096) x (4096, 4096) using 2000 trials:
COMMIT=force STEP=2000 COMPUTE_V1='- S = 4096; einstein_v2(input_dict={"input0": {"dtype": "float32", "shape": [S, S]}, "input1": {"dtype": "float32", "shape": [S, S]}}, exprss="output0[N, M] +=! input0[N, K] * input1[K, M]")' make
# Example-2: Run the following command in bash to tune MNIST-inference using 5000 trials:
COMMIT=force STEP=5000 COMPUTE_V1='- einstein_v2(input_dict={"data": {"dtype": "float32", "shape": [64, 784]}, "weight_0": {"dtype": "float32", "shape": [784, 512]}, "weight_1": {"dtype": "float32", "shape": [512, 512]}, "weight_2": {"dtype": "float32", "shape": [512, 10]}, "bias_0": {"dtype": "float32", "shape": [512]}, "bias_1": {"dtype": "float32", "shape": [512]}, "bias_2": {"dtype": "float32", "shape": [10]}}, extra_outputs=[], exprss="data_0[N, M] +=! data[N, K] * weight_0[K, M]; data_1[N, K] = (data_0[N, K] + bias_0[K]).call(`max`, [0.0]); data_2[N, M] +=! data_1[N, K] * weight_1[K, M]; data_3[N, K] = (data_2[N, K] + bias_1[K]).call(`max`, [0.0]); data_4[N, M] +=! data_3[N, K] * weight_2[K, M]; data_5[N, K] = (data_4[N, K] + bias_2[K]);")' make
Apart from detailed reporting logs during the tuning procedure, the best kernel record will be saved to directory antares/codehub. If you don't want to create/overwrite an existing kernel record in codehub, remove environment variable COMMIT=force from the tuning command.
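Because the `COMPUTE_V1` strings in the tuning commands above are long and easy to mistype, it may help to compose them in a script. The following is a minimal sketch, not part of Antares: `build_compute_v1` and `tuning_env` are hypothetical helper names, and the sketch writes concrete shapes instead of the `S = 4096;` shorthand used in Example-1.

```python
import json
import os


def build_compute_v1(inputs, expression):
    """Compose a COMPUTE_V1 string shaped like the tuning commands above.

    `inputs` maps a tensor name to a (dtype, shape) pair. This helper is a
    hypothetical convenience for illustration, not an Antares API.
    """
    input_dict = {
        name: {"dtype": dtype, "shape": list(shape)}
        for name, (dtype, shape) in inputs.items()
    }
    return '- einstein_v2(input_dict=%s, exprss="%s")' % (
        json.dumps(input_dict),
        expression,
    )


def tuning_env(compute_v1, steps, commit=True):
    """Return a copy of the environment with the tuning variables set."""
    env = dict(os.environ)
    env["COMPUTE_V1"] = compute_v1
    env["STEP"] = str(steps)
    if commit:
        env["COMMIT"] = "force"  # save/overwrite the record in codehub
    else:
        env.pop("COMMIT", None)  # leave existing codehub records untouched
    return env


matmul = build_compute_v1(
    {"input0": ("float32", (4096, 4096)), "input1": ("float32", (4096, 4096))},
    "output0[N, M] +=! input0[N, K] * input1[K, M]",
)
print(matmul)

# To launch tuning from the antares root directory (not executed here):
#   import subprocess
#   subprocess.run(["make"], env=tuning_env(matmul, steps=2000), check=True)
```

Passing `commit=False` corresponds to dropping `COMMIT=force` from the command line.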
Scenario-2: Quick Start for Developers that Use Antares to Extend Operator/Sub-graph in Pytorch/Tensorflow (only for CUDA & ROCm backend currently):
- Step-1: Prepare Environment
You need to follow Step-1 from Scenario-1 to finish environment preparation beforehand. This prevents many environment issues when moving on to the next step.
- Step-2: Set up Background Codegen Service
make rest-server
By default, it listens on TCP port 8880. The purpose of this service is to avoid bringing heavy backend-related dependencies into Pytorch/Tensorflow, which keeps the JIT plugin lightweight.
- Step-3: Set up a corresponding TF/TF2/Pytorch version that matches your CUDA/ROCm driver version. (If you have already installed TF/TF2/Pytorch, just skip this step)
Here we provide several prebuilt package sources that match different environment requirements:
For Tensorflow 1.x & 2.x: Recommended Packages (tested in Ubuntu 20.04):
# Tensorflow-1 for NVIDIA CUDA 10.0:
python3 -m pip install --upgrade pip && python3 -m pip install tensorflow-gpu==1.15.4
# Tensorflow-1 for NVIDIA CUDA 11.0:
python3 -m pip install --upgrade pip && python3 -m pip install https://github.com/ghostplant/tensorflow-wheel-collections/releases/download/cuda-11/tensorflow_gpu-1.15.4_cuda11+nv-cp38-cp38-linux_x86_64.whl
# Tensorflow-1 for AMD ROCm 4.0:
python3 -m pip install tensorflow-rocm==1.15.9
# Tensorflow-2 for NVIDIA CUDA 11.0:
python3 -m pip install --upgrade pip && python3 -m pip install tensorflow-gpu==2.4.0
# Tensorflow-2 for AMD ROCm 4.0:
python3 -m pip install tensorflow-rocm==2.4.0
For Pytorch 1.x: Recommended Packages (tested in Ubuntu 20.04):
# Pytorch for NVIDIA CUDA 10.0:
python3 -m pip install torch==1.5.0 torchvision==0.6.0 -f https://download.pytorch.org/whl/torch_stable.html
# Pytorch for NVIDIA CUDA 11.0:
python3 -m pip install torch===1.7.1+cu110 torchvision===0.8.2+cu110 torchaudio===0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
# Pytorch for AMD ROCm 4.0:
python3 -m pip install torch torchvision -f https://download.pytorch.org/whl/rocm4.0.1/torch_stable.html
- Step-4: Install JIT Plugin Client and Run Examples
# Set up JIT Plugin for Pytorch:
sudo python3 ./frameworks/pytorch/setup.py
# Set up JIT Plugin for Tensorflow/Tensorflow2:
sudo python3 ./frameworks/tensorflow/setup.py
# Test Examples for Pytorch:
cd ./frameworks/pytorch/examples
./1_hello_world.py
# Test Examples for Tensorflow:
cd ./frameworks/tensorflow/examples
./1_hello_world.py
More examples here: Antares Examples for Pytorch and Antares Examples for TF/TF2
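If a framework example cannot reach the codegen service from Step-2, a quick first check is whether anything is listening on the default port 8880. The sketch below uses only the Python standard library. The host `127.0.0.1` is an assumption (service and plugin on the same machine), and `is_port_open` is a hypothetical helper, not part of the Antares plugin.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Port 8880 is the default named in Step-2; the host is an assumption.
if is_port_open("127.0.0.1", 8880):
    print("codegen service reachable on port 8880")
else:
    print("nothing is listening on port 8880; try `make rest-server` first")
```

A successful connection only shows that some process accepts TCP connections on that port; it does not verify that the process is the Antares service.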
Before running the make command in the antares root directory, you need to ensure the corresponding backend driver is installed correctly.
- Predependencies for backend c-cuda, c-sycl_cuda:
Requirement: Ubuntu >= 18.04
Requirement: Install NVIDIA CUDA toolkit (>= 10.0) on Host OS
Requirement: docker
- Predependencies for backend c-ocl_nvidia:
Requirement: Ubuntu >= 18.04
Requirement: Install NVIDIA CUDA toolkit (>= 10.0) on Host OS
Requirement: run bash command "make install_host" in antares root directory beforehand
- Predependencies for backend c-ocl_android:
Requirement: Ubuntu >= 18.04
Requirement: Install package "adb", connect to a rooted Android device and ensure command "adb shell su -c 'ls /sdcard'" works
Requirement: run bash command "make install_host" in antares root directory beforehand
- Predependencies for backend c-rocm, c-ocl_amdgpu:
Requirement: Ubuntu >= 18.04
Requirement: Install AMD ROCm (>= 4.0) packages "rock-dkms" & "rock-dkms-firmware" from repo http://repo.radeon.com/rocm/apt/debian on Host OS
Requirement: docker
- Predependencies for backend c-gc:
Requirement: Ubuntu >= 18.04
Requirement: Install Poplar SDK on Host OS, ensure "popc" command exists in system PATH
Requirement: run bash command "make install_host" in antares root directory beforehand
- Predependencies for backend c-scpu, c-mcpu, c-sycl_intel:
Requirement: Ubuntu >= 18.04
Requirement: docker
- Predependencies for backend c-hlsl_win64, c-hlsl_xbox:
Requirement: Windows 10 64 bit (>= 2004), run "dxdiag.exe" to ensure Direct3D 12.0 Acceleration is enabled
Requirement: Windows Subsystem Linux 1.0
How to Install WSL 1.0
Requirement: Git clone the antares repo inside the WSL environment, and the path of the antares directory should be **visible to Windows** (e.g. "/../c/Users/me/Desktop/antares" would be OK, but "/home/me/antares" won't).
Requirement: run bash command "make install_host" in antares root directory beforehand
- Predependencies for backend c-rocm_win64:
Requirement: Windows 10 64 bit (>= 2004)
Requirement: Windows Subsystem Linux 1.0
How to Install WSL 1.0
Requirement: Install Official AMD GPU driver (release version >= 2020.11). Ensure C:\Windows\System32\amdhip64.dll exists after installation.
Requirement: Git clone the antares repo inside the WSL environment, and the path of the antares directory should be **visible to Windows** (e.g. "/../c/Users/me/Desktop/antares" would be OK, but "/home/me/antares" won't).
Requirement: run bash command "make install_host" in antares root directory beforehand
- Predependencies for backend c-cuda_win64:
Requirement: Windows 10 64 bit (>= 2004)
Requirement: Windows Subsystem Linux 1.0
How to Install WSL 1.0
Requirement: Install Official NVIDIA CUDA driver (>= 10.0). Ensure C:\Windows\System32\nvcuda.dll exists after installation.
Requirement: Git clone the antares repo inside the WSL environment, and the path of the antares directory should be **visible to Windows** (e.g. "/../c/Users/me/Desktop/antares" would be OK, but "/home/me/antares" won't).
Requirement: run bash command "make install_host" in antares root directory beforehand
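Some of the Linux requirements above come down to a command being available on the host (docker, adb, popc). The following is a small sketch of a pre-flight check under that reading. The `REQUIRED_TOOLS` table and the `missing_tools` helper are hypothetical, derived only from the list above. The sketch does not check OS versions, drivers, SDK versions, or whether "make install_host" has been run.

```python
import shutil

# Command-line tools named in the Linux requirements above, keyed by backend.
# This table is an illustration built from that list, not an Antares file.
REQUIRED_TOOLS = {
    "c-cuda": ["docker"],
    "c-sycl_cuda": ["docker"],
    "c-rocm": ["docker"],
    "c-ocl_amdgpu": ["docker"],
    "c-scpu": ["docker"],
    "c-mcpu": ["docker"],
    "c-sycl_intel": ["docker"],
    "c-ocl_android": ["adb"],
    "c-gc": ["popc"],
}


def missing_tools(backend, which=shutil.which):
    """List the required commands for `backend` that are not found on PATH.

    Backends without an entry in REQUIRED_TOOLS yield an empty list.
    """
    return [
        tool for tool in REQUIRED_TOOLS.get(backend, []) if which(tool) is None
    ]


backend = "c-gc"
print(backend, "missing:", missing_tools(backend))
```

An empty result only means the listed commands were found; the remaining requirements still need to be checked by hand.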
| | HIP-C(c-rocm/c-rocm_win64) | CUDA(c-cuda/c-cuda_win64) | CPU(c-mcpu/c-scpu) | DirectX12(c-hlsl_win64) | Graphcore(c-gc) | Intel OneAPI(c-sycl_intel) | Codeplay DPCPP (c-sycl_cuda) |
|---|---|---|---|---|---|---|---|
| Deploy Environment | Linux/WSL1 | Linux | Linux | WSL1 | Linux | Linux | |
| Target Device | AMDGPU | NVGPU | Generic CPU | Generic Graphic Card | IPU Device | Intel CPU/HD Graphic/FPGA | NVGPU |
| Global schedules | Y | Y | Y | Y | Y | Y | Y |
| Local schedules | Y | Y | Y | Y | Y | Y | |
| Head fusion | Y | Y | Y | Y | Y | Y | Y |
| Tail fusion | Y | Y | Y | Y | | | |
| Evaluator | Y | Y | Y | Y | Y | Y | Y |
| Tensorflow Plugin | Y | Y | | | | | |
| Pytorch Plugin | Y | Y | | | | | |
| Multi Kernel Eval | Y | Y | Y | Y | Y | Y | |
For more information about Microsoft Open Source Policy, please see Microsoft Open Source Code of Conduct