forked from microsoft/antares

Antares: an automatic engine for multi-platform kernel generation and optimization. Supporting CPU, CUDA, ROCm, DirectX12, GraphCore, SYCL, Android GPU backends.






What is Antares:

Antares is an automatic engine that generates optimized multi-platform kernels for DNN developers (targeting backends such as CUDA/ROCm/CPU/DirectX12/Graphcore/OneAPI/..). It is also a framework that lets hardware developers add new backends/hardware quickly and easily. Antares provides an IR that follows "One Language Syntax for All Platforms", along with general-purpose device access APIs that hide differences in both DNN description and device mapping.

  1. Features

    • Backend Extension
    • Effective Auto Tuning
    • Einsum-based Antares IR
    • Framework JIT Extension (Op Maker Plugin for Pytorch/Tensorflow/Tensorflow2)
  2. How to Use Antares

    • Scenario-1: Quick Start for Developers that Use Antares to Tune Operator/Sub-graph in Foreground Terminal
    • Scenario-2: Quick Start for Developers that Use Antares to Extend Operator/Sub-graph in Pytorch/Tensorflow
  3. Antares Pre-dependencies for Different Backends

    • Linux-based: cuda, rocm, mcpu, scpu, gc, sycl_intel, sycl_cuda, ocl_amdgpu, ocl_nvidia, ocl_android, ..
    • Windows-based: cuda_win64, rocm_win64, hlsl_win64, ..
  4. About Microsoft Open Source

About Antares Features:

a. Backend Extension

The current version of Antares supports code generation for the following backends (in orange blocks) and devices (in black blocks):

b. Effective Auto Tuning

Compared with TVM Ansor, auto tuning in Antares takes much less tuning time while delivering equivalent or better performance for intra-op/inter-op execution.

c. Einsum-based Antares IR

  • Antares IR is the frontend of both kernel generation and automatic optimization.
  • The syntax of Antares IR is concise, yet it can describe most MLP/CNN/RNN/LSTM/Transformer-based models such as MNIST/ResNet/BERT/GPT/..

  E.g. The following computation logic describes a layer of standard BERT transformer:

  merged_layer_local[R, B, S1, N1, H1] +=! input_tensor[B, S1, N, H] * qkv_weight[R, N, H, N1, H1];
  merged_layer_trans[R, B, N1, S1, H1] = merged_layer_local[R, B, S1, N1, H1] + qkv_bias[R, N1, H1];
  attention_scores[B, N1, S1, S2] +=! merged_layer_trans[0, B, N1, S1, H1] * merged_layer_trans[1, B, N1, S2, H1] / const({H}).cast(`float32`);
    softmax_1_temp0[B, N1] >=! attention_scores[B, N1, S1, S2];
    softmax_1_temp1[B, N1] +=! (attention_scores[B, N1, S1, S2] - softmax_1_temp0[B, N1]).call(`exp`);
  attention_probs[B, N1, S1, S2] = (attention_scores[B, N1, S1, S2] - softmax_1_temp0[B, N1]).call(`exp`) / softmax_1_temp1[B, N1];
  ... ...
  layer_norm_2_src[B, S1, N2, H2] = layer_output[B, S1, N2, H2] + attention_output_norm[B, S1, N2, H2];
    layer_norm_2_temp0[B, S1] += layer_norm_2_src[B, S1, N2, H2];
    layer_norm_2_temp1[B, S1] += layer_norm_2_src[B, S1, N2, H2] * layer_norm_2_src[B, S1, N2, H2];
  layer_output_norm[B, S1, N2, H2] = (layer_norm_2_src[B, S1, N2, H2] * {N * H} - layer_norm_2_temp0[B, S1]) * (layer_norm_2_temp0[B, S1] * {N * H} - layer_norm_2_temp1[B, S1] * layer_norm_2_temp1[B, S1]).call(`max`, [1e-8]).call(`rsqrt`);

For more IR usage or examples, please follow documentation here: Antares IR & Examples
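
To make the reduction notation concrete, here is a minimal pure-Python sketch of what two such statements compute. It assumes that `+=!` sums over every index that appears only on the right-hand side (here `K`), that `>=!` takes the maximum over such indices, and that `!` means the output is initialized first. The function names are hypothetical and are not part of Antares.

```python
# Reference semantics only; these helpers are hypothetical and not Antares APIs.

def matmul_sum_reduce(data, weight):
    """dot_0[N, M] +=! data[N, K] * weight[K, M]  (assumed: sum over K)."""
    n, k, m = len(data), len(weight), len(weight[0])
    # '!' is assumed to mean "initialize the output before accumulating".
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for r in range(k):
                out[i][j] += data[i][r] * weight[r][j]
    return out


def row_max_reduce(scores):
    """temp[N] >=! scores[N, K]  (assumed: max over K)."""
    return [max(row) for row in scores]


if __name__ == "__main__":
    data = [[1.0, 2.0], [3.0, 4.0]]
    weight = [[5.0, 6.0], [7.0, 8.0]]
    print(matmul_sum_reduce(data, weight))  # [[19.0, 22.0], [43.0, 50.0]]
    print(row_max_reduce(data))             # [2.0, 4.0]
```

A reference like this can be handy for checking the numerical output of a generated kernel on small inputs.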

d. Pytorch/Tensorflow/Tensorflow2 Op Maker (JIT Plugin)

  Antares provides a JIT plugin for Pytorch/Tensorflow/Tensorflow2 that helps these frameworks extend new operators easily, e.g.:

# Tensorflow/Tensorflow2 Example:
op = antares.make_op(ir='dot_0[N, M] +=! data[N, K] * weight[K, M]', feed_dict={'data': x, 'weight': y}).emit()
result_1 = sess.run(op)  # assumes an active TF1-style session named `sess`
print('The custom result_1 is:\n%s' % result_1)
result_2 = sess.run(tf.add(op, op))
print('The custom result_2 is:\n%s' % result_2)

# Pytorch Example:
custom_op = CustomOp(ir='dot_0[N, M] +=! data[N, K] * weight[K, M]', feed_dict={'data': x, 'weight': y}).to(device, dtype).emit()
result = custom_op()
print('The custom result is:', result)

For complete programs, please follow examples here: Antares Examples for Pytorch and Antares Examples for TF/TF2

How to Use Antares?

Scenario-1: Quick Start for Developers that Use Antares to Tune Operator/Sub-graph in Foreground Terminal:

  • Step-1: Prepare Environment
sudo apt install
git clone --branch v0.2.x
cd antares/

# Write the backend type (the value of environment variable `BACKEND`) to backend.default to build the corresponding environment:
echo 'c-cuda' > backend.default

# Build the environment for this backend: (if this step fails, go to the "Pre-dependencies" section to check which backend-related dependencies are missing)

  All valid backends are listed in directory antares/backends

  • Step-2: Tune a Specific Workload in Foreground
# Example-1: Run the following command in bash to tune MatMul (4096, 4096) x (4096, 4096) using 2000 trials:
COMMIT=force STEP=2000 COMPUTE_V1='- S = 4096; einstein_v2(input_dict={"input0": {"dtype": "float32", "shape": [S, S]}, "input1": {"dtype": "float32", "shape": [S, S]}}, exprss="output0[N, M] +=! input0[N, K] * input1[K, M]")' make

# Example-2: Run the following command in bash to tune MNIST-inference using 5000 trials:
COMMIT=force STEP=5000 COMPUTE_V1='- einstein_v2(input_dict={"data": {"dtype": "float32", "shape": [64, 784]}, "weight_0": {"dtype": "float32", "shape": [784, 512]}, "weight_1": {"dtype": "float32", "shape": [512, 512]}, "weight_2": {"dtype": "float32", "shape": [512, 10]}, "bias_0": {"dtype": "float32", "shape": [512]}, "bias_1": {"dtype": "float32", "shape": [512]}, "bias_2": {"dtype": "float32", "shape": [10]}}, extra_outputs=[], exprss="data_0[N, M] +=!  data[N, K] * weight_0[K, M];   data_1[N, K] =   (data_0[N, K] + bias_0[K]).call(`max`, [0.0]);   data_2[N, M] +=!  data_1[N, K] * weight_1[K, M];   data_3[N, K] =   (data_2[N, K] + bias_1[K]).call(`max`, [0.0]);   data_4[N, M] +=!  data_3[N, K] * weight_2[K, M];   data_5[N, K] =   (data_4[N, K] + bias_2[K]);")' make

  In addition to the detailed logs reported during tuning, the best kernel record is saved to directory antares/codehub. If you don't want to create or overwrite an existing kernel record in codehub, remove the environment variable COMMIT=force from the tuning command.
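
The tuning commands above pack the whole workload into the `COMPUTE_V1` string, which is easy to mistype by hand. As a rough sketch, the string and the environment variables can be assembled in Python instead. The helper names `build_compute_v1` and `tuning_env` are hypothetical and not part of Antares; only the `einstein_v2(input_dict=..., exprss=...)` format and the `COMMIT`/`STEP`/`COMPUTE_V1` variables come from the examples above.

```python
import json


def build_compute_v1(input_dict, exprss):
    """Format an einstein_v2(...) string in the style of the examples above.

    Hypothetical helper: json.dumps emits the same double-quoted
    dict/string syntax that the hand-written commands use.
    """
    return "- einstein_v2(input_dict=%s, exprss=%s)" % (
        json.dumps(input_dict),
        json.dumps(exprss),
    )


def tuning_env(compute_v1, steps, commit=False):
    """Collect the environment variables used by the tuning command."""
    env = {"STEP": str(steps), "COMPUTE_V1": compute_v1}
    if commit:
        # COMMIT=force creates/overwrites the kernel record in codehub.
        env["COMMIT"] = "force"
    return env


if __name__ == "__main__":
    size = 4096
    spec = {"dtype": "float32", "shape": [size, size]}
    compute = build_compute_v1(
        {"input0": spec, "input1": spec},
        "output0[N, M] +=! input0[N, K] * input1[K, M]",
    )
    env = tuning_env(compute, steps=2000, commit=True)
    print(env["COMPUTE_V1"])
    # From the antares root directory, tuning could then be launched with:
    #   subprocess.run(["make"], env={**os.environ, **env}, check=True)
```

Leaving `commit` at its default of `False` omits `COMMIT=force`, so an existing record in codehub is not overwritten.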

Scenario-2: Quick Start for Developers that Use Antares to Extend Operator/Sub-graph in Pytorch/Tensorflow (currently only for the CUDA & ROCm backends):

  • Step-1: Prepare Environment

    Follow Step-1 from Scenario-1 to finish environment preparation beforehand. This avoids many environment issues in the following steps.

  • Step-2: Set up Background Codegen Service

    make rest-server

    By default, it listens on TCP port 8880. The purpose of this service is to keep heavy backend-related dependencies out of Pytorch/Tensorflow, which keeps the JIT plugin lightweight.

  • Step-3: Set up a TF/TF2/Pytorch version that matches your CUDA/ROCm driver version. (If you have already installed TF/TF2/Pytorch, skip this step)

    Here we provide several prebuilt package sources that match different environment requirements:

      For Tensorflow 1.x & 2.x: Recommended Packages (tested in Ubuntu 20.04):
      #   Tensorflow-1 for NVIDIA CUDA 10.0:
      python3 -m pip install --upgrade pip && python3 -m pip install tensorflow-gpu==1.15.4
      #   Tensorflow-1 for NVIDIA CUDA 11.0:
      python3 -m pip install --upgrade pip && python3 -m pip install
      #   Tensorflow-1 for AMD ROCm 4.0:
      python3 -m pip install tensorflow-rocm==1.15.9
      #   Tensorflow-2 for NVIDIA CUDA 11.0:
      python3 -m pip install --upgrade pip && python3 -m pip install tensorflow-gpu==2.4.0
      #   Tensorflow-2 for AMD ROCm 4.0:
      python3 -m pip install tensorflow-rocm==2.4.0
      For Pytorch 1.x: Recommended Packages (tested in Ubuntu 20.04):
      #   Pytorch for NVIDIA CUDA 10.0:
      python3 -m pip install torch==1.5.0 torchvision==0.6.0 -f
      #   Pytorch for NVIDIA CUDA 11.0:
      python3 -m pip install torch===1.7.1+cu110 torchvision===0.8.2+cu110 torchaudio===0.7.2 -f
      #   Pytorch for AMD ROCm 4.0:
      python3 -m pip install torch torchvision -f
  • Step-4: Install JIT Plugin Client and Run Examples

    # Set up JIT Plugin for Pytorch:
    sudo python3 ./frameworks/pytorch/
    # Set up JIT Plugin for Tensorflow/Tensorflow2:
    sudo python3 ./frameworks/tensorflow/
    # Test Examples for Pytorch:
    cd ./frameworks/pytorch/examples
    # Test Examples for Tensorflow:
    cd ./frameworks/tensorflow/examples

  More examples here: Antares Examples for Pytorch and Antares Examples for TF/TF2
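
Because the JIT plugin depends on the background codegen service from Step-2, it can help to confirm that the service is reachable before running the examples. The following is a minimal sketch using only the Python standard library. Port 8880 is the default stated in Step-2, while the host name `localhost` and the helper name `is_port_open` are assumptions for illustration.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    # 8880 is the default port from Step-2; "localhost" is an assumption.
    print("codegen service reachable:", is_port_open("localhost", 8880))
```

This only verifies that something accepts TCP connections on the port; it does not verify that the listener is actually the Antares service.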

Antares Predependencies for Different Backends:

Before running make command in antares root directory, you need to ensure the corresponding backend driver is installed correctly.

  • Predependencies for backend c-cuda, c-sycl_cuda:

    Requirement: Ubuntu >= 18.04

    Requirement: Install NVIDIA CUDA toolkit (>= 10.0) on Host OS

    Requirement: docker

  • Predependencies for backend c-ocl_nvidia:

    Requirement: Ubuntu >= 18.04

    Requirement: Install NVIDIA CUDA toolkit (>= 10.0) on Host OS

    Requirement: run bash command "make install_host" in antares root directory beforehand

  • Predependencies for backend c-ocl_android:

    Requirement: Ubuntu >= 18.04

    Requirement: Install package "adb", connect to rooted Android device and ensure command "adb shell su -c 'ls /sdcard'" works

    Requirement: run bash command "make install_host" in antares root directory beforehand

  • Predependencies for backend c-rocm, c-ocl_amdgpu:

    Requirement: Ubuntu >= 18.04

    Requirement: Install AMD ROCm (>= 4.0) packages "rock-dkms" & "rock-dkms-firmware" from repo on Host OS

    Requirement: docker

  • Predependencies for backend c-gc:

    Requirement: Ubuntu >= 18.04

    Requirement: Install Poplar SDK on Host OS, and ensure the "popc" command exists in system PATH

    Requirement: run bash command "make install_host" in antares root directory beforehand

  • Predependencies for backend c-scpu, c-mcpu, c-sycl_intel:

    Requirement: Ubuntu >= 18.04

    Requirement: docker

  • Predependencies for backend c-hlsl_win64, c-hlsl_xbox:

    Requirement: Windows 10 64 bit (>= 2004), run "dxdiag.exe" to ensure Direct3D 12.0 Acceleration is enabled

    Requirement: Windows Subsystem for Linux 1.0 (see: How to Install WSL 1.0)

    Requirement: Clone the antares repo with git inside the WSL environment; the path of the antares directory must be **visible to Windows** (e.g. "/../c/Users/me/Desktop/antares" would be OK, but "/home/me/antares" won't).

    Requirement: run bash command "make install_host" in antares root directory beforehand

  • Predependencies for backend c-rocm_win64:

    Requirement: Windows 10 64 bit (>= 2004)

    Requirement: Windows Subsystem for Linux 1.0 (see: How to Install WSL 1.0)

    Requirement: Install Official AMD GPU driver (release version >= 2020.11). Ensure C:\Windows\System32\amdhip64.dll exists after installation.

    Requirement: Clone the antares repo with git inside the WSL environment; the path of the antares directory must be **visible to Windows** (e.g. "/../c/Users/me/Desktop/antares" would be OK, but "/home/me/antares" won't).

    Requirement: run bash command "make install_host" in antares root directory beforehand

  • Predependencies for backend c-cuda_win64:

    Requirement: Windows 10 64 bit (>= 2004)

    Requirement: Windows Subsystem for Linux 1.0 (see: How to Install WSL 1.0)

    Requirement: Install Official NVIDIA CUDA driver (>= 10.0). Ensure C:\Windows\System32\nvcuda.dll exists after installation.

    Requirement: Clone the antares repo with git inside the WSL environment; the path of the antares directory must be **visible to Windows** (e.g. "/../c/Users/me/Desktop/antares" would be OK, but "/home/me/antares" won't).

    Requirement: run bash command "make install_host" in antares root directory beforehand
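
Several of the requirements above come down to a command being available on the host (docker, "popc", "adb"). As a small sketch, those checks can be automated with the Python standard library before running make. The mapping below only encodes commands explicitly named in this section. The names `REQUIRED_COMMANDS` and `missing_commands` are hypothetical, and driver or OS version requirements are not checked.

```python
import shutil

# Commands named in the requirements above. Backends whose requirements
# do not name a specific command are intentionally left out.
REQUIRED_COMMANDS = {
    "c-cuda": ["docker"],
    "c-sycl_cuda": ["docker"],
    "c-rocm": ["docker"],
    "c-ocl_amdgpu": ["docker"],
    "c-scpu": ["docker"],
    "c-mcpu": ["docker"],
    "c-sycl_intel": ["docker"],
    "c-gc": ["popc"],
    "c-ocl_android": ["adb"],
}


def missing_commands(backend, which=shutil.which):
    """Return the required commands for `backend` that are not on PATH."""
    return [cmd for cmd in REQUIRED_COMMANDS.get(backend, []) if which(cmd) is None]


if __name__ == "__main__":
    for backend in sorted(REQUIRED_COMMANDS):
        missing = missing_commands(backend)
        status = "OK" if not missing else "missing: " + ", ".join(missing)
        print(backend, status)
```

The `which` parameter exists only so the lookup can be replaced in tests; by default the real PATH is searched.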

Current Support Table:

HIP-C(c-rocm/c-rocm_win64) CUDA(c-cuda/c-cuda_win64) CPU(c-mcpu/c-scpu) DirectX12(c-hlsl_win64) Graphcore(c-gc) Intel OneAPI(c-sycl_intel) Codeplay DPCPP (c-sycl_cuda)
Deploy Environment Linux/WSL1 Linux Linux WSL1 Linux Linux
Target Device AMDGPU NVGPU Generic CPU Generic Graphic Card IPU Device Intel CPU/HD Graphic/FPGA NVGPU
Global schedules Y Y Y Y Y Y Y
Local schedules Y Y Y Y Y Y
Head fusion Y Y Y Y Y Y Y
Tail fusion Y Y Y Y
Evaluator Y Y Y Y Y Y Y
Tensorflow Plugin Y Y
Pytorch Plugin Y Y
Multi Kernel Eval Y Y Y Y Y Y

About Microsoft Open Source

For more information about Microsoft Open Source Policy, please see Microsoft Open Source Code of Conduct

