-
Notifications
You must be signed in to change notification settings - Fork 797
[SYCL][DOC] Add design document for SYCL-RTC #19698
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from all commits
Commits
Show all changes
4 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,317 @@ | ||
# SYCL Runtime Compilation | ||
|
||
SYCL-RTC means using the | ||
[`kernel_compiler`](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_kernel_compiler.asciidoc) | ||
extension to wrap a SYCL source string comprised of kernel definitions in the | ||
[free-function syntax](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/proposed/sycl_ext_oneapi_free_function_kernels.asciidoc) | ||
into a `kernel_bundle` in the `ext_oneapi_source` state, which is then compiled | ||
into `exectuable` state by the extension's `build(...)` function. The feature is | ||
backed by an implementation inside the `sycl-jit` library, which exposes the | ||
modular, LLVM-based compiler tech behind DPC++ to be called by the SYCL runtime. | ||
This document gives an overview of the design. | ||
|
||
```c++ | ||
#include <sycl/sycl.hpp> | ||
namespace syclexp = sycl::ext::oneapi::experimental; | ||
|
||
// ... | ||
|
||
std::string sycl_source = R"""( | ||
#include <sycl/sycl.hpp> | ||
|
||
extern "C" SYCL_EXT_ONEAPI_FUNCTION_PROPERTY(( | ||
sycl::ext::oneapi::experimental::nd_range_kernel<1>)) | ||
void vec_add(float* in1, float* in2, float* out){ | ||
size_t id = sycl::ext::oneapi::this_work_item::get_nd_item<1>() | ||
.get_global_linear_id(); | ||
out[id] = in1[id] + in2[id]; | ||
} | ||
)"""; | ||
|
||
sycl::queue q; | ||
|
||
auto source_bundle = syclexp::create_kernel_bundle_from_source( | ||
q.get_context(), syclexp::source_language::sycl, sycl_source); | ||
|
||
// This design document explains what happens on the next line. | ||
auto exec_bundle = syclexp::build(source_bundle); | ||
``` | ||
|
||
## File-and-process-based prototype | ||
|
||
The | ||
[first implementation](https://github.com/intel/llvm/blob/03cb2b25026f060149eb94c85b228e5b3a780588/sycl/source/detail/kernel_compiler/kernel_compiler_sycl.cpp#L254) | ||
of the `build(...)` function wrote the source string into a temporary file, | ||
invoked DPC++ on it with the `-fsycl-dump-device-code` flag to dump the device | ||
code to another file in SPIR-V format, and finally loaded that file back into | ||
the runtime, from where it was executed. | ||
|
||
## The rationale for an in-memory compilation pipeline | ||
|
||
Invoking the DPC++ executable as outlined in the previous section worked | ||
reasonably well to implement the basic `kernel_compiler` extension, but we | ||
observed several shortcomings: | ||
|
||
- Functional completeness: Emitting a single SPIR-V file is sufficient for | ||
simple kernels, but more advanced device code may result in multiple *device | ||
images* comprised of SPIR-V binaries and accompanying metadata (*runtime | ||
properties*) that needs to be communicated to the runtime. | ||
- Robustness: Reading multiple dependent files from a temporary directory can be | ||
be fragile. | ||
- Performance: Multiple processes are launched by the compiler driver, and file | ||
I/O operations have a non-negligible overhead. The `-fsycl-dump-device-code` | ||
required the presence of a dummy `main()` to be added to the source string, | ||
and caused an unnecessary host compilation to be performed. | ||
- Security: Reading executable code from disk is a security concern, and users | ||
of an RTC-enabled application may be unaware that a compilation writing | ||
intermediate files is happening in the background. | ||
|
||
These challenges ultimately motivated the design of the **in-memory compilation | ||
pipeline** based on the `sycl-jit` library which is now the default approach in | ||
DPC++ and the oneAPI product distribution since the 2025.2 release. This new | ||
approach leverages **modular compiler technology** to produce a faster, more | ||
feature-rich, more robust and safer implementation of the `kernel_compiler` | ||
extension. | ||
|
||
The individual steps in the pipeline (frontend, device library linking, | ||
`sycl-post-link` and target format translation) are now invoked programmatically | ||
via an API inside the same process, and intermediate results are passed along as | ||
objects in memory. The code can be found in the | ||
[`compileSYCL(...)`](https://github.com/intel/llvm/blob/cc966df07d29db75d07f969f044c0491819bd930/sycl-jit/jit-compiler/lib/rtc/RTC.cpp#L57) | ||
function. | ||
|
||
## Using the LibTooling API to compile the source string to an `llvm::Module` | ||
|
||
[LibTooling](https://clang.llvm.org/docs/LibTooling.html) is a high-level API to | ||
write standalone tools based on Clang, such as linters, refactoring tools or | ||
static analysers. To use it, one defines a *tool action* to run on a set of | ||
files in a *virtual filesystem overlay*, which the frontend then processes | ||
according to a *compilation command database*. | ||
|
||
For SYCL-RTC, the filesystem overlay is populated with files containing the | ||
source string and any virtual `include_files` (defined via the homonymous | ||
property). The compilation command is static and puts the frontend into | ||
`-fsycl-device-only` mode. Any user-given options (from the `build_options` | ||
property) are appended. Lastly, the implementation defines a custom tool action | ||
which runs the frontend until LLVM codegen, and then obtains ownership of the | ||
LLVM module. | ||
|
||
This might be a slightly unusual way to use of LibTooling, but we found it works | ||
great for SYCL-RTC. The next sections explain the | ||
[`jit_compiler::compileDeviceCode(...)`](https://github.com/intel/llvm/blob/cc966df07d29db75d07f969f044c0491819bd930/sycl-jit/jit-compiler/lib/rtc/DeviceCompilation.cpp#L418) | ||
function in more detail. | ||
|
||
### Step 1: Determine the path of the compiler installation | ||
|
||
To set up up working frontend invocation, we need to know where to find | ||
supplemental files such as the SYCL headers. Normally, these paths are | ||
determined relative to the compiler executable, however in our case, the | ||
executable is actually the RTC-enabled application, which can reside in an | ||
arbitrary location. Instead, we use OS-specific logic inside | ||
[`getDPCPPRoot()`](https://github.com/intel/llvm/blob/cc966df07d29db75d07f969f044c0491819bd930/sycl-jit/jit-compiler/lib/rtc/DeviceCompilation.cpp#L112) | ||
to determine the location of the shared library `sycl-jit.so` (or `.dll` on | ||
Windows) which contains the SYCL-RTC implementation. From its location, we can | ||
derive the compiler installation's root directory. | ||
|
||
### Step 2: Collect command-line arguments | ||
|
||
The next step is to collect the command-line arguments for the frontend | ||
invocation. The | ||
[`adjustArgs(...)`](https://github.com/intel/llvm/blob/cc966df07d29db75d07f969f044c0491819bd930/sycl-jit/jit-compiler/lib/rtc/DeviceCompilation.cpp#L320) | ||
function relies on Clang's option handling infrastructure to set the required | ||
options to enter the device compilation mode (`-fsycl-device-only`), set up the | ||
compiler environment, and select the target. Finally, any user-specified | ||
arguments passed via the | ||
[`build_options`](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_kernel_compiler.asciidoc#new-properties-for-the-build-and-compile-functions) | ||
property are appended to the list of command-line arguments. | ||
|
||
### Step 3: Configure the `ClangTool` | ||
|
||
Once we know the required command-line arguments, we can set up the compilation | ||
command database and an | ||
[instance](https://github.com/intel/llvm/blob/cc966df07d29db75d07f969f044c0491819bd930/sycl-jit/jit-compiler/lib/rtc/DeviceCompilation.cpp#L433) | ||
of the | ||
[`ClangTool`](https://github.com/intel/llvm/blob/cc966df07d29db75d07f969f044c0491819bd930/clang/include/clang/Tooling/Tooling.h#L317) | ||
class, which provides the entry point to the LibTooling interface. As we'll be | ||
translating only a single file containing the source string, we construct a | ||
[`FixedCompilationDatabase`](https://github.com/intel/llvm/blob/cc966df07d29db75d07f969f044c0491819bd930/clang/include/clang/Tooling/CompilationDatabase.h#L154) | ||
relative to the current working directory. | ||
|
||
To implement the `kernel_compiler` extension cleanly, we need to capture all | ||
output (e.g. warnings and errors) from the frontend. The | ||
[`ClangDiagnosticsWrapper`](https://github.com/intel/llvm/blob/cc966df07d29db75d07f969f044c0491819bd930/sycl-jit/jit-compiler/lib/rtc/DeviceCompilation.cpp#L274) | ||
class configures a | ||
[`TextDiagnosticsPrinter`](https://github.com/intel/llvm/blob/cc966df07d29db75d07f969f044c0491819bd930/clang/include/clang/Frontend/TextDiagnosticPrinter.h#L27) | ||
to append all messages to a string maintained by our implementation to collect | ||
all output produced during the runtime compilation. | ||
|
||
The configuration of the `ClangTool` instance continues in the | ||
[`setupTool`](https://github.com/intel/llvm/blob/cc966df07d29db75d07f969f044c0491819bd930/sycl-jit/jit-compiler/lib/rtc/DeviceCompilation.cpp#L353) | ||
function. First, we redirect all output to our diagnostics wrapper. Then, we | ||
[set up the overlay | ||
filesystem](https://github.com/intel/llvm/blob/cc966df07d29db75d07f969f044c0491819bd930/sycl-jit/jit-compiler/lib/rtc/DeviceCompilation.cpp#L361-L364) | ||
with a file named `rtc_<n>.cpp` (*n* is incremented for each use of the | ||
`kernel_compiler` extension's `build(...)` function) in the current directory | ||
with the contents of the source string. Each of the virtual header files that | ||
the application defined via the | ||
[`include_files`](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_kernel_compiler.asciidoc#new-properties-for-the-create_kernel_bundle_from_source-function) | ||
property becomes also a file in the overlay filesystem, using the path specified | ||
in the property. | ||
|
||
The `ClangTool` class exposes so-called argument adjusters, which are intended | ||
to modify the command-line arguments coming from the compilation command | ||
database. We have to | ||
[clear](https://github.com/intel/llvm/blob/cc966df07d29db75d07f969f044c0491819bd930/sycl-jit/jit-compiler/lib/rtc/DeviceCompilation.cpp#L368) | ||
the default adjusters defined by the class, because one of them injects the | ||
`-fsyntax-only` flag, which would conflict with the `-fsycl-device-only` flag we | ||
need for SYCL-RTC. Finally, we | ||
[add](https://github.com/intel/llvm/blob/cc966df07d29db75d07f969f044c0491819bd930/sycl-jit/jit-compiler/lib/rtc/DeviceCompilation.cpp#L371) | ||
an argument adjuster ourselves to overwrite the name of executable in the | ||
invocation. Again, this is to help the correct detection of the environment, by | ||
making the invocation as similar as possible to a normal use of DPC++. | ||
|
||
### Step 4: Run an action | ||
|
||
The last step is to define a | ||
[`ToolAction`](https://github.com/intel/llvm/blob/cc966df07d29db75d07f969f044c0491819bd930/clang/include/clang/Tooling/Tooling.h#L80) | ||
to be executed on the source files. Clang conveniently provides the | ||
[`EmitLLVMAction`](https://github.com/intel/llvm/blob/cc966df07d29db75d07f969f044c0491819bd930/clang/include/clang/CodeGen/CodeGenAction.h#L103), | ||
which runs the frontend up until the LLVM IR code generation, which is exactly | ||
what we need. However, LibTooling does not provides a helper to wrap it in a | ||
`ToolAction`, so we need to define and run our own | ||
[`GetLLVMModuleAction`](https://github.com/intel/llvm/blob/cc966df07d29db75d07f969f044c0491819bd930/sycl-jit/jit-compiler/lib/rtc/DeviceCompilation.cpp#L241). | ||
|
||
We extracted common boilerplate code to configure a | ||
[`CompilerInstance`](https://github.com/intel/llvm/blob/cc966df07d29db75d07f969f044c0491819bd930/clang/include/clang/Frontend/CompilerInstance.h#L81) | ||
in the | ||
[`RTCActionBase`](https://github.com/intel/llvm/blob/cc966df07d29db75d07f969f044c0491819bd930/sycl-jit/jit-compiler/lib/rtc/DeviceCompilation.cpp#L176) | ||
class. Inside the `GetLLVMModuleAction`, we instantiate and execute the | ||
aforementioned `EmitLLVMAction`, and, in case the translation was successful, | ||
[obtains ownership](https://github.com/intel/llvm/blob/cc966df07d29db75d07f969f044c0491819bd930/sycl-jit/jit-compiler/lib/rtc/DeviceCompilation.cpp#L255) | ||
of the constructed `llvm::Module` from it. | ||
|
||
Finally, the call to | ||
[`Action.takeModule()`](https://github.com/intel/llvm/blob/cc966df07d29db75d07f969f044c0491819bd930/sycl-jit/jit-compiler/lib/rtc/DeviceCompilation.cpp#L442) | ||
transfers ownership again to the caller of `compileDeviceCode`. Note that this | ||
simple mechanism works because we know that there is only a single compilation | ||
happening for every instance of the `ClangTool` and hence our | ||
`GetLLVMModuleAction` class. | ||
|
||
## Caching | ||
|
||
The implementation optionally uses the runtime's | ||
[persistent cache](https://intel.github.io/llvm/design/KernelProgramCache.html#persistent-cache) | ||
to elide recurring invocations of the frontend, which we observed to be the most | ||
expensive (in terms of runtime overhead) phase of our compilation pipeline. | ||
|
||
### Overall design | ||
|
||
We cache only the frontend invocation, meaning that after a successful | ||
translation, we store the LLVM IR module obtained via LibTooling on disk in the | ||
Bitcode format using built-in utilities. In case of a cache hit in a later | ||
runtime compilation, we load the module from disk and feed it into the device | ||
linking phase. The rationale for this design was that were no utilities to save | ||
and restore the linked and post-processed device images to disk at the time (the | ||
[SYCLBIN](https://intel.github.io/llvm/design/SYCLBINDesign.html) infrastructure | ||
was added later), and caching these steps would have resulted only in marginal | ||
further runtime savings. | ||
|
||
### Cache key considerations | ||
|
||
The main challenge is to define a robust cache key. Because code compiled via | ||
SYCL-RTC can `#include` header files defined via the `include_files` property as | ||
well as from the filesystem, e.g. `sycl.hpp` from the DPC++ installation or user | ||
libraries, it is not sufficient to look only at the source string. In order to | ||
make the cache as conservative as possible (cache collisions are unlikely but | ||
mathematically possible), we decided to compute a hash value of the | ||
*preprocessed* source string, i.e. with all `#include` directives resolved. We | ||
additionally compute a hash value of the rendered command-line arguments, and | ||
append it to the hash of the preprocessed source to obtain the final cache key. | ||
|
||
### Implementation notes | ||
|
||
The cache key computation is implemented in the | ||
[`jit_compiler::calculateHash(...)`](https://github.com/intel/llvm/blob/cc966df07d29db75d07f969f044c0491819bd930/sycl-jit/jit-compiler/lib/rtc/DeviceCompilation.cpp#L381) | ||
function. We are again relying on LibTooling to invoke the preprocessor - | ||
handily, Clang provides a | ||
[`PreprocessorFrontendAction`](https://github.com/intel/llvm/blob/cc966df07d29db75d07f969f044c0491819bd930/clang/include/clang/Frontend/FrontendAction.h#L294) | ||
that we extend to tailor to our use-case. We choose | ||
[BLAKE3](https://en.wikipedia.org/wiki/BLAKE_(hash_function)) as the hash | ||
algorithm because its proven in similar contexts (most notably, | ||
[ccache](https://ccache.dev)) and available as a utility in the LLVM ecosystem. | ||
As the output is a byte array, we apply Base64 encoding to obtain a character | ||
string for use with the persistent cache. | ||
|
||
## Device library linking and SYCL-specific transformations | ||
|
||
With an LLVM IR module in hand, obtained either from the frontend or the cache, | ||
the next steps in the compilation pipeline are simple. | ||
|
||
The device library linking is done by the | ||
[`jit_compiler::linkDeviceLibraries(...)`](https://github.com/intel/llvm/blob/cc966df07d29db75d07f969f044c0491819bd930/sycl-jit/jit-compiler/lib/rtc/DeviceCompilation.cpp#L566) | ||
function. These libraries provide primitives for a variety of extra | ||
functionality, such as an extended set of math functions and support for | ||
`bfloat16` arithmetic, and are available as Bitcode files inside the DPC++ | ||
installation or the vendor toolchain, so we just use LLVM utilities to load them | ||
into memory and link them to the module representing the runtime-compiled | ||
kernels. The main challenge here is that the logic to select the device | ||
libraries is currently not reusable from its implementation in the driver, so | ||
our implementation is a simplified copy of the | ||
[`SYCL::getDeviceLibraries(...)`](https://github.com/intel/llvm/blob/cc966df07d29db75d07f969f044c0491819bd930/clang/lib/Driver/ToolChains/SYCL.cpp#L553) | ||
method, which needs to be kept in sync with the driver code. | ||
|
||
For the SYCL-specific post-processing, implemented in | ||
[`jit_compiler::performPostLink(...)`](https://github.com/intel/llvm/blob/cc966df07d29db75d07f969f044c0491819bd930/sycl-jit/jit-compiler/lib/rtc/DeviceCompilation.cpp#L750), | ||
we can reuse modular analysis and transformation passes in the | ||
[`SYCLLowerIR`](https://github.com/intel/llvm/tree/sycl/llvm/lib/SYCLLowerIR) | ||
component. The main tasks for the post-processing passes is to split the device | ||
code module into smaller units (either as requested by the user, or required by | ||
the ESIMD mode), and to compute the properties that need to be passed to the | ||
SYCL runtime when the device images are loaded. The logic to orchestrate the | ||
`SYCLLowerIR` passes is adapted from the `sycl-post-link` tool's | ||
[`processInputModule(...)`](https://github.com/intel/llvm/blob/cc966df07d29db75d07f969f044c0491819bd930/llvm/tools/sycl-post-link/sycl-post-link.cpp#L606) | ||
function. This duplicated code should be removed as well once a suitable | ||
reusable implementation becomes available. | ||
|
||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We could mention the code duplication and the plans to remove it in the future here. |
||
## Translation to the target format | ||
|
||
The | ||
[final phase](https://github.com/intel/llvm/blob/cc966df07d29db75d07f969f044c0491819bd930/sycl-jit/jit-compiler/lib/rtc/RTC.cpp#L139) | ||
in the pipeline is to translate the LLVM IR modules resulting from the previous | ||
phase into a device-specific target format that can be handled by the runtime. | ||
For Intel CPUs and GPUs, that's binary SPIR-V. For AMD and NVIDIA GPUs, we emit | ||
AMDGCN and PTX assembly, respectively. Over time, we created our own set of | ||
[utilities](https://github.com/intel/llvm/blob/cc966df07d29db75d07f969f044c0491819bd930/sycl-jit/jit-compiler/lib/translation/Translation.h) | ||
to facilitate the translation. Internally, we dispatch the task to either the | ||
SPIR-V translator (a copy of which is maintained inside the DPC++ repository), | ||
or use vendor-specific backends that are part of LLVM to generate the | ||
third-party GPU code. | ||
|
||
## Third-party hardware support | ||
|
||
SYCL-RTC works for AMD and NVIDIA GPUs, too. The usage of the `kernel_compiler` | ||
extension remains the same for SYCL devices representing such a third-party GPU. | ||
The concrete GPU architecture is queried via the environment variable | ||
`SYCL_JIT_AMDGCN_PTX_TARGET_CPU` when executing the RTC-enabled application. For | ||
AMD GPUs, it is **mandatory** to set it. For NVIDIA GPUs, it is highly | ||
recommended to change it from the conservative default architecture (`sm_50`). | ||
|
||
```shell | ||
$ clang++ -fsycl myapp.cpp -o myapp | ||
$ SYCL_JIT_AMDGCN_PTX_TARGET_CPU=sm_90 ./myapp | ||
``` | ||
|
||
A list of values that can be set as the target CPU can be found in the | ||
[documentation of the `-fsycl-targets=` | ||
option](https://intel.github.io/llvm/UsersManual.html#generic-options) (leave | ||
out the `amd_gpu_` and `nvidia_gpu_` prefixes). | ||
|
||
## Further reading | ||
|
||
- Technical presentation at IWOCL 2025: *Fast In-Memory Runtime Compilation of | ||
SYCL Code*: | ||
[Slides](https://www.iwocl.org/wp-content/uploads/iwocl-2025-julian-oppermann-fast-in-memory-runtime.pdf) | ||
[Video Recording](https://youtu.be/X9mS8xetZJY) | ||
- Blog post: | ||
[*SYCL Runtime Compilation: A New Way to Specialise Kernels Using C++ Metaprogramming*](https://codeplay.com/portal/blogs/2025/07/08/sycl-runtime-compilation) |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd expect clang's PCH to have to deal with the same issue. Have you looked into re-using that infrastructure?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, while we did some non-SYCL (due to lack of support at the time) experiments with PCHs, we have not specifically looked into reusing any hashing infra from there.