[SYCL][NVPTX][AMDGCN] Move devicelib cmath to header #18706

npmiller · 2025-05-28T15:45:10Z

Overview

Currently to support C++ builtins in SYCL kernels, we rely on libdevice which
provides implementations for standard library builtins. This library is built
either to bitcode or SPIR-V and linked in our kernels.

On some targets this causes issues because clang sometimes turns standard
library calls into LLVM intrinsics that not all targets support. Specifically on
NVPTX and AMDGCN we can't easily support these intrinsics because we currently
use implementations provided by CUDA and HIP in the form of a bitcode library,
which is not something we can use from the LLVM backend.

In upstream LLVM for CUDA and HIP kernels, the way this is handled is that they
have clang headers providing device-side overloads of C++ library functions that
hook into the target specific versions of the builtins (for example std::sin
to __nv_sin). This way on the device side C++ builtins are hijacked before
clang can turn them to intrinsics which solves the issue mentioned above.

This patch is adding the infrastructure to support handling C++ builtins in SYCL
in the same way as it is done for CUDA and HIP in upstream LLVM. And is using it
to support cmath in NVPTX and AMDGCN compilation.

Breakdown

Add sycl_device_only attribute: This new attribute allows functions marked
with it to be treated as device-side overload of existing functions. This is
what allows us to overload C++ library functions for device in SYCL.
Remove clang hack to prevent generating LLVM intrinsics from standard library
builtins for NVPTX and AMDGCN. In theory since this is only moving cmath, the
hack could still be needed, but it looks fine in testing and if we run into
issues we should just move the problematic builtins to this solution. The test
sycl-libdevice-cmath.cpp was testing this hack, so it was removed.
cmath support for NVPTX and AMDGCN in libdevice was disabled. To limit the
scope of the patch libdevice is still fully wired up for these targets, but it
just won't provide the cmath functions.
Added a cmath-fallback.h header providing the device-side math function
overloads. They are defined using SPIR-V builtins, so in theory this header
could be used as-is for other targets.
Use our existing cmath stl wrapper to include cmath-fallback.h for NVPTX
and AMDGCN. In upstream LLVM clang-cuda always includes with -include the
header with these overloads, using the stl wrappers is a bit more selective.
Add rint to device lib tests and stl wrapper, this was added in
[SYCL][Devicelib] Implement cmath rintf wrapper with __spirv_ocl_rint #18857 but wasn't in E2E testing.

Compile-time performance

A quick check of compile-time shows that this seems to provide a small performance improvement. Using two samples, one using cmath (the E2E cmath_test.cpp), and a sample not using cmath, over 10 iterations, I'm getting the following results:

Run	Mean	Stdev
With patch, cmath sample	4.2229s	0.0294s
With patch, no cmath sample	5.7484s	0.0525s
Without patch, cmath sample	4.3817s	0.0424s
Without patch, no cmath sample	5.7941s	0.0452s

Which suggest that the no cmath compile time performance is pretty much equivalent, and the cmath compile-time performance is faster by roughly ~0.12s.

And this is with the whole libdevice setup still in place, so it's possible this approach could be even more beneficial with more work.

Future work

Investigate commented out standard math builtins in cmath-fallback.h, these
weren't defined in libdevice, we should either remove the commented out lines or
implement them properly.
Untangle cmath and math.h, the current cmath-fallback.h implements both
which seems to work fine, but ideally we should split it up.
Deal with nearbyint, this was only implemented for NVPTX and AMDGCN in
libdevice, this patch keeps it the same, but we should look into proper
support and testing for this.
Move more of libdevice into headers (complex, assert, crt, etc ...).
Try this approach for SPIR-V or other targets.

npmiller · 2025-05-28T16:40:44Z

@bader this is a proof of concept for moving C++ library handling from libdevice code into headers. It allows us to remove the hack blocking LLVM intrinsic generation for standard math built-ins, since we intercept them earlier in the header for device side, which is in-line with what clang cuda does. Only for cmath and for Nvidia and AMD for now.

I've currently placed the header into the stl_wrappers directory, it might be better as a clang header, but at least on CUDA the clang header is always included whereas with the stl wrappers it will only be included when the matching standard library header is included.

This still needs a ton of work which is why it's a draft, but let me know if you have any feedback on the approach.

It would be good to know if this would be interesting for non-AOT targets as well, there's a lot of logic in the driver to conditionally link libdevice libraries, I suspect in theory most of that could be replaced with this header approach, but I haven't looked into this much so I'm not 100% sure if this is something we'd want.

bader

@npmiller, thanks for working on this.

It allows us to remove the hack blocking LLVM intrinsic generation for standard math built-ins, since we intercept them earlier in the header for device side, which is in-line with what clang cuda does.

I discussed this approach with Johannes Doerfert a few years ago. He told me that he doesn't like "what clang cuda does" and plans to change it. I think clang still uses the header solution, but it may be worth to double check with LLVM community is doing any work in that direction.

I've currently placed the header into the stl_wrappers directory, it might be better as a clang header, but at least on CUDA the clang header is always included whereas with the stl wrappers it will only be included when the matching standard library header is included.

Interesting... I thought that clang only adds path to the clang headers at the beginning of the search paths list to make sure that clang wrapper header is included before STL one. I didn't know that CUDA compiler always includes clang wrapper headers.

It would be good to know if this would be interesting for non-AOT targets as well, there's a lot of logic in the driver to conditionally link libdevice libraries, I suspect in theory most of that could be replaced with this header approach, but I haven't looked into this much so I'm not 100% sure if this is something we'd want.

@AlexeySachkov, could you take into SPIR-V part, please?

The change looks to be aligned with the community approach. The only concern I have is compile time, but potential increase should be negligible.

cc @Naghasan just to keep in the loop.

clang/lib/AST/Decl.cpp

sycl/include/sycl/stl_wrappers/cmath-fallback.h

Naghasan

don't forget to add tests and documentation for the attribute before undrafting :)

clang/lib/AST/Decl.cpp

clang/lib/Sema/SemaOverload.cpp

clang/include/clang/Basic/Attr.td

npmiller · 2025-05-29T09:43:18Z

Interesting... I thought that clang only adds path to the clang headers at the beginning of the search paths list to make sure that clang wrapper header is included before STL one. I didn't know that CUDA compiler always includes clang wrapper headers.

Yeah in the driver here it does:

  CC1Args.push_back("-include");
  CC1Args.push_back("__clang_cuda_runtime_wrapper.h");

And __clang_cuda_cmath.h is included from that runtime wrapper header here, it also includes <cmath>.

Using our stl wrappers solution should allow us to be a little more conservative about when we include all of this stuff.

bader · 2025-07-14T21:10:29Z

SYCL :: Reduction/reduction_nd_lambda.cpp failure on Windows is not related to the patch. This issue is tracked by #19354.

tahonermann

With this review, I have four outstanding requests:

Fixing a misnamed header guard.
A change to diagnose declarations that have both the sycl_device and sycl_device_only attributes as errors.
A change to diagnose overload sets that contain both sycl_device and sycl_device_only attributed overloads.
The reverse mangled name declaration lookup issue.

The 4th item will take some time and I understand that this feature is needed with some urgency, so I'm ok to defer that one to a later PR and will file an issue to address it after this PR lands. If possible though, I'd like to see the first three issues addressed in this PR.

sycl/include/sycl/stl_wrappers/__sycl_cmath_wrapper_impl.hpp

bader · 2025-07-14T21:58:43Z

A change to diagnose declarations that have both the sycl_device and sycl_device_only attributes as errors.

A change to diagnose overload sets that contain both sycl_device and sycl_device_only attributed overloads.

The reverse mangled name declaration lookup issue.

@tahonermann, are okay to make these changes in follow-up PR(s)?

npmiller · 2025-07-15T13:48:31Z

A change to diagnose declarations that have both the sycl_device and sycl_device_only attributes as errors.

A change to diagnose overload sets that contain both sycl_device and sycl_device_only attributed overloads.

I've added diagnostics for these and tightened up overlapping usages of the attributes, as well as added a lit test for it.

That allowed a small cleanup of the problematic code in EmitGlobal, although as I explained the bulk of it still remains to handle the mangling lookup issue.

I've also added a small check in Sema::MergeDecl to make sure we're not merging declarations with and without the attribute.

Hopefully that resolves the remaining issues and we can go ahead with the PR. And then iterate on what remains to be improved.

tahonermann

Thank you for adding those diagnostics so quickly, @npmiller! This looks good to me. I'll file an issue after this PR is merged to follow up on the reverse mangled name declaration lookup issue.

bader · 2025-07-15T15:17:33Z

Tom has approved this PR, but apparently, he is not member of the @intel/dpcpp-cfe-reviewers team.
@intel/dpcpp-cfe-reviewers, could someone approve to unblock the merge, please?

AaronBallman

LGTM!

bader · 2025-07-15T17:22:01Z

@npmiller, a couple of follow-ups.

I think we should drop -fallback from the cmath-fallback.cpp test name.
libdevice incremental build is broken. When I make change in the sycl/stl_wrappers/cmath header, the header must be copied to "build" directory before we start building libdevice. Otherwise, libdevice uses cmath header from the last successful build. I assume you plan to address this issue by moving cmath header to the clang/lib/Headers. Am I right?

UPD: the second item might be some bug in my environment, please, ignore it. I'll double-check and fix it myself if it's the case.

Follow up from: #18706 * Include `<type_traits>`, this was working because `<cmath>` pulls it in but this is cleaner. * Use C++14 and C++17 `_v` and `_t` helpers. * Switch to variadic template for `__sycl_promote`, using C++17 fold expressions. In theory only needed for 1 to 3 types, but this is an internal helper anyway so it doesn't seem worth restricting. * Switch `typedef` to `using` With this patch the header now require C++17 support to compile. Which is fine since it's the minimum requirement for SYCL 2020, and this header is only compiled on the device side which is always handled by clang, so there shouldn't be any support issues.

AlexeySachkov · 2025-07-17T12:35:41Z

It would be good to know if this would be interesting for non-AOT targets as well, there's a lot of logic in the driver to conditionally link libdevice libraries, I suspect in theory most of that could be replaced with this header approach, but I haven't looked into this much so I'm not 100% sure if this is something we'd want.

@AlexeySachkov, could you take into SPIR-V part, please?

Today, devicelib has "wrappers" part and "fallback" part. "wrappers" intercept C/C++ library calls and redirect them into a unified interface and "fallback" part is supposed to be conditionally linked in case device doesn't support corresponding functionality. "wrappers" are linked unconditionally.

From what I understand, this PR implements an alternative approach to "wrappers" part - where instead of providing a symbol that is expected to come from the standard C/C++ runtime library at device code link time, we provide it directly through headers.

I don't think that it should otherwise change much in how devicelib is handled for non-AOT targets (i.e. conditional linking of "fallback" libraries) and reducing amount of files that we need to redistribute in the compiler package is always a nice thing.
Extra C++ headers do not count, because no one lists all those headers and just recursively copies the directory. But there are folks who repackage lib and bin folders to only include a minimum set of libraries that they need from our releases. So it will be a nice improvement.

npmiller · 2025-07-18T14:13:06Z

It would be good to know if this would be interesting for non-AOT targets as well, there's a lot of logic in the driver to conditionally link libdevice libraries, I suspect in theory most of that could be replaced with this header approach, but I haven't looked into this much so I'm not 100% sure if this is something we'd want.

@AlexeySachkov, could you take into SPIR-V part, please?

Today, devicelib has "wrappers" part and "fallback" part. "wrappers" intercept C/C++ library calls and redirect them into a unified interface and "fallback" part is supposed to be conditionally linked in case device doesn't support corresponding functionality. "wrappers" are linked unconditionally.

From what I understand, this PR implements an alternative approach to "wrappers" part - where instead of providing a symbol that is expected to come from the standard C/C++ runtime library at device code link time, we provide it directly through headers.

I don't think that it should otherwise change much in how devicelib is handled for non-AOT targets (i.e. conditional linking of "fallback" libraries) and reducing amount of files that we need to redistribute in the compiler package is always a nice thing. Extra C++ headers do not count, because no one lists all those headers and just recursively copies the directory. But there are folks who repackage lib and bin folders to only include a minimum set of libraries that they need from our releases. So it will be a nice improvement.

Just to clarify this PR implements both the "wrappers" but also the "fallback" in one step, it doesn't need anything else from libdevice. So in theory enabling this for non-AOT targets would allow us to remove linking both the cmath wrapper and the cmath fallback.

The few things to consider when enabling this for non-AOT targets:

Are __devicelib_ functions called from the wrappers used for anything else or are they always implemented in the devicelib fallback?
Since this is intercepting built-ins earlier it may switch from using clang built-ins to spir-v built-ins for some of them, which could cause performance regression, but this could be tweaked in the header, and it's possible clang built-ins get turned into spir-v built-ins anyway.

AlexeySachkov · 2025-07-21T08:23:21Z

Just to clarify this PR implements both the "wrappers" but also the "fallback" in one step, it doesn't need anything else from libdevice. So in theory enabling this for non-AOT targets would allow us to remove linking both the cmath wrapper and the cmath fallback.

Ah, that's right since we redirect std:: directly into __spirv built-ins. However, we could have a different redirect for SPIR-V path, i.e. to continue to use existing __devicelib if we wanted to keep the "fallback" path.

The few things to consider when enabling this for non-AOT targets:

Are __devicelib_ functions called from the wrappers used for anything else or are they always implemented in the devicelib fallback?

We have a fallback implementation for every __devicelib function. They should not be used directly from anywhere else besides "wrappers", but I can't guarantee that someone did not smuggle something like this through a code review somehow (for whichever reason).

Since this is intercepting built-ins earlier it may switch from using clang built-ins to spir-v built-ins for some of them, which could cause performance regression, but this could be tweaked in the header, and it's possible clang built-ins get turned into spir-v built-ins anyway.

At least the "fallback" implementation (that is used in absolute most cases, if not always) is written using SPIR-V built-ins anyway, because it is supposed to work everywhere. It is actual device backends who are supposed to provide optimized versions. And as I said above, we can do a different redirect in headers to preserve the rest of the devicelib mechanism and just replace the "wrappers" part of it if we want to be conservative.

There is a direction to move away from the complexity that devicelib introduces, because there are only a handful of libraries which are actually provided in their "native" form by one or another backend. I.e. the benefit of having that infrastructure is very small comparing to amount of code and redistributable files that we have. From that point of view its indeed better to replace both "wrappers" and "fallback", dropping the ability to plug-in native implementation

Follow-up from #18706 Initially the header introduced in the other PR was called `cmath-fallback.h`, but we renamed it, so just rename this test as well to something a bit more descriptive.

bader · 2025-07-22T18:15:03Z

Just to clarify this PR implements both the "wrappers" but also the "fallback" in one step, it doesn't need anything else from libdevice. So in theory enabling this for non-AOT targets would allow us to remove linking both the cmath wrapper and the cmath fallback.

The implementation is easier to maintain and deploy. In my opinion, this is a win.

Ah, that's right since we redirect std:: directly into __spirv built-ins. However, we could have a different redirect for SPIR-V path, i.e. to continue to use existing __devicelib if we wanted to keep the "fallback" path.

To get the benefits the "fallback" path we need driver runtimes to implement new extensions, which they are not going to. I don't think it's worth keeping the "fallback" path.

It is actual device backends who are supposed to provide optimized versions.

This is where whole idea breaks. Device backends don't do that and I haven't seen any interest in doing that in the future.

There is a direction to move away from the complexity that devicelib introduces, because there are only a handful of libraries which are actually provided in their "native" form by one or another backend. I.e. the benefit of having that infrastructure is very small comparing to amount of code and redistributable files that we have. From that point of view its indeed better to replace both "wrappers" and "fallback", dropping the ability to plug-in native implementation

That's exactly my point. You can still plug-in native implementation at compile time through macro. Potentially we could do the same at JIT time with the spec constants, but again, I don't see any interest in such a feature form either users or developers.

npmiller had a problem deploying to WindowsCILock May 28, 2025 15:45 — with GitHub Actions Failure

npmiller had a problem deploying to WindowsCILock May 28, 2025 16:35 — with GitHub Actions Failure

bader reviewed May 28, 2025

View reviewed changes

clang/lib/AST/Decl.cpp Outdated Show resolved Hide resolved

sycl/include/sycl/stl_wrappers/cmath-fallback.h Outdated Show resolved Hide resolved

Naghasan reviewed May 28, 2025

View reviewed changes

clang/lib/AST/Decl.cpp Outdated Show resolved Hide resolved

clang/lib/Sema/SemaOverload.cpp Outdated Show resolved Hide resolved

clang/include/clang/Basic/Attr.td Outdated Show resolved Hide resolved

npmiller force-pushed the rip-libdevice branch from 009c38b to 8e11b15 Compare May 29, 2025 15:31

npmiller temporarily deployed to WindowsCILock May 29, 2025 15:31 — with GitHub Actions Inactive

npmiller had a problem deploying to WindowsCILock May 29, 2025 16:14 — with GitHub Actions Error

npmiller temporarily deployed to WindowsCILock May 29, 2025 16:14 — with GitHub Actions Inactive

npmiller temporarily deployed to WindowsCILock May 29, 2025 16:53 — with GitHub Actions Inactive

npmiller temporarily deployed to WindowsCILock May 29, 2025 17:42 — with GitHub Actions Inactive

npmiller force-pushed the rip-libdevice branch from bf6aea1 to 1e4db98 Compare June 9, 2025 16:50

npmiller had a problem deploying to WindowsCILock June 9, 2025 16:50 — with GitHub Actions Error

npmiller force-pushed the rip-libdevice branch from 1e4db98 to c48fb78 Compare June 9, 2025 16:54

npmiller temporarily deployed to WindowsCILock June 9, 2025 16:55 — with GitHub Actions Inactive

npmiller temporarily deployed to WindowsCILock June 9, 2025 17:33 — with GitHub Actions Inactive

npmiller mentioned this pull request Jun 10, 2025

[SYCL][Devicelib] Implement cmath rintf wrapper with __spirv_ocl_rint #18857

Merged

npmiller force-pushed the rip-libdevice branch from c48fb78 to 9318194 Compare June 12, 2025 17:08

npmiller had a problem deploying to WindowsCILock June 12, 2025 17:08 — with GitHub Actions Error

npmiller temporarily deployed to WindowsCILock June 12, 2025 17:13 — with GitHub Actions Inactive

npmiller temporarily deployed to WindowsCILock June 12, 2025 19:41 — with GitHub Actions Inactive

npmiller force-pushed the rip-libdevice branch from 9042d23 to 466518f Compare June 13, 2025 15:31

npmiller had a problem deploying to WindowsCILock June 13, 2025 15:32 — with GitHub Actions Error

npmiller had a problem deploying to WindowsCILock June 13, 2025 15:34 — with GitHub Actions Failure

npmiller had a problem deploying to WindowsCILock June 13, 2025 16:26 — with GitHub Actions Failure

aelovikov-intel approved these changes Jul 14, 2025

View reviewed changes

bader temporarily deployed to WindowsCILock July 14, 2025 21:01 — with GitHub Actions Inactive

bader had a problem deploying to WindowsCILock July 14, 2025 21:01 — with GitHub Actions Failure

tahonermann suggested changes Jul 14, 2025

View reviewed changes

sycl/include/sycl/stl_wrappers/__sycl_cmath_wrapper_impl.hpp Outdated Show resolved Hide resolved

Update the header guard macro name.

be69cdd

bader temporarily deployed to WindowsCILock July 14, 2025 21:57 — with GitHub Actions Inactive

bader temporarily deployed to WindowsCILock July 14, 2025 22:27 — with GitHub Actions Inactive

[SYCL] Make sycl_device and sycl_device_only incompatible

6d52c97

npmiller temporarily deployed to WindowsCILock July 15, 2025 13:32 — with GitHub Actions Inactive

npmiller temporarily deployed to WindowsCILock July 15, 2025 14:30 — with GitHub Actions Inactive

tahonermann approved these changes Jul 15, 2025

View reviewed changes

AaronBallman approved these changes Jul 15, 2025

View reviewed changes

bader merged commit 4f0afd5 into intel:sycl Jul 15, 2025
26 checks passed

This was referenced Jul 16, 2025

[SYCL][NFC] Rename cmath-fallback test #19472

Merged

[SYCL] Refactor cmath wrapper templates #19483

Merged

This was referenced Aug 5, 2025

[CUDA] [SYCL} --fast-math causes nvptx codegen error and/or failed llvm link #7954

Closed

[CUDA] Some math function documented in C-CXX-StandardLibrary meet compile failure on NV backend with -ffast-math option #7344

Closed

[SYCL][NVPTX][AMDGCN] Move devicelib cmath to header #18706

[SYCL][NVPTX][AMDGCN] Move devicelib cmath to header #18706

Uh oh!

Conversation

npmiller commented May 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview

Breakdown

Compile-time performance

Future work

Uh oh!

npmiller commented May 28, 2025

Uh oh!

bader left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Naghasan left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

npmiller commented May 29, 2025

Uh oh!

bader commented Jul 14, 2025

Uh oh!

tahonermann left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

bader commented Jul 14, 2025

Uh oh!

npmiller commented Jul 15, 2025

Uh oh!

tahonermann left a comment

Choose a reason for hiding this comment

Uh oh!

bader commented Jul 15, 2025

Uh oh!

AaronBallman left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

bader commented Jul 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

AlexeySachkov commented Jul 17, 2025

Uh oh!

npmiller commented Jul 18, 2025

Uh oh!

AlexeySachkov commented Jul 21, 2025

Uh oh!

bader commented Jul 22, 2025

Uh oh!

Uh oh!

npmiller commented May 28, 2025 •

edited

Loading

bader commented Jul 15, 2025 •

edited

Loading