[SYCL][Graph] Add local memory parameter update functionality #16712

fabiomestre · 2025-01-21T12:43:10Z

Updates the sycl graph specification to add the dynamic_accessor, dynamic_local_accessor and dynamic_work_group_memory classes.

This adds the required functionality to support updating local memory parameters to sycl graph kernel nodes.

Additionally, it also moves the accessor update functionality from the dynamic_parameter class to the new dynamic_accessor class. This improves the cohesion of the API and removes the need to use placeholder accessors when updating buffer arguments in sycl graphs.

Updates the sycl graph specification to add the dynamic_accessor, dynamic_local_accessor and dynamic_work_group_memory classes. This adds the required functionality to support updating local memory parameters to sycl graph kernel nodes. Additionally, it also moves the accessor update functionality from the dynamic_parameter class to the new dynamic_accessor class. This improves the cohesion of the API and removes the need to use placeholder accessors when updating buffer arguments in sycl graphs.

fabiomestre · 2025-01-21T12:50:31Z

@gmlueck These are the proposed spec changes for the new dynamic classes. It would be great if you could review them.

There is also a PR that implements dynamic_local_accessor here: #16573 (review)

sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc

gmlueck · 2025-01-27T14:55:49Z

sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc

+        command_graph<graph_state::modifiable> graph,
+        range<Dimensions> allocationSize);
+----
+|Available only when `(Dimensions > 0)`.


Rather than making this a constraint on the constructor, should there just be a requirement for the class that Dimensions is 1, 2, or 3? How do we implement the the restriction that Dimensions is not greater than 3? Is this a static_assert in the class definition?

What you suggest makes sense but I'm still a bit reluctant to deviate too much from the original spec for the local_accessor class. I think it might be a bit confusing.

The implementation there seems to rely on static_assert so we could add something similar. Adding it to the class deifnition makes sense since it makes the error more clear.

Either way is OK with me. I'll point out, though, that you have already deviated from the local_accessor spec by omitting the constructor that works for Dimension == 0. It makes sense to omit that -- I'm not suggesting that you add it back. Since you omitted that constructor, the constraint here that Dimension > 0 is a little weird because there is no alternate constructor that takes Dimension == 0.

That's fair enough. I guess with only one constructor it looks a bit weird. I will move the requirement to the class 👍

Added a static_assert that checks that the dimensions are 1, 2 or 3. And removed the Dimension > 0 from the constructor.

sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc

gmlueck · 2025-01-27T15:53:31Z

sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc

+Parameters:
+
+* `graph` - Graph which will contain the nodes that use the dynamic work group memory.
+* `num` - The number of `DataT` elements in the dynamic work group memory.


This wording is not quite right. I think DataT is an unbounded array type in this case, so num specifies the number of elements in that array, not the number of DataT elements in the dynamic_work_group_memory object.

I will update the wording to try to make this more clear 👍

I rephrased it as: "The size of the first dimension of the unbounded array DataT". Since multi-dimensional arrays are allowed, I think that we need to say that it's the first dimension of the array (the unbounded part) that is being updated. But maybe this is not the technically correct way to refer to it. Let me know if you think that the wording needs any further update.

gmlueck · 2025-01-27T17:57:05Z

sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc

+Registration happens inside the command-group that the node
+represents, and is done when the dynamic parameter is set as a parameter to the
+kernel using `handler::set_arg()`/`handler::set_args()`. It is valid for a node
+argument to be registered with more than one dynamic parameter instance.


Just making sure we are on the same page ...

You can only use set_arg / set_args to set kernel arguments for a free function kernel. Therefore, I think the current specification only supports dynamic parameters for free function kernels. If you agree, I think we should somehow make this more clear in the spec.

In the future, I think we could support dynamic parameters in normal kernels by simply capturing the dynamic_parameter (etc.) object in the lambda.

We should also be a bit more formal about the declaration of the parameter in the kernel function. With dynamic_parameter, I think people were constructing the dynamic_parameter object on the host and passing it to handler::set_arg. However, I think people, declared the type of the argument in the free function kernel as the DataT type, not as dynamic_parameter<DataT>.

For these new types like dynamic_work_group_memory, it seems like the parameter type in the free function kernel should be dynamic_work_group_memory<DataT>, not just DataT. Is that also your expectation?

This difference is not documented in the specification, and the inconsistency seems confusing. Should we change dynamic_parameter to be consistent with the others, requiring the application to define the kernel parameter type as dynamic_parameter<DataT>?

If we do this, I think we need to add a get member function to dynamic_parameter, which returns DataT& (or maybe DataT). That would also make dynamic_parameter more consistent with the other types.

Yes, the only officially supported way of using set_args is with free function kernels. I will update the wording to make that more clear. However, at the moment, free function kernels don't seem to support local_accessor or accessor classes. So we are using the "unofficial" way for internal testing. Otherwise, we have no way to test this new functionality.

I was assuming that, with free function kernels, the type would be that of the underlying object (e.g. work_group_memory). The get() member function requires a handler so I think that it cannot be used inside free function kernels. Without compiler support for the dynamic classes, I think that using the underlying type as a parameter is the only way for it to work. The specification for free function kernels is a bit vague though, so maybe I could be missing something?

Once we have compiler support, and get() functions are no longer required, maybe capturing the dynamic objects in lambdas and using the dynamic classes in free function kernels would be the most consistent way to do things.

Adding a get() member function to the dynamic_parameter class makes sense to me. It makes all the dynamic classes more consistent with each other.

However, at the moment, free function kernels don't seem to support local_accessor or accessor classes.

This might be a reason to not implement dynamic_local_accessor or dynamic_accessor right now. I don't think anyone is asking for them currently.

Without compiler support for the dynamic classes, I think that using the underlying type as a parameter is the only way for it to work. The specification for free function kernels is a bit vague though, so maybe I could be missing something?

Oh, I missed that get takes a handler parameter. I thought the main reason for adding get was to call it from inside the kernel. In fact, it seems dangerous to me to allow get to be called from host code. What is the use case for get as it is defined now?

My thinking was that the user would pass dynamic_accessor (etc.) to set_arg, and also define the kernel to take a parameter of type dynamic_accessor. The kernel would then call get inside the kernel to get an accessor, and the kernel would use the accessor as normal.

The nice thing about this code pattern is that it will also work when the kernel is not a free function.

The current definition for get() was to use it in host code in the following way:

SYCL_EXT_ONEAPI_FUNCTION_PROPERTY((exp_ext::nd_range_kernel<1>)) void ff(work_group_memory<int[]> LocalMem, int *Ptr) { size_t GlobalID = sycl::ext::oneapi::this_work_item::get_nd_item<1>() .get_global_linear_id(); size_t LocalID = sycl::ext::oneapi::this_work_item::get_nd_item<1>().get_local_linear_id(); LocalMem[LocalID] = LocalID; Ptr[GlobalID] = LocalMem[LocalID]; int main() { queue Queue{}; exp_ext::command_graph Graph{Queue.get_context(), Queue.get_device()}; int *PtrA = malloc_shared<int>(Size, Queue); exp_ext::dynamic_work_group_memory<int[]> DynWorkGroupMemory{Graph, LocalMemSize}; exp_ext::dynamic_parameter<int*> DynPtrA(Graph, PtrA); // Lambda auto Node = Graph.add([&](handler &CGH) { work_group_memory<int[]> LocalMem = DynWorkGroupMemory.get(CGH); int*& PtrARef = DynPtrA.get(CGH); CGH.parallel_for(nd_range({Size}, {LocalMemSize}), [=](nd_item<1> Item) { LocalMem[Item.get_local_linear_id()] = Item.get_local_linear_id(); PtrARef[Item.get_global_linear_id()] = LocalMem[Item.get_local_linear_id()]; }); }); // Free Function Kernel kernel_id Kernel_id = exp_ext::get_kernel_id<ff>(); kernel Kernel = Bundle.get_kernel(Kernel_id); auto Node = Graph.add([&](handler &CGH) { CGH.set_arg(0, DynWorkGroupMemory); CGH.set_arg(1, DynPtrA); CGH.parallel_for(nd_range({Size}, {LocalMemSize}), Kernel); }); }

However, after discussing your suggestion with the team, we agree that it might be better to have
the get() method in device code. This makes the API more consistent overall but requires some form of
compiler support (for the new parameter types in free function kernels). So, the example above using the
new API would look like this:

SYCL_EXT_ONEAPI_FUNCTION_PROPERTY((exp_ext::nd_range_kernel<1>)) void ff(dynamic_work_group_memory<int, 1> DynWorkGroupMemory, dynamic_parameter<int*> DynPtr) { work_group_memory<int[]> LocalMem = DynWorkGroupMemory.get(); int *&Ptr = DynPtr.get(); size_t GlobalID = sycl::ext::oneapi::this_work_item::get_nd_item<1>() .get_global_linear_id(); size_t LocalID = sycl::ext::oneapi::this_work_item::get_nd_item<1>().get_local_linear_id(); LocalMem[LocalID] = LocalID; Ptr[GlobalID] = LocalMem[LocalID]; } int main() { queue Queue{}; exp_ext::command_graph Graph{Queue.get_context(), Queue.get_device()}; int *PtrA = malloc_shared<int>(Size, Queue); exp_ext::dynamic_work_group_memory<int, 1> DynWorkGroupMemory{Graph, LocalMemSize}; exp_ext::dynamic_parameter<int*> DynPtrA(Graph, PtrA); auto Node = Graph.add([&](handler &CGH) { CGH.require(DynWorkGroupMemory); CGH.require(DynPtrA); CGH.parallel_for(nd_range({Size}, {LocalMemSize}), [=](nd_item<1> Item) { work_group_memory<int[]>& LocalMem = DynWorkGroupMemory.get(); int*& PtrARef = DynPtrA.get(); LocalMem[Item.get_local_linear_id()] = Item.get_local_linear_id(); PtrARef[Item.get_global_linear_id()] = LocalMem[Item.get_local_linear_id()]; }); }); // Free Function Kernel kernel_id Kernel_id = exp_ext::get_kernel_id<ff>(); kernel Kernel = Bundle.get_kernel(Kernel_id); auto Node = Graph.add([&](handler &CGH) { CGH.set_arg(0, DynWorkGroupMemory); CGH.set_arg(1, DynPtrA); CGH.parallel_for(nd_range({Size}, {LocalMemSize}), Kernel); }); }

We looked into what it would take to add the compiler support for free function kernels and it
seems easier than we initially thought it would be. So we are considering adding support for that
and for removing the need to use set_arg on lambdas as well.

We are also thinking about only commiting an implementation of dynamic_accessor
and dynamic_local_accessor after this compiler support is added. As you rightly pointed out,
those classes don't make sense at the moment. However, once compiler support is added, it would
be possible to officially support them with lambdas.

Oh, I see. I was confused about what you had in mind for get. Given that usage, I think "get" is the wrong name. I'd choose "require" instead, leading to a usage like this:

SYCL_EXT_ONEAPI_FUNCTION_PROPERTY((exp_ext::nd_range_kernel<1>)) void ff(work_group_memory<int[]> LocalMem, int *Ptr) { size_t GlobalID = sycl::ext::oneapi::this_work_item::get_nd_item<1>() .get_global_linear_id(); size_t LocalID = sycl::ext::oneapi::this_work_item::get_nd_item<1>().get_local_linear_id(); LocalMem[LocalID] = LocalID; Ptr[GlobalID] = LocalMem[LocalID]; int main() { queue Queue{}; exp_ext::command_graph Graph{Queue.get_context(), Queue.get_device()}; int *PtrA = malloc_shared<int>(Size, Queue); exp_ext::dynamic_work_group_memory<int[]> DynWorkGroupMemory{Graph, LocalMemSize}; exp_ext::dynamic_parameter<int*> DynPtrA(Graph, PtrA); // Lambda auto Node = Graph.add([&](handler &CGH) { work_group_memory<int[]> LocalMem = DynWorkGroupMemory.require(CGH); int*& PtrARef = DynPtrA.require(CGH); CGH.parallel_for(nd_range({Size}, {LocalMemSize}), [=](nd_item<1> Item) { LocalMem[Item.get_local_linear_id()] = Item.get_local_linear_id(); PtrARef[Item.get_global_linear_id()] = LocalMem[Item.get_local_linear_id()]; }); }); // Free Function Kernel kernel_id Kernel_id = exp_ext::get_kernel_id<ff>(); kernel Kernel = Bundle.get_kernel(Kernel_id); auto Node = Graph.add([&](handler &CGH) { CGH.set_arg(0, DynWorkGroupMemory.require(CGH)); CGH.set_arg(1, DynPtrA.require(CGH)); CGH.parallel_for(nd_range({Size}, {LocalMemSize}), Kernel); }); }

Note how I changed the free function case also to call require. This seems consistent with the lambda case. In both cases, you call require to get the non-dynamic thing, then you pass the non-dynamic thing as a kernel argument. In the lambda, you simply capture the variable to pass it as a parameter. In the free-function case, you need to call set_arg.

Alternatively, we can pass the dynamic thing as a kernel parameter, and then call get inside the kernel, just as you have in your second code example. This is also consistent because the kernel parameter is the dynamic thing for both lambda kernels and free-function kernels.

I could be happy with either of these programming models. Do you have a preference? What are the pros and cons that you see?

Before you answer, let's look at how the free-function kernel case looks with the new sycl_ext_oneapi_enqueue_functions extension. In the first option:

// Free function Kernel with new enqueue functions auto Node = Graph.add([&](handler &CGH) { work_group_memory<int[]> LocalMem = DynWorkGroupMemory.require(CGH); int*& PtrARef = DynPtrA.require(CGH); nd_launch(CGH, nd_range({Size}, {LocalMemSize}), Kernel, LocalMem, PtrARef); });

And in the second option:

// Free function Kernel with new enqueue functions auto Node = Graph.add([&](handler &CGH) { nd_launch(CGH, nd_range({Size}, {LocalMemSize}), Kernel, DynWorkGroupMemory, DynPtrA); });

Note how we eliminate the ugly set_arg calls.

Apologies for the late reply, I was on holidays last week.

I could be happy with either of these programming models. Do you have a preference? What are the pros and cons that you see?

After looking further into what it would take to add compiler support for lambdas, we think that the approach that uses require() in the host code is probably not going to work for that case. To support lambdas, we need to capture the dynamic object in the lambda, so that later we can find which argument index it has based on the information that the compiler provides.

If we only capture the underlying object (as would happen in the version that has require()), the compiler would have no knowledge about the dynamic_parameter argument, preventing us from finding which argument index the dynamic parameter is associated with.

I think that the only option that works for all cases is the one that relies on capturing the dynamic object in the lambda (and using get() in device code).

// Free function Kernel with new enqueue functions auto Node = Graph.add([&](handler &CGH) { nd_launch(CGH, nd_range({Size}, {LocalMemSize}), Kernel, DynWorkGroupMemory, DynPtrA); });

This looks very clean. And I think it is also potentially less confusing to not have to use DynWorkGroupMemory.require(). There is another require() that is a member function of the handler and returns void. So that could be confusing.

@AerialMantis also pointed out that we might need to overload the current handler.require() member function for the new dynamic classes. This would allow supporting library / host only implementations of sycl that cannot rely on extra compiler information for the kernel arguments.

OK, let's go with that approach then. I agree the usage with the new nd_launch looks very clean.

@AerialMantis also pointed out that we might need to overload the current handler.require() member function for the new dynamic classes. This would allow supporting library / host only implementations of sycl that cannot rely on extra compiler information for the kernel arguments.

I would like to avoid this if possible. I also don't see why it would be necessary. You are asserting that a library-only implementation would need an explicit function call to associate the dynamic parameter to the handler:

handler.require(dynamic);

However, it seems like the set_arg call has all the same information, so it could also provide this association information:

handler.set_arg(n, dynamic);

And the new nd_launch syntax also has enough information to associate the dynamic parameter to the handler:

nd_launch(handler, ndr, Kernel, dynamic);

In the latest version, I removed the handler parameter from the get() function and rephrased the spec to make it clear that the get() function should be used in device code. Also removed any references to needing to use set_args() since that is only a requirement for free function kernels / kernel bundles.

gmlueck · 2025-01-27T17:59:07Z

sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc

-                  dynamicParamAcc)
+template <typename T>
+void handler::set_arg(int argIndex,
+        ext::oneapi::experimental::dynamic_parameter<T> &dynamicParam);


Why do we need these special overloads of set_arg? The core SYCL specification defines these with an unconstrained template parameter, so you can call them with any kernel parameter type, including dynamic_parameter.

The wording and behaviour of set_arg() in the core SYCL specification is different from the one in the dynamic classes. We could specialize the existing template for the dynamic types but, from an implementation point of view, I'm not sure if that approach would have any advantage compared to what we are doing at the moment. And from a specification point of view, it makes it harder to tell the reader about the special behaviour and exceptions that exist when using set_args() with dynamic classes.

There is also some precedent for using overloads since the raw_kernel_arg extension is doing the same.

OK, that makes sense.

Ohh, I just read this thread again. I think this relates to the other one about "require". If we go with the "require" option, I think we don't need these special overloads of set_arg. All the special semantics about registering the dynamic parameter with the graph node would be handled by dynamic_parameter::require.

Another thing to keep in mind is that the nd_launch API from sycl_ext_oneapi_enqueue_functions does not use set_arg.

As discussed, I deleted all the overloads for set_arg and created a new section that mentions which exceptions the dynamic parameter classes can throw when used in the wrong context.

- Specify new usage for dynamic parameters after compiler support is added. - Update get() function to be used only in device code. - Remove set_arg() overloads - Clarify template parameters limitations and add static_asserts to class definition. - Fix wording of work_group_memory parameters.

gmlueck · 2025-02-14T13:48:14Z

Let me know when you want me to review this again. I saw you pushed another commit, but I wasn't sure if you have more to come.

fabiomestre · 2025-02-14T18:54:53Z

Let me know when you want me to review this again. I saw you pushed another commit, but I wasn't sure if you have more to come.

I pushed another commit after that but it's only updating the usage guide with a few more examples.

Feel free to review the PR again. I think I addressed all the unresolved comments.

gmlueck

I think this looks really good. I have one minor comment below, but I'm approving anyway. If you agree with my comment, it would be nice to address it before merging,

gmlueck · 2025-02-26T21:50:44Z

sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc

+The type of the underlying object that a dynamic parameter represents is set at
+compile time using the `ValueT` parameter. This underlying type can be a pointer
+to a USM allocation, scalar passed by value, or a raw byte representation of the
+argument. The raw byte representation is intended to enable updating arguments set


kind of a nit, but can't you use this also for a structure parameter? A struct is not a scalar. Maybe the real restriction is that ValyeT should be device copyable?

fabiomestre requested a review from gmlueck January 21, 2025 12:43

gmlueck reviewed Jan 27, 2025

View reviewed changes

EwanC changed the title ~~Add local memory parameter update functionality to sycl graphs.~~ [SYCL][Graph] Add local memory parameter update functionality Jan 28, 2025

EwanC mentioned this pull request Jan 28, 2025

[SYCL][Graph] Implement Dynamic Local Accessors #16573

Draft

Update the usage guide with the new dynamic classes

7361cd4

Fix typos in usage guide examples

e9201c9

gmlueck approved these changes Feb 26, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SYCL][Graph] Add local memory parameter update functionality #16712

[SYCL][Graph] Add local memory parameter update functionality #16712

fabiomestre commented Jan 21, 2025

fabiomestre commented Jan 21, 2025

gmlueck Jan 27, 2025

fabiomestre Jan 28, 2025

gmlueck Jan 28, 2025

fabiomestre Jan 31, 2025

fabiomestre Feb 14, 2025

gmlueck Jan 27, 2025

fabiomestre Jan 28, 2025

fabiomestre Feb 14, 2025

gmlueck Jan 27, 2025

fabiomestre Jan 28, 2025

gmlueck Jan 28, 2025

fabiomestre Jan 31, 2025

gmlueck Jan 31, 2025

fabiomestre Feb 11, 2025 •

edited

Loading

gmlueck Feb 11, 2025

fabiomestre Feb 14, 2025

gmlueck Jan 27, 2025

fabiomestre Jan 28, 2025

gmlueck Jan 28, 2025

gmlueck Jan 31, 2025

fabiomestre Feb 14, 2025

gmlueck commented Feb 14, 2025

fabiomestre commented Feb 14, 2025 •

edited

Loading

gmlueck left a comment

gmlueck Feb 26, 2025

[SYCL][Graph] Add local memory parameter update functionality #16712

Are you sure you want to change the base?

[SYCL][Graph] Add local memory parameter update functionality #16712

Conversation

fabiomestre commented Jan 21, 2025

fabiomestre commented Jan 21, 2025

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fabiomestre Feb 11, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gmlueck commented Feb 14, 2025

fabiomestre commented Feb 14, 2025 • edited Loading

gmlueck left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fabiomestre Feb 11, 2025 •

edited

Loading

fabiomestre commented Feb 14, 2025 •

edited

Loading