-
Notifications
You must be signed in to change notification settings - Fork 798
[SYCL][Docs] Fix sycl_ext_oneapi_peer_access implementation and extension #19787
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: sycl
Are you sure you want to change the base?
Conversation
…sion This commit fixes the following bugs in the specification and extension: * Changes the exceptions from repeat calls to enabling and disabling to undefined behavior. The implementation did not properly issue the specified exception before. * Implement the exception from attempting to enable access between devices that do not support peer access between them. * Make the access support query return false for backends that do not support it. * Specify and implement the relaxation that the access can be enabled when both devices are the same, even if the backend doesn't support P2P. * Specify that devices need to be from the same platform, rather than with the same backend. * Add the missing feature test macro. Signed-off-by: Larsen, Steffen <[email protected]>
@@ -220,6 +220,9 @@ void device::ext_oneapi_enable_peer_access(const device &peer) { | |||
ur_device_handle_t Device = impl->getHandleRef(); | |||
ur_device_handle_t Peer = peer.impl->getHandleRef(); | |||
if (Device != Peer) { | |||
if (!ext_oneapi_can_access_peer(peer)) | |||
throw sycl::exception(make_error_code(errc::invalid), | |||
"Peer access is not allowed between the devices."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gmlueck - If we worry about the additional overhead of doing this check after requiring the user to do it first, we can make it UB instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be feature_not_supported
according to the extension spec?
Are we really worried about performance here? It seems like enabling P2P is something the application will just do once in some initialization code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be
feature_not_supported
according to the extension spec?
Ah, right you are!
Are we really worried about performance here? It seems like enabling P2P is something the application will just do once in some initialization code.
I don't particularly worry about the performance, as you are right that it should mainly be a one-and-done call. It is just redundant when the user is doing it correctly, as they are expected to have already checked the call.
Signed-off-by: Larsen, Steffen <[email protected]>
`errc::feature_not_supported` error code. | ||
|
||
Calling this function with `peer` for which access has already been enabled will | ||
result in undefined behavior. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What will our implementation do in this case?
I looked back at the discussion in #6104, and it seems like the CUDA backend will return an error in this case. (Maybe that ends up throwing an exception?) What is the Level Zero behavior?
Same question for the case below when you disable access that was never enabled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe CUDA will cause a backend error and L0 will just allow it through without issue. The issue with trying to make the L0 case return an error is that the tracking of which P2P pathways are enabled may become somewhat costly, while the user could do it themselves if they are concerned with overlapping enabling.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like @pbalcer to weigh in here, because these things will actually go through UR rather than straight to L0.
My understanding is that the L0 adapter in UR will already have to keep track of which P2P pathways are enabled, in order to avoid sharing all USM allocations with all devices (see #19257).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we have to track it anyway, I have no objections to moving back to having an exception for this.
@@ -220,6 +220,9 @@ void device::ext_oneapi_enable_peer_access(const device &peer) { | |||
ur_device_handle_t Device = impl->getHandleRef(); | |||
ur_device_handle_t Peer = peer.impl->getHandleRef(); | |||
if (Device != Peer) { | |||
if (!ext_oneapi_can_access_peer(peer)) | |||
throw sycl::exception(make_error_code(errc::invalid), | |||
"Peer access is not allowed between the devices."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be feature_not_supported
according to the extension spec?
Are we really worried about performance here? It seems like enabling P2P is something the application will just do once in some initialization code.
Signed-off-by: Larsen, Steffen <[email protected]>
Signed-off-by: Larsen, Steffen <[email protected]>
Signed-off-by: Larsen, Steffen <[email protected]>
This commit fixes the following bugs in the specification and extension: