
Reduce memory overhead for HIP backends on MI300A GPUs #1734

Open

zatkins-dev wants to merge 10 commits into main
Conversation

zatkins-dev (Collaborator) commented Jan 24, 2025

Prevents double allocations for CeedVector when using HIP vectors with unified addressing and XNACK.
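
For context, a minimal sketch of the idea (not the PR's actual code; device_has_unified_memory and alloc_vector_storage are hypothetical helpers): when the device reports host-accessible unified memory, as on MI300A with XNACK enabled, a single allocation can back both the host and device views of a vector instead of two mirrored buffers.

#include <hip/hip_runtime.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper: does this device support managed/unified memory? */
static bool device_has_unified_memory(int device_id) {
  int value = 0;
  hipDeviceGetAttribute(&value, hipDeviceAttributeManagedMemory, device_id);
  return value != 0;
}

/* Hypothetical helper: allocate once when unified memory is available,
   otherwise fall back to separate (mirrored) host and device buffers. */
static int alloc_vector_storage(size_t length, double **host_array, double **device_array, int device_id) {
  if (device_has_unified_memory(device_id)) {
    if (hipMallocManaged((void **)device_array, length * sizeof(double), hipMemAttachGlobal) != hipSuccess) return 1;
    *host_array = *device_array;  /* one allocation, two views: no double allocation, no copies */
  } else {
    *host_array = malloc(length * sizeof(double));
    if (!*host_array || hipMalloc((void **)device_array, length * sizeof(double)) != hipSuccess) return 1;
  }
  return 0;
}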

Also, updates more of the HIP vector operations to use hipBLAS functions rather than custom kernels.
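
As an illustration of the second point (a sketch, not the PR's code; scale_device_array is a hypothetical helper, and the hipBLAS include path depends on the ROCm version), a vector scaling that would otherwise need a hand-written kernel can be delegated to hipBLAS:

#include <hipblas/hipblas.h>

/* Hypothetical helper: y <- alpha * y on device memory via hipBLAS instead of a custom kernel. */
static int scale_device_array(hipblasHandle_t handle, int n, double alpha, double *d_y) {
  /* hipblasDscal scales a device array in place; alpha is read from the host
     (default HIPBLAS_POINTER_MODE_HOST). */
  return (hipblasDscal(handle, n, &alpha, d_y, 1) == HIPBLAS_STATUS_SUCCESS) ? 0 : 1;
}
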

@@ -114,6 +107,7 @@ static int CeedBasisApplyAtPointsCore_Hip(CeedBasis basis, bool apply_add, const
const CeedScalar *d_x, *d_u;
CeedScalar *d_v;
CeedBasis_Hip *data;
Ceed_Hip *hip_data;
Member

another stray

@@ -126,6 +120,7 @@ static int CeedBasisApplyAtPointsCore_Hip(CeedBasis basis, bool apply_add, const
}

CeedCallBackend(CeedBasisGetCeed(basis, &ceed));
CeedCallBackend(CeedGetData(ceed, &hip_data));
Member

and here

CeedVector_Hip *impl;

CeedCallBackend(CeedVectorGetData(vec, &impl));
CeedCallHip(CeedVectorReturnCeed(vec), hipDeviceSynchronize());
Collaborator (Author)

Ratel seems to work fine without this line, and is faster

Member

Does CeedVectorSyncArray mean that one could immediately start an MPI_Send? If the host doesn't know that the previous kernel (writing to the array) has completed, then it would be racy to call MPI_Send. (Might be rare to trip, but we don't want that kind of bug.)

If our sends are using a kernel for packing (on the same stream), then the host doesn't need to know when the earlier stuff completes, but we still need to sync after the packing kernel.
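
A sketch of the hazard being described (hypothetical helper; it assumes the buffer is host-accessible or the MPI library is GPU-aware): the host must wait for the kernel that produced the data before handing the buffer to MPI_Send.

#include <hip/hip_runtime.h>
#include <mpi.h>

/* Hypothetical helper: send an array that was just written by a kernel on `stream`. */
static void send_after_kernel(const double *array, int n, int dest, MPI_Comm comm, hipStream_t stream) {
  /* Without this synchronization the kernel may still be writing `array`
     when MPI_Send reads it, which is exactly the race described above. */
  hipStreamSynchronize(stream);
  MPI_Send(array, n, MPI_DOUBLE, dest, /* tag */ 0, comm);
}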

Collaborator (Author)

That's a fair point. I think we need to be a bit more careful and only sync when the host needs the data; otherwise this acts as a hard sync with the GPU, which seems to have performance impacts.
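
A sketch of that "sync only when the host needs the data" idea (hypothetical names, not the backend's actual data structures): track outstanding device writes and synchronize lazily when a host pointer is actually requested.

#include <hip/hip_runtime.h>
#include <stdbool.h>

/* Hypothetical toy vector: one host-visible (unified) array plus bookkeeping. */
typedef struct {
  double     *array;            /* host-visible storage (unified memory assumed) */
  bool        device_is_dirty;  /* true if device kernels may still be writing */
  hipStream_t stream;           /* stream the device work was enqueued on */
} ToyVector;

/* Called by device-side operations: remember that a sync will be needed later. */
static void mark_device_write(ToyVector *vec) { vec->device_is_dirty = true; }

/* Called when the host actually reads the data: the hard sync happens only here. */
static double *get_host_array(ToyVector *vec) {
  if (vec->device_is_dirty) {
    hipStreamSynchronize(vec->stream);
    vec->device_is_dirty = false;
  }
  return vec->array;
}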

jrwrigh (Collaborator) commented Feb 26, 2025

FYI, generally prefer rebase to merge for dev branches. It doesn't matter for squash merges (the commit history gets nuked anyway), but for normal merges it helps keep the git history more regular.

zatkins-dev (Collaborator, Author)
> FYI, generally prefer rebase to merge for dev branches. It doesn't matter for squash merges (the commit history gets nuked anyway), but for normal merges it helps keep the git history more regular.

Yeah, generally I agree. I probably need to strip down this branch and rebuild it; it's currently a mess due to changes at the AMD workshop.

4 participants