Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OMP implementation of Thomas algorithm #118

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

pbartholomew08
Copy link
Member

Initial pass at implementing Thomas algorithm based on the CUDA version, it doesn't yet pass the tests, perhaps a 2nd pair of eyes will spot something.

@semi-h the initial value of du(i, jm1, b) on line 159 of cuda/thomas.f90 would appear to be unset to me and here the error norm of the periodic case is significantly worse. Should it be initialised outside the function or I'm misunderstanding?

@pbartholomew08
Copy link
Member Author

Issue was in the definition of the jm and jp array indices. Current solution still uses sum() I don't think this should inhibit the omp simd around the outer loop.

@pbartholomew08 pbartholomew08 marked this pull request as ready for review August 13, 2024 23:13
@semi-h
Copy link
Member

semi-h commented Aug 14, 2024

I think it would be better to move the outer omp parallel loop outside the der_univ functions like in the distributed algorithm implementation. Then we call these core functions designed to operate on a single group in a loop as in

do k = 1, n_groups
call der_univ_dist( &

This would be useful for the cache-blocked transport equation based on Thomas algorithm. Distributed implementation uses this idea here

x3d2/src/omp/exec_dist.f90

Lines 110 to 124 in 9ab64bf

do k = 1, n_groups
call der_univ_dist( &
du(:, :, k), du_send_s(:, :, k), du_send_e(:, :, k), u(:, :, k), &
u_recv_s(:, :, k), u_recv_e(:, :, k), &
tdsops_du%coeffs_s, tdsops_du%coeffs_e, tdsops_du%coeffs, &
n, tdsops_du%dist_fw, tdsops_du%dist_bw, tdsops_du%dist_af &
)
call der_univ_dist( &
d2u(:, :, k), d2u_send_s(:, :, k), d2u_send_e(:, :, k), u(:, :, k), &
u_recv_s(:, :, k), u_recv_e(:, :, k), &
tdsops_d2u%coeffs_s, tdsops_d2u%coeffs_e, tdsops_d2u%coeffs, &
n, tdsops_d2u%dist_fw, tdsops_d2u%dist_bw, &
tdsops_d2u%dist_af &
)

It allows reading input arrays only once, and also writing outputs only once. Improves performance when working on multiple operations.

Have you had a chance to benchmark the performance?

@pbartholomew08
Copy link
Member Author

Ah, good point, I'll do that.

No, only just got this working so not benchmarked yet.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants