OMP implementation of Thomas algorithm #118
Issue was in the definition of the |
I think it would be better to move the outer `omp parallel` loop outside the […] (lines 39 to 40 in 9ab64bf).
This would be useful for the cache-blocked transport equation based on the Thomas algorithm. The distributed implementation uses this idea here (lines 110 to 124 in 9ab64bf).
It allows reading the input arrays only once and writing the outputs only once, which improves performance when several operations are applied to the same data. Have you had a chance to benchmark the performance?
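To make the suggestion concrete, here is a minimal sketch in Python rather than the project's Fortran. It assumes the kernels operate on a single line of data and that the caller owns the outer loop. The names `scale_line`, `shift_line` and `apply_fused` are hypothetical and do not come from the repository.

```python
def scale_line(line, factor):
    """Hypothetical per-line kernel: contains no loop over lines."""
    return [factor * v for v in line]


def shift_line(line, offset):
    """A second hypothetical per-line kernel."""
    return [v + offset for v in line]


def apply_fused(lines, factor, offset):
    """Caller-owned outer loop: each line is read once and feeds both kernels.

    In an OpenMP code this outer loop is the one that would carry the
    parallel directive, instead of each kernel looping over all lines itself.
    """
    scaled, shifted = [], []
    for line in lines:
        scaled.append(scale_line(line, factor))
        shifted.append(shift_line(line, offset))
    return scaled, shifted
```

With the loop inside each kernel, every operation would sweep the full input again. With the loop outside, one pass serves all operations, which is the read-once, write-once behaviour described above.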
Ah, good point, I'll do that. No, I've only just got this working, so it's not benchmarked yet.
Initial pass at implementing the Thomas algorithm, based on the CUDA version. It doesn't yet pass the tests; perhaps a second pair of eyes will spot something.
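For reference while reviewing, here is a minimal serial sketch of the textbook Thomas algorithm for a single tridiagonal system, written in Python. It is not the code in this PR or the CUDA version: the function name `thomas_solve` and the array layout are assumptions made for illustration, and no pivoting is performed, so it assumes a well-conditioned (for example diagonally dominant) matrix.

```python
def thomas_solve(a, b, c, d):
    """Solve a tridiagonal system A x = d with the Thomas algorithm.

    a: sub-diagonal (a[0] is unused)
    b: main diagonal
    c: super-diagonal (c[-1] is unused)
    d: right-hand side
    All four sequences have the same length n.
    """
    n = len(d)
    cp = [0.0] * n  # modified super-diagonal
    dp = [0.0] * n  # modified right-hand side

    # Forward elimination: remove the sub-diagonal row by row.
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom

    # Back substitution: recover x from the last row upwards.
    x = [0.0] * n
    x[n - 1] = dp[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

Each entry depends on the previous one, so a single solve is inherently sequential. Any parallelism has to come from solving many independent lines at once.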
@semi-h the initial value of `du(i, jm1, b)` on line 159 of `cuda/thomas.f90` would appear to be unset to me, and here the error norm of the periodic case is significantly worse. Should it be initialised outside the function, or am I misunderstanding?