OMP implementation of Thomas algorithm #118

pbartholomew08 · 2024-08-06T18:38:15Z

Initial pass at implementing the Thomas algorithm, based on the CUDA version. It doesn't pass the tests yet; perhaps a second pair of eyes will spot something.

@semi-h the initial value of du(i, jm1, b) on line 159 of cuda/thomas.f90 appears to me to be unset, and here the error norm of the periodic case is significantly worse. Should it be initialised outside the function, or am I misunderstanding?

src/omp/exec_thom.f90

pbartholomew08 · 2024-08-13T23:12:49Z

The issue was in the definition of the jm and jp array indices. The current solution still uses sum(), but I don't think this should inhibit the omp simd around the outer loop.
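For readers following the thread, here is a minimal sketch of the pattern being discussed: wrapped (periodic) neighbour indices computed per point, and a sum() over the stencil inside a loop marked with omp simd. The module, the routine apply_periodic_stencil, the index array jn and the group size SZ are hypothetical names chosen for illustration, not the code in this PR. Whether a given compiler actually vectorises the loop is best confirmed from its vectorisation report.

```fortran
module periodic_stencil_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: SZ = 8  ! assumed number of lines per group (SIMD lanes)
contains
  ! Apply a stencil of half-width m along the second dimension of u(SZ, n),
  ! wrapping periodically at both ends. The first dimension is the SIMD lane.
  subroutine apply_periodic_stencil(du, u, coeffs, n, m)
    integer, intent(in) :: n, m
    real(dp), intent(out) :: du(SZ, n)
    real(dp), intent(in) :: u(SZ, n)
    real(dp), intent(in) :: coeffs(-m:m)
    integer :: i, j, s
    integer :: jn(-m:m)  ! wrapped neighbour indices for offsets -m..m

    do j = 1, n
      ! modulo() is non-negative for a positive n, so this maps
      ! 0 -> n, n + 1 -> 1, and so on, keeping indices within 1..n.
      do s = -m, m
        jn(s) = modulo(j + s - 1, n) + 1
      end do
      !$omp simd
      do i = 1, SZ
        du(i, j) = sum(coeffs(:)*u(i, jn(:)))
      end do
      !$omp end simd
    end do
  end subroutine apply_periodic_stencil
end module periodic_stencil_sketch
```

With coeffs = [-0.5, 0, 0.5] and u(i, j) = j, interior points give 1, while the two end points pick up the wrap-around jump, which makes an off-by-one in the wrapped indices easy to detect.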

semi-h · 2024-08-14T10:49:15Z

I think it would be better to move the outer omp parallel loop outside the der_univ functions, as in the distributed algorithm implementation. We would then call these core functions, each designed to operate on a single group, in a loop, as in

x3d2/src/omp/exec_dist.f90

Lines 39 to 40 in 9ab64bf

    
    do k = 1, n_groups
      call der_univ_dist( &

This would be useful for the cache-blocked transport equation based on the Thomas algorithm. The distributed implementation uses this idea here:

x3d2/src/omp/exec_dist.f90

Lines 110 to 124 in 9ab64bf

    
    do k = 1, n_groups
      call der_univ_dist( &
        du(:, :, k), du_send_s(:, :, k), du_send_e(:, :, k), u(:, :, k), &
        u_recv_s(:, :, k), u_recv_e(:, :, k), &
        tdsops_du%coeffs_s, tdsops_du%coeffs_e, tdsops_du%coeffs, &
        n, tdsops_du%dist_fw, tdsops_du%dist_bw, tdsops_du%dist_af &
        )
      call der_univ_dist( &
        d2u(:, :, k), d2u_send_s(:, :, k), d2u_send_e(:, :, k), u(:, :, k), &
        u_recv_s(:, :, k), u_recv_e(:, :, k), &
        tdsops_d2u%coeffs_s, tdsops_d2u%coeffs_e, tdsops_d2u%coeffs, &
        n, tdsops_d2u%dist_fw, tdsops_d2u%dist_bw, &
        tdsops_d2u%dist_af &
        )

It allows the input arrays to be read only once and the outputs to be written only once, which improves performance when several operations are applied to the same data.
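A minimal sketch of that structure for a Thomas solve, under stated assumptions: thomas_group is a hypothetical single-group kernel (not the der_univ routine in this PR) that solves SZ independent tridiagonal systems sharing the coefficients a, b, c, and solve_all_groups is a hypothetical driver that owns the omp parallel loop. Both solves in the loop body read the same u(:, :, k), which is the reuse described above.

```fortran
module group_loop_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: SZ = 8  ! assumed number of lines per group
contains
  ! Hypothetical single-group kernel: Thomas algorithm for SZ systems with
  ! sub-diagonal a, diagonal b and super-diagonal c (a(1), c(n) unused).
  subroutine thomas_group(x, rhs, a, b, c, n)
    integer, intent(in) :: n
    real(dp), intent(out) :: x(SZ, n)
    real(dp), intent(in) :: rhs(SZ, n), a(n), b(n), c(n)
    real(dp) :: cp(n), denom
    integer :: i, j

    ! Forward elimination
    cp(1) = c(1)/b(1)
    !$omp simd
    do i = 1, SZ
      x(i, 1) = rhs(i, 1)/b(1)
    end do
    do j = 2, n
      denom = b(j) - a(j)*cp(j - 1)
      cp(j) = c(j)/denom
      !$omp simd
      do i = 1, SZ
        x(i, j) = (rhs(i, j) - a(j)*x(i, j - 1))/denom
      end do
    end do
    ! Back substitution
    do j = n - 1, 1, -1
      !$omp simd
      do i = 1, SZ
        x(i, j) = x(i, j) - cp(j)*x(i, j + 1)
      end do
    end do
  end subroutine thomas_group

  ! Hypothetical driver: the parallel loop lives here, outside the kernel,
  ! so several operations can be applied to one group back to back.
  subroutine solve_all_groups(x1, x2, u, a1, b1, c1, a2, b2, c2, n, n_groups)
    integer, intent(in) :: n, n_groups
    real(dp), intent(out) :: x1(SZ, n, n_groups), x2(SZ, n, n_groups)
    real(dp), intent(in) :: u(SZ, n, n_groups)
    real(dp), intent(in) :: a1(n), b1(n), c1(n), a2(n), b2(n), c2(n)
    integer :: k

    !$omp parallel do
    do k = 1, n_groups
      call thomas_group(x1(:, :, k), u(:, :, k), a1, b1, c1, n)
      call thomas_group(x2(:, :, k), u(:, :, k), a2, b2, c2, n)
    end do
    !$omp end parallel do
  end subroutine solve_all_groups
end module group_loop_sketch
```

Because the kernel only ever sees one group, the same routine could be reused inside a cache-blocked loop that does more work per group; how much this helps in practice would need benchmarking.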

Have you had a chance to benchmark the performance?

pbartholomew08 · 2024-08-14T10:51:24Z

Ah, good point, I'll do that.

No, only just got this working so not benchmarked yet.

WIP implementation of Thomas algorithm

015b0a1

pbartholomew08 requested review from Nanoseb and semi-h August 6, 2024 18:38

semi-h reviewed Aug 7, 2024

semi-h reviewed Aug 7, 2024

pbartholomew08 added 3 commits August 13, 2024 23:29

Correct indexing issue in OMP/Thomas

a04874f

Correct error in calling the Dirichlet test for OMP/Thom

eec85d5

Correct OMP/Thomas periodic indexing

6a12aa8

pbartholomew08 marked this pull request as ready for review August 13, 2024 23:13

fix test for performance measurement

8b2bae7

semi-h added this to the Benchmark OpenMP backend with TGV milestone Feb 5, 2025
