This is a simple project for learning CUDA through matrix multiplication (8192 * 8192 * 8192).
- RTX 3060 Max-Q 6GB on laptop, max power 115W
- CUDA cores: 3840
- Max frequency: 2100MHz
- When running, the actual frequency is roughly 1920MHz
- Memory bus width: 192bit
- Memory frequency: 6000MHz
Max memory bandwidth:
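Max mul-add throughput: 3840 cores * 1920MHz * 2 FLOP = 14.7456 TFLOPS (at the observed running frequency; the "Max" percentages below are consistent with this value)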
1 FMA is counted as 2 FLOP
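The throughput numbers in this log can be reproduced with a little arithmetic. The sketch below is host-side C++ and is not taken from the project; the function names are hypothetical. It assumes the peak is one FMA per CUDA core per clock at the observed 1920MHz, which is the assumption that matches the percentages reported here.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// FLOP count of an n x n x n matmul: n^3 FMAs, each counted as 2 FLOP.
double matmul_flop(std::uint64_t n) {
    const double d = static_cast<double>(n);
    return 2.0 * d * d * d;
}

// Achieved throughput in TFLOPS for a measured kernel time in milliseconds.
double achieved_tflops(std::uint64_t n, double time_ms) {
    return matmul_flop(n) / (time_ms * 1e-3) / 1e12;
}

// Peak throughput in TFLOPS, assuming one FMA (2 FLOP) per core per clock.
double peak_tflops(int cuda_cores, double clock_mhz) {
    return cuda_cores * clock_mhz * 1e6 * 2.0 / 1e12;
}

// Achieved throughput as a percentage of the assumed peak.
double percent_of_peak(std::uint64_t n, double time_ms,
                       int cuda_cores, double clock_mhz) {
    return 100.0 * achieved_tflops(n, time_ms) / peak_tflops(cuda_cores, clock_mhz);
}
```

For example, `achieved_tflops(8192, 135.825)` gives about 8.095 TFLOPS, and `percent_of_peak(8192, 135.825, 3840, 1920.0)` gives about 54.9%.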
matmul time: 17463ms
Throughput: 0.06 TFLOPS (0.43%)
matmul time: 7625ms
Throughput: 0.144 TFLOPS (0.978%)
matmul time: 6674ms
Throughput: 0.165 TFLOPS (1.117%)
matmul time: 3611ms
Throughput: 0.304 TFLOPS (2.065%)
Pre-load part of A/B to local memory.
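The loop structure behind this kind of pre-loading can be sketched on the CPU. The code below is a C++ illustration only, not the project's kernel: it assumes "local memory" means a small per-block staging buffer (on the GPU this would typically be a `__shared__` array), and `TILE`, `tiled_matmul` and `naive_matmul` are hypothetical names. Each tile of A and B is copied once into the staging buffers and then reused `TILE` times, which is where the reduction in slow-memory traffic comes from.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t TILE = 4;  // hypothetical tile edge, chosen small for the sketch

// Reference: C = A * B for row-major n x n matrices.
std::vector<float> naive_matmul(const std::vector<float>& A,
                                const std::vector<float>& B, std::size_t n) {
    std::vector<float> C(n * n, 0.0f);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t k = 0; k < n; ++k)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
    return C;
}

// Tiled version: n must be a multiple of TILE.
std::vector<float> tiled_matmul(const std::vector<float>& A,
                                const std::vector<float>& B, std::size_t n) {
    assert(n % TILE == 0);
    std::vector<float> C(n * n, 0.0f);
    float tileA[TILE][TILE];  // stands in for the per-block staging buffer of A
    float tileB[TILE][TILE];  // stands in for the per-block staging buffer of B
    for (std::size_t bi = 0; bi < n; bi += TILE) {          // block row
        for (std::size_t bj = 0; bj < n; bj += TILE) {      // block column
            for (std::size_t k0 = 0; k0 < n; k0 += TILE) {  // walk along K
                // Pre-load one tile of A and one tile of B.
                for (std::size_t i = 0; i < TILE; ++i)
                    for (std::size_t j = 0; j < TILE; ++j) {
                        tileA[i][j] = A[(bi + i) * n + (k0 + j)];
                        tileB[i][j] = B[(k0 + i) * n + (bj + j)];
                    }
                // Accumulate the partial products using only the staged tiles.
                for (std::size_t i = 0; i < TILE; ++i)
                    for (std::size_t j = 0; j < TILE; ++j) {
                        float acc = C[(bi + i) * n + (bj + j)];
                        for (std::size_t k = 0; k < TILE; ++k)
                            acc += tileA[i][k] * tileB[k][j];
                        C[(bi + i) * n + (bj + j)] = acc;
                    }
            }
        }
    }
    return C;
}
```

In a real kernel the two inner copy loops would be spread across the threads of a block, with a barrier between the load and compute phases.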
matmul time: 736ms
Throughput: 1.494 TFLOPS (10.131%)
matmul time: 492ms
Throughput: 2.235 TFLOPS (15.156%)
matmul time: 335ms
Throughput: 3.282 TFLOPS (22.258%)
matmul time: 249ms
Throughput: 4.416 TFLOPS (29.946%)
matmul time: 225ms
Throughput: 4.887 TFLOPS (33.140%)
Throughput: 0.874 TFLOPS
(5.925% Max)
(10.619% cuBLAS)
No improvement.
It looks like we treat 32 threads as one 32 x f32 vector, and they do the work 32 times for each row in the block.
matmul time: 381.333 ms
Throughput: 2.883 TFLOPS
(19.554% Max)
(35.043% cuBLAS)
After adjusting the load loop, performance improves slightly.
matmul time: 347.733 ms
Throughput: 3.162 TFLOPS
(21.443% Max)
(38.429% cuBLAS)
TODO: has bug
Letting each thread write directly to global memory saves 1/3 of the shared memory, with no decrease in performance.
matmul time: 308.433 ms
Throughput: 3.565 TFLOPS
(24.176% Max)
(43.326% cuBLAS)
Block-Warp-Thread, with vectorized gmem load/store.
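A block-warp-thread hierarchy boils down to an index mapping from a thread id to the sub-tile of the output that the thread owns. The host-side C++ sketch below shows one possible mapping; all tile sizes (`BM`, `BN`, `WM`, `WN`, `TM`, `TN`) and the names are hypothetical and not taken from the project. The `coverage` helper checks that every element of the block tile is owned by exactly one thread.

```cpp
#include <cassert>
#include <vector>

// Hypothetical tile sizes, chosen only for illustration.
constexpr int BM = 128, BN = 128;  // block tile (rows x cols of C per block)
constexpr int WM = 64,  WN = 32;   // warp tile
constexpr int TM = 8,   TN = 8;    // thread tile
constexpr int WARP_SIZE = 32;
constexpr int WARPS_PER_ROW = BN / WN;            // warps side by side in a block tile
constexpr int NUM_WARPS = (BM / WM) * (BN / WN);  // warps per block
constexpr int LANES_PER_ROW = WN / TN;            // lanes side by side in a warp tile
static_assert((WM / TM) * (WN / TN) == WARP_SIZE,
              "a warp tile must split into exactly 32 thread tiles");

struct TileOrigin { int row; int col; };

// Top-left corner, inside the block tile, of the thread tile owned by thread_id.
TileOrigin thread_tile_origin(int thread_id) {
    const int warp = thread_id / WARP_SIZE;
    const int lane = thread_id % WARP_SIZE;
    const int warp_row = (warp / WARPS_PER_ROW) * WM;
    const int warp_col = (warp % WARPS_PER_ROW) * WN;
    const int lane_row = (lane / LANES_PER_ROW) * TM;
    const int lane_col = (lane % LANES_PER_ROW) * TN;
    return {warp_row + lane_row, warp_col + lane_col};
}

// For each element of the block tile, count how many threads own it.
std::vector<int> coverage() {
    std::vector<int> hits(BM * BN, 0);
    for (int t = 0; t < NUM_WARPS * WARP_SIZE; ++t) {
        const TileOrigin o = thread_tile_origin(t);
        for (int i = 0; i < TM; ++i)
            for (int j = 0; j < TN; ++j)
                ++hits[(o.row + i) * BN + (o.col + j)];
    }
    return hits;
}
```

With a thread tile width that is a multiple of 4 floats, each row of a thread tile could be moved as 16-byte `float4` values in an actual kernel, provided the addresses are 16-byte aligned; that part is not shown here.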
matmul time: 365.567 ms
Throughput: 3.008 TFLOPS
(20.397% Max)
(36.554% cuBLAS)
matmul time: 174.800 ms
Throughput: 6.290 TFLOPS
(42.658% Max)
(76.448% cuBLAS)
matmul time: 174.467 ms
Throughput: 6.302 TFLOPS
(42.739% Max)
(76.594% cuBLAS)
matmul time: 168.333 ms
Throughput: 6.532 TFLOPS
(44.296% Max)
(79.384% cuBLAS)
matmul time: 163.075 ms
Throughput: 6.742 TFLOPS
(45.725% Max)
(81.944% cuBLAS)
matmul time: 149.475 ms
Throughput: 7.356 TFLOPS
(49.885% Max)
(89.400% cuBLAS)
matmul time: 135.825 ms
Throughput: 8.095 TFLOPS
(54.898% Max)
(98.384% cuBLAS)