NOTE: these exercises have been tested on MI210 and MI300A accelerators using a container environment.
To see details on the container environment (such as operating system and modules available) please see README.md
on this repo.
We show a simple example on how to use the main features of the ROCm debugger rocgdb
.
Let us consider the saxpy
kernel in the HIP examples:
cd HPCTrainingExamples/HIP/saxpy
Get an allocation of a GPU and load software modules:
salloc -N 1 --gpus=1
module load rocm
You can see some information on the GPU you will be running on by doing:
rocm-smi
To introduce an error in your program, comment out the hipMalloc
calls at line 71 and 72, then compile with:
mkdir build && cd build
cmake ..
make VERBOSE=1
Running the program, you will see the expected runtime error:
./saxpy
Memory access fault by GPU node-2 (Agent handle: 0x2284d90) on address (nil). Reason: Unknown.
Aborted (core dumped)
To run the code with the rocgdb
debugger, do:
rocgdb saxpy
Note that there are also two options for graphical user interfaces that can be turned on by doing:
rocgdb -tui saxpy
cgdb -d rocgdb saxpy
For the latter command above, you need to have cgdb
installed on your system.
In the debugger, type run
(or just r
) and you will get an error similar to this one:
Thread 3 "saxpy" received signal SIGSEGV, Segmentation fault.
[Switching to thread 3, lane 0 (AMDGPU Lane 1:2:1:1/0 (0,0,0)[0,0,0])]
0x00007ffff7ec1094 in saxpy() at saxpy.cpp:57
57 y[i] += a*x[i];
Note that the cmake build type is set to RelWithDebInfo
(see line 8 in CMakeLists.txt). With this build type, the debugger will be aware of the debug symbols. If that was not the case (for instance if compiling in Release
mode), running the code with the debugger you would get an error message without line info, and also a warning like this one:
Reading symbols from saxpy...
(No debugging symbols found in saxpy)
The error report is at a thread on the GPU. We can display information on the threads by typing info threads
(or i th
). It is also possible to move to a specific thread with thread <ID>
(or t <ID>
) and see the location of this thread with where
. For instance, if we are interested in the thread with ID 1:
i th
th 1
where
You can add breakpoints with break
(or b
) followed by the line number. For instance to put a breakpoint right after the hipMalloc
lines do b 72
.
When possible, it is also advised to compile without optimization flags (so using -O0
) to avoid seeing breakpoints placed on lines different than those specified with the breakpoint command.
You can also add a breakpoint directly at the start of the GPU kernel with b saxpy
. To run to the next breakpoint, type continue
(or c
).
To list all the breakpoints that have been inserted type info break
(or i b
):
(gdb) i b
Num Type Disp Enb Address What
1 breakpoint keep y 0x000000000020b334 in main() at /HPCTrainingExamples/HIP/saxpy/saxpy.hip:74
2 breakpoint keep y 0x000000000020b350 in main() at /HPCTrainingExamples/HIP/saxpy/saxpy.hip:78
A breakpoint can be removed with delete <Num>
(or d <Num>
): note that <Num>
is the breakpoint ID displayed above. For instance, to remove the breakpoint at line 74, you have to do d 1
.
To proceed to the next line you can do next
(or n
). To step into a function, do step
(or s
) and to get out do finish
. Note that if a breakpoint is at a kernel, doing n
or s
will switch between different threads. To avoid this behavior, it is necessary to disable the breakpoint at the kernel with disable <Num>
.
It is possible to have information on the architecture (below shown on MI250):
(gdb) info agents
Id State Target Id Architecture Device Name Cores Threads Location
* 1 A AMDGPU Agent (GPUID 64146) gfx90a Aldebaran/MI200 [Instinct MI250X/MI250] 416 3328 29:00.0
We can also get information on the thread grid:
(gdb) info dispatches
Id Target Id Grid Workgroup Fence Kernel Function
* 1 AMDGPU Dispatch 1:1:1 (PKID 0) [256,1,1] [128,1,1] B|Aa|Ra saxpy(int, float const*, int, float*, int)
For the rocgdb documentation, please see: /opt/rocm-<version>/share/doc/rocgdb
.