Skip to content

Latest commit

 

History

History
 
 

Rocgdb

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 

ROCgdb

NOTE: these exercises have been tested on MI210 and MI300A accelerators using a container environment. To see details on the container environment (such as operating system and modules available) please see README.md on this repo.

We show a simple example on how to use the main features of the ROCm debugger rocgdb.

Saxpy Debugging

Let us consider the saxpy kernel in the HIP examples:

cd HPCTrainingExamples/HIP/saxpy

Get an allocation of a GPU and load software modules:

salloc -N 1 --gpus=1
module load rocm

You can see some information on the GPU you will be running on by doing:

rocm-smi

To introduce an error in your program, comment out the hipMalloc calls at line 71 and 72, then compile with:

mkdir build && cd build
cmake ..
make VERBOSE=1

Running the program, you will see the expected runtime error:

./saxpy
Memory access fault by GPU node-2 (Agent handle: 0x2284d90) on address (nil). Reason: Unknown.
Aborted (core dumped)

To run the code with the rocgdb debugger, do:

rocgdb saxpy

Note that there are also two options for graphical user interfaces that can be turned on by doing:

rocgdb -tui saxpy
cgdb -d rocgdb saxpy 

For the latter command above, you need to have cgdb installed on your system.

In the debugger, type run (or just r) and you will get an error similar to this one:

Thread 3 "saxpy" received signal SIGSEGV, Segmentation fault.
[Switching to thread 3, lane 0 (AMDGPU Lane 1:2:1:1/0 (0,0,0)[0,0,0])]
0x00007ffff7ec1094 in saxpy() at saxpy.cpp:57
57    y[i] += a*x[i];

Note that the cmake build type is set to RelWithDebInfo (see line 8 in CMakeLists.txt). With this build type, the debugger will be aware of the debug symbols. If that was not the case (for instance if compiling in Release mode), running the code with the debugger you would get an error message without line info, and also a warning like this one:

Reading symbols from saxpy...
(No debugging symbols found in saxpy)

The error report is at a thread on the GPU. We can display information on the threads by typing info threads (or i th). It is also possible to move to a specific thread with thread <ID> (or t <ID>) and see the location of this thread with where. For instance, if we are interested in the thread with ID 1:

i th
th 1
where

You can add breakpoints with break (or b) followed by the line number. For instance to put a breakpoint right after the hipMalloc lines do b 72.

When possible, it is also advised to compile without optimization flags (so using -O0) to avoid seeing breakpoints placed on lines different than those specified with the breakpoint command.

You can also add a breakpoint directly at the start of the GPU kernel with b saxpy. To run to the next breakpoint, type continue (or c).

To list all the breakpoints that have been inserted type info break (or i b):

(gdb) i b
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   0x000000000020b334 in main() at /HPCTrainingExamples/HIP/saxpy/saxpy.hip:74
2       breakpoint     keep y   0x000000000020b350 in main() at /HPCTrainingExamples/HIP/saxpy/saxpy.hip:78

A breakpoint can be removed with delete <Num> (or d <Num>): note that <Num> is the breakpoint ID displayed above. For instance, to remove the breakpoint at line 74, you have to do d 1.

To proceed to the next line you can do next (or n). To step into a function, do step (or s) and to get out do finish. Note that if a breakpoint is at a kernel, doing n or s will switch between different threads. To avoid this behavior, it is necessary to disable the breakpoint at the kernel with disable <Num>.

It is possible to have information on the architecture (below shown on MI250):

(gdb) info agents
  Id State Target Id                  Architecture Device Name                             Cores Threads Location
* 1  A     AMDGPU Agent (GPUID 64146) gfx90a       Aldebaran/MI200 [Instinct MI250X/MI250] 416   3328    29:00.0

We can also get information on the thread grid:

(gdb) info dispatches
  Id   Target Id                      Grid      Workgroup Fence   Kernel Function
* 1    AMDGPU Dispatch 1:1:1 (PKID 0) [256,1,1] [128,1,1] B|Aa|Ra saxpy(int, float const*, int, float*, int)

For the rocgdb documentation, please see: /opt/rocm-<version>/share/doc/rocgdb.