NOTE: these exercises have been tested on MI210 and MI300A accelerators using a container environment. For details on the container environment (such as the operating system and the available modules), please see the README.md in this repo.
We discuss an example of how to use the tools from `rocprof`.
First, set up the environment:

```bash
salloc --cpus-per-task=8 --mem=0 --ntasks-per-node=4 --gpus=1
module load rocm
```
Download the examples repo and navigate to the HIPIFY exercises:

```bash
cd ~/HPCTrainingExamples/HIPIFY/mini-nbody/hip/
```
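If `~/HPCTrainingExamples` is not already present on the system, you can clone it first. The command below assumes the public AMD repository:

```bash
git clone https://github.com/amd/HPCTrainingExamples.git ~/HPCTrainingExamples
```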
Update the bash scripts to use `$ROCM_PATH`:

```bash
sed -i 's/\/opt\/rocm/${ROCM_PATH}/g' *.sh
```
Compile and run the `nbody-orig.hip` program (the script below will do both, for several values of `nBodies`):

```bash
./HIP-nbody-orig.sh
```
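For reference, a shmoo script of this kind boils down to compiling once and then running the binary over a range of problem sizes. The sketch below is illustrative only and may differ from the actual contents of `HIP-nbody-orig.sh`:

```bash
# Illustrative sketch only; the real script may use different sizes.
for nBodies in 1024 2048 4096 8192 16384 32768 65536; do
    ./nbody-orig "$nBodies"
done
```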
To compile explicitly without `make`, you can do (considering for example `nbody-orig`):

```bash
hipcc -I../ -DSHMOO nbody-orig.hip -o nbody-orig
```
And then run with:

```bash
./nbody-orig <nBodies>
```
The procedure for compiling and running a single example applies to the other programs in the directory. The default value of `nBodies` is 30000 for all the examples.
Run `rocprof` to obtain the hotspots list (considering for example `nbody-orig`):

```bash
rocprof --stats --basenames on nbody-orig 65536
```
In the above command, the `--basenames on` flag removes the kernel arguments from the output, for ease of reading. Throughout this example, we will always use 65536 as the value of `nBodies`, since `nBodies` is used to define the number of work groups in the thread grid:

```c
nBlocks = (nBodies + BLOCK_SIZE - 1) / BLOCK_SIZE
```
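To make the relationship concrete, here is a minimal sketch of how such a launch is typically written in HIP; `BLOCK_SIZE`, the `bodyForce` name, and the argument list are assumptions for illustration and may differ from the actual `nbody-orig.hip` source:

```cpp
#include <hip/hip_runtime.h>

#define BLOCK_SIZE 256  // assumed work-group size, for illustration only

// Hypothetical kernel; the real bodyForce in nbody-orig.hip differs.
__global__ void bodyForce(float4 *p, float dt, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) { /* compute forces on body i */ }
}

void launch(float4 *d_p, float dt, int nBodies) {
  // Round up so all bodies are covered even when nBodies
  // is not a multiple of BLOCK_SIZE.
  int nBlocks = (nBodies + BLOCK_SIZE - 1) / BLOCK_SIZE;

  // rocprof's grd column reports nBlocks * BLOCK_SIZE (total work-items),
  // and wgr reports BLOCK_SIZE (work-items per work group).
  bodyForce<<<nBlocks, BLOCK_SIZE>>>(d_p, dt, nBodies);
}
```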
Check `results.csv` to find, for each invocation of each kernel, details such as grid size (`grd`), workgroup size (`wgr`), LDS used (`lds`), scratch used if register spilling happened (`scr`), number of SGPRs and VGPRs used, etc. Note that the grid size is the total number of work-items (threads), not the number of work groups. This output is useful if, for instance, you allocate shared memory dynamically.
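The raw CSV can be hard to scan; assuming the standard util-linux `column` tool is available, one way to view it with aligned columns is:

```bash
column -s, -t results.csv | less -S
```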
Additionally, you can check the statistics file `results.stats.csv`, which displays one line per kernel, sorted in descending order of duration.
You can trace HIP, GPU and Copy activity with `--hip-trace`:

```bash
rocprof --hip-trace nbody-orig 65536
```
The output is the file `results.hip_stats.csv`, which lists the HIP API calls and their durations, sorted in descending order. This can be useful for finding HIP API calls that may be bottlenecks.
You can also profile the HSA API by adding the `--hsa-trace` option. This is useful if you are profiling OpenMP target offload code, for instance, since the compiler implements all GPU offloading via the HSA layer:

```bash
rocprof --hip-trace --hsa-trace nbody-orig 65536
```
In addition to `results.hip_stats.csv`, the command above will create the file `results.hsa_stats.csv`, which contains the statistics for the HSA calls.
The `results.json` file produced by `rocprof` can be downloaded to your local machine and viewed in the Perfetto UI. This file contains the timeline trace for this application, but shows only GPU, Copy and HIP API activity.
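For example, you could fetch the file from your local machine with `scp`; the user, host, and paths below are placeholders, so substitute your own:

```bash
scp user@remote-host:~/HPCTrainingExamples/HIPIFY/mini-nbody/hip/results.json .
```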
Once you have downloaded the file, open a browser and go to https://ui.perfetto.dev/. Click on `Open trace file` in the top left corner and navigate to the `results.json` you just downloaded. Use the WASD keys to navigate the trace view.
To read about the available GPU hardware counters, inspect the output of the following command:

```bash
less $ROCM_PATH/lib/rocprofiler/gfx_metrics.xml
```
In the output displayed, look for the section associated with the hardware you are running on (for instance, `gfx90a` for MI210 or `gfx942` for MI300A).
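To jump straight to the relevant section (here assuming a `gfx90a` part), you can search for the architecture name:

```bash
grep -n 'gfx90a' $ROCM_PATH/lib/rocprofiler/gfx_metrics.xml
```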
Create a `rocprof_counters.txt` file with the counters you would like to collect, for instance:

```bash
touch rocprof_counters.txt
```

and write this in `rocprof_counters.txt` as an example:

```
pmc : Wavefronts VALUInsts
pmc : SALUInsts SFetchInsts GDSInsts
pmc : MemUnitBusy ALUStalledByLDS
```
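Equivalently, you can create and populate the file in a single step with a shell heredoc:

```bash
cat > rocprof_counters.txt <<'EOF'
pmc : Wavefronts VALUInsts
pmc : SALUInsts SFetchInsts GDSInsts
pmc : MemUnitBusy ALUStalledByLDS
EOF
```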
Execute with the counters we just added, including the `--timestamp on` option, which turns on GPU kernel timestamps:

```bash
rocprof --timestamp on -i rocprof_counters.txt nbody-orig 65536
```
You'll notice that `rocprof` runs three passes, one for each `pmc` line (set of counters) in that file.
View the contents of `rocprof_counters.csv` for the collected counter values for each invocation of each kernel:

```bash
cat rocprof_counters.csv
```