\pagebreak{}
NOTE: extensive documentation on how to use omnitrace for the GhostExchange_Array example is now available as README.md files in the exercises repo. Although that documentation was tested on Frontier, most of the omnitrace tools work the same way elsewhere, so it can serve as additional training material.
Here, we show how to use the omnitrace tools with the example in HPCTrainingExamples/HIP/jacobi.
Setup environment:
module purge
module load omnitrace gcc/13
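Before going further, it can help to confirm that the module load actually put the omnitrace tools on your PATH. The sketch below is a hypothetical helper (the name require_tools is an assumption and not part of omnitrace) that relies only on the POSIX command -v built-in:

```shell
# Hypothetical helper (not part of omnitrace): report which of the given
# commands can be found on PATH. Returns non-zero if any is missing.
require_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example (run after the module load above):
#   require_tools omnitrace-avail omnitrace-instrument omnitrace-run
```

If any tool is reported as missing, recheck the module names available on your system before continuing.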
Next, create a configuration file for omnitrace:
omnitrace-avail -G ~/omnitrace.cfg
If you do not provide a path to the config file, it will generate one in the current directory: ./omnitrace-config.cfg. This config file contains several flags that turn options on or off; these options affect how the traces are visualized in Perfetto. You can see which flags can be included in the config file by running:
omnitrace-avail --categories omnitrace
To add brief descriptions, use the -bd option:
omnitrace-avail -bd --categories omnitrace
Note that the list of flags displayed by the commands above may not include all actual flags that can be set in the config.
You can also create a configuration file with a description for each option. Beware, this is quite verbose:
omnitrace-avail -G ~/omnitrace_all.cfg --all
Next, declare that you want to use this configuration file. This is only necessary if you provided a custom path and/or filename for the config file when you created it.
export OMNITRACE_CONFIG_FILE=~/omnitrace.cfg
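A mistyped path in this variable is easy to miss, so a quick check can save a confusing run later. The following is a minimal sketch of such a check; the function name check_omnitrace_config is hypothetical and the function only inspects the environment variable and the file system:

```shell
# Hypothetical helper: confirm OMNITRACE_CONFIG_FILE is set and readable.
check_omnitrace_config() {
    if [ -z "${OMNITRACE_CONFIG_FILE:-}" ]; then
        echo "OMNITRACE_CONFIG_FILE is not set"
        return 1
    fi
    if [ ! -r "$OMNITRACE_CONFIG_FILE" ]; then
        echo "cannot read: $OMNITRACE_CONFIG_FILE"
        return 1
    fi
    echo "using config: $OMNITRACE_CONFIG_FILE"
}

# Example:
#   check_omnitrace_config
```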
Go to the Jacobi code in the examples repo:
cd ~/HPCTrainingExamples/HIP/jacobi
Compile the code:
make
Execute the binary to make sure it runs successfully:
<!-- Note: To get rid of Read -1, expected 4136, errno = 1, add --mca pml ucx --mca pml_ucx_tls ib,sm,tcp,self,cuda,rocm to the mpirun command line -->
mpirun -np 1 ./Jacobi_hip -g 1 1
Run the code with omnitrace-instrument to perform runtime instrumentation. This produces a series of directories named after the time they were created. In one of these directories you can find the wall_clock-<proc_ID>.txt file, which lists the function calls made in the code, including how many times each function was called (COUNT) and the total time in seconds it took (SUM):
mpirun -np 1 omnitrace-instrument -- ./Jacobi_hip -g 1 1
The above command produces a folder called instrumentation that contains the available.txt file, which shows all the functions that can be instrumented. To instrument specific functions, include the --function-include <fnc> option (-I for short) in the omnitrace-instrument command, for example:
mpirun -np 1 omnitrace-instrument -v 1 -I 'Jacobi_t::Run' 'JacobiIteration' -- ./Jacobi_hip -g 1 1
The output provided by the above command will show that only those functions have been instrumented:
[...]
[omnitrace][exe] 1 instrumented funcs in JacobiIteration.hip
[omnitrace][exe] 1 instrumented funcs in JacobiRun.hip
[omnitrace][exe] 1 instrumented funcs in Jacobi_hip
[omnitrace][exe] 2 instrumented funcs in librocprofiler-register.so.0.3.0
[...]
Alternatively, you can use the --print-available functions option as shown below. The --simulate option exits after outputting the diagnostics, and the -v option enables verbose output:
(NOTE: the output of the next command may be lengthy; you may want to redirect it to a file by adding >> out.txt at the end of the line, which makes it easier to search afterwards.)
mpirun -np 1 omnitrace-instrument -v 1 --simulate --print-available functions -- ./Jacobi_hip -g 1 1
You can create an instrumented binary using omnitrace-instrument. Notice that this doesn't take very long to run:
omnitrace-instrument -o ./Jacobi_hip.inst -- ./Jacobi_hip
Execute the new instrumented binary using the omnitrace-run command inside mpirun. This is the recommended way to profile MPI applications, as omnitrace will separate the output files for each rank:
mpirun -np 1 omnitrace-run -- ./Jacobi_hip.inst -g 1 1
To see the list of the instrumented GPU calls, make sure to turn on the OMNITRACE_PROFILE flag in your config file:
OMNITRACE_PROFILE = true
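Config options such as this one can be edited by hand, but the change can also be scripted. The sketch below assumes the config file uses one KEY = VALUE entry per line, as in the examples in this document; set_cfg_option is a hypothetical helper name, and the sed expression assumes the value contains no | or & characters:

```shell
# Hypothetical helper: set KEY = VALUE in an omnitrace config file.
# Replaces an existing "KEY = ..." line, or appends one if the key is absent.
# Assumes VALUE contains no '|' or '&' (both are special in the sed command).
set_cfg_option() {
    key=$1; value=$2; file=$3
    if grep -q "^[[:space:]]*${key}[[:space:]]*=" "$file"; then
        tmp=$(mktemp) || return 1
        # Write to a temporary file first, then copy back over the original.
        sed "s|^[[:space:]]*${key}[[:space:]]*=.*|${key} = ${value}|" "$file" > "$tmp" &&
            cat "$tmp" > "$file"
        status=$?
        rm -f "$tmp"
        return "$status"
    else
        echo "${key} = ${value}" >> "$file"
    fi
}

# Example:
#   set_cfg_option OMNITRACE_PROFILE true "$OMNITRACE_CONFIG_FILE"
```

The same helper can be reused for the other config entries changed later in this document.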
Running the instrumented binary again, you can see that it generated a few extra files. One of those has a list of instrumented GPU calls and durations of those calls:
cat omnitrace-Jacobi_hip.inst-output/<TIMESTAMP>/roctracer-0.txt
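Since every run writes into a new timestamped directory, typing the timestamp by hand gets tedious. A minimal sketch of a shortcut follows; latest_run_dir is a hypothetical helper that simply picks the most recently modified subdirectory, and it assumes directory names contain no newlines:

```shell
# Hypothetical helper: print the most recently modified subdirectory of an
# omnitrace output directory (the printed path ends with a slash).
latest_run_dir() {
    base=${1:-omnitrace-Jacobi_hip.inst-output}
    # ls -t sorts newest first; the trailing slash in the glob restricts the
    # match to directories.
    ls -1dt "$base"/*/ 2>/dev/null | head -n 1
}

# Example:
#   cat "$(latest_run_dir)roctracer-0.txt"
```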
If you get errors when you run an instrumented binary or when you run with runtime instrumentation, add the options --monochrome -v 1 --debug and try again (raise the verbosity to -v 2 if you need more detail). This gives you additional debug information to help you figure out where the problem may lie:
mpirun -np 1 omnitrace-run --monochrome -v 1 --debug -- ./Jacobi_hip.inst -g 1 1
Copy the perfetto-trace-0.proto file to your local machine with the command below, then open the web page https://ui.perfetto.dev/ in the Chrome browser:
scp -i <path/to/ssh/key> -P <port_number> <username>@aac1.amd.com:~/<path/to/proto/file> .
Click Open trace file and select the .proto file. Below, you can see an example of how a .proto file would be visualized on Perfetto:
To see a list of all the counters for all the devices on the node, do:
omnitrace-avail --all
Declare in your configuration file:
OMNITRACE_ROCM_EVENTS = VALUUtilization,FetchSize
Check again:
grep OMNITRACE_ROCM_EVENTS $OMNITRACE_CONFIG_FILE
Run the instrumented binary, and you will observe an output file for each hardware counter specified. You should also see a row for each hardware counter in the perfetto trace generated by Omnitrace.
Note that you do not have to instrument again after making changes to the config file. Just running the instrumented binary picks up the changes you make in the config file. Ensure that the OMNITRACE_CONFIG_FILE environment variable is pointing to your config file.
mpirun -np 1 omnitrace-run -- ./Jacobi_hip.inst -g 1 1
The output should show something like this:
...]> Outputting 'omnitrace-Jacobi_hip.inst-output/<TIMESTAMP>/rocprof-device-0-VALUUtilization-0.json'
...]> Outputting 'omnitrace-Jacobi_hip.inst-output/<TIMESTAMP>/rocprof-device-0-VALUUtilization-0.txt'
...]> Outputting 'omnitrace-Jacobi_hip.inst-output/<TIMESTAMP>/rocprof-device-0-FetchSize-0.json'
...]> Outputting 'omnitrace-Jacobi_hip.inst-output/<TIMESTAMP>/rocprof-device-0-FetchSize-0.txt'
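To confirm that each requested counter actually produced an output file, the check can be scripted. The sketch below is a hypothetical helper (check_counter_outputs is not an omnitrace command); it assumes plain comma-separated counter names in OMNITRACE_ROCM_EVENTS and the rocprof-device-<N>-<counter>-<rank>.txt naming seen in the output above:

```shell
# Hypothetical helper: for each counter listed in OMNITRACE_ROCM_EVENTS in the
# config file, report whether a matching rocprof-device-*-<counter>-*.txt
# file exists in the given run directory.
check_counter_outputs() {
    cfg=$1; run_dir=$2
    # Take the value of the last OMNITRACE_ROCM_EVENTS entry in the config.
    events=$(sed -n 's/^[[:space:]]*OMNITRACE_ROCM_EVENTS[[:space:]]*=[[:space:]]*//p' "$cfg" | tail -n 1)
    for counter in $(echo "$events" | tr ',' ' '); do
        if ls "$run_dir"/rocprof-device-*-"$counter"-*.txt >/dev/null 2>&1; then
            echo "ok: $counter"
        else
            echo "no output: $counter"
        fi
    done
}

# Example:
#   check_counter_outputs "$OMNITRACE_CONFIG_FILE" omnitrace-Jacobi_hip.inst-output/<TIMESTAMP>
```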
If you do not want to see the details for every CPU core, modify the config file to select only what you want to see, say CPU cores 0-2 only:
OMNITRACE_SAMPLING_CPUS = 0-2
Now running the instrumented binary again will show significantly fewer CPU lines in the profile:
mpirun -np 1 omnitrace-run -- ./Jacobi_hip.inst -g 1 1
Run the instrumented binary with multiple ranks. You'll find multiple perfetto-trace-*.proto files, one for each rank (note that depending on your system it may be necessary to do a salloc prior to the command below to ensure enough resources are available):
mpirun -np 2 omnitrace-run -- ./Jacobi_hip.inst -g 2 1
You can visualize them separately in Perfetto, or combine them using cat and visualize them in the same Perfetto window (trace concatenation is not available in all omnitrace versions):
cat perfetto-trace-0.proto perfetto-trace-1.proto > allprocesses.proto
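With more than two ranks, listing every file by hand becomes error-prone. A minimal sketch of a generalization is shown here; merge_traces is a hypothetical helper, the files are concatenated in the shell's glob order (so rank 10 sorts before rank 2), and whether the merged file loads correctly still depends on the omnitrace version as noted above:

```shell
# Hypothetical helper: concatenate every perfetto-trace-*.proto file in a run
# directory into a single output file (default: allprocesses.proto).
merge_traces() {
    run_dir=$1; out=${2:-allprocesses.proto}
    # Expand the glob into the positional parameters.
    set -- "$run_dir"/perfetto-trace-*.proto
    if [ ! -e "$1" ]; then
        echo "no perfetto-trace-*.proto files in $run_dir" >&2
        return 1
    fi
    cat "$@" > "$out"
    echo "merged $# trace file(s) into $out"
}

# Example:
#   merge_traces omnitrace-Jacobi_hip.inst-output/<TIMESTAMP>
```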
Set the following in your configuration file:
OMNITRACE_USE_SAMPLING = true
OMNITRACE_SAMPLING_FREQ = 100
Execute the instrumented binary and visualize the perfetto trace:
mpirun -np 1 omnitrace-run -- ./Jacobi_hip.inst -g 1 1
Scroll down to the very bottom to see the sampling output. Those traces will be annotated with a (S) as well.
Open the wall_clock-0.txt file:
cat omnitrace-Jacobi_hip.inst-output/<TIMESTAMP>/wall_clock-0.txt
To see the kernel durations aggregated, set the following in your config file or in the environment:
OMNITRACE_PROFILE = true
OMNITRACE_FLAT_PROFILE = true
Execute the code and check the wall_clock-0.txt file again. Instead of updating the config file, you can also set the environment variables to achieve the same effect.
OMNITRACE_PROFILE=true OMNITRACE_FLAT_PROFILE=true mpirun -np 1 omnitrace-run -- ./Jacobi_hip.inst -g 1 1
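The VAR=value prefix used in the command above applies only to that one command and leaves the calling shell unchanged. The short demonstration below illustrates this with a plain sh standing in for the mpirun command (a stand-in chosen so the snippet runs anywhere):

```shell
# Start from a clean state for the demonstration.
unset OMNITRACE_FLAT_PROFILE

# The assignment prefix is visible inside the command it precedes...
OMNITRACE_FLAT_PROFILE=true sh -c 'echo "inside: ${OMNITRACE_FLAT_PROFILE:-unset}"'

# ...but not in the calling shell afterwards.
echo "after: ${OMNITRACE_FLAT_PROFILE:-unset}"
```

This prints "inside: true" followed by "after: unset". Use export instead if you want the setting to persist for later commands in the same session.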