GPUprobe (GPU probe, GP-uprobe) provides utilities for observing GPU behavior through applications' interactions with the CUDA runtime API, leveraging eBPF uprobes.
GPU monitoring and debugging traditionally require either heavyweight profiling tools like Nsight (which significantly impact performance) or high-level monitoring solutions like DCGM (which lack granular insights into application behavior). This creates a gap for developers who need detailed GPU runtime information without the overhead of full profiling or code instrumentation.
GPUprobe fills this gap by leveraging eBPF to provide:
- Real-time memory leak detection at the CUDA runtime level
- Kernel launch frequency tracking
- Memory bandwidth utilization metrics
The key advantage of GPUprobe's approach is that it requires zero modification to existing code bases. Whether you're running production ML pipelines, handling complex GPU computations, or debugging CUDA applications, GPUprobe can monitor multiple running processes calling the CUDA runtime API and provide granular insights without any changes to your CUDA kernels or application code.
By hooking directly into the CUDA runtime API through eBPF uprobes, GPUprobe maintains a lightweight footprint while still offering detailed observability into GPU behavior - making it suitable for both development and production environments.
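To illustrate the mechanism, the sketch below shows roughly what such a hook could look like on the eBPF side. It is a minimal, hypothetical example rather than GPUprobe's actual source: the program and map names are invented, and it simply counts `cudaMalloc` calls per process.

```c
// Hypothetical sketch (not GPUprobe's actual source): a uprobe attached
// to cudaMalloc in libcudart.so that counts calls per process. It fires
// on every call without any change to the traced application.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 10240);
	__type(key, u32);   /* pid */
	__type(value, u64); /* number of cudaMalloc calls observed */
} malloc_calls SEC(".maps");

SEC("uprobe/cudaMalloc")
int BPF_KPROBE(trace_cuda_malloc, void **dev_ptr, size_t size)
{
	u32 pid = bpf_get_current_pid_tgid() >> 32;
	u64 init = 1, *count;

	count = bpf_map_lookup_elem(&malloc_calls, &pid);
	if (count)
		__sync_fetch_and_add(count, 1);
	else
		bpf_map_update_elem(&malloc_calls, &pid, &init, BPF_NOEXIST);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```

A userspace agent then attaches programs like this to `libcudart.so` and periodically reads the maps to produce metrics such as those shown below.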
This repository provides the source code for `gpuprobe-daemon`, a lightweight binary that implements these capabilities. While the project is experimental, it already offers several powerful features, described below.
For information on building and running, refer to the short guide at the end of this document.
```
Usage: gpu_probe [OPTIONS]

Options:
      --memleak
          Attaches memleak program: detects leaking calls to cudaMalloc from the CUDA runtime API
      --cudatrace
          Attaches the cudatrace program: maintains per-process histograms of cuda kernel launches and their frequencies
      --bandwidth-util
          Attaches the bandwidth util program: approximates bandwidth utilization of cudaMemcpy
      --metrics-addr <METRICS_ADDR>
          Address for the Prometheus metrics endpoint [default: 0.0.0.0:9000]
      --display-interval <DISPLAY_INTERVAL>
          Interval in seconds for displaying metrics to stdout [default: 5]
      --libcudart-path <LIBCUDART_PATH>
          The path of the libcudart.so dynamic lib that is monitored [default: /usr/local/cuda/lib64/libcudart.so]
  -h, --help
          Print help
  -V, --version
          Print version
```
Metrics are exported in OpenMetrics format via an HTTP handler, which is intended to be scraped by Prometheus. This allows for seamless integration with your favorite observability stack, e.g. Grafana. These metrics are also displayed periodically to stdout:
```
2024-12-12 16:32:46
num_successful_mallocs: 6
num_failed_mallocs: 0
num_successful_frees: 2
num_failed_frees: 0
per-process memory maps:
process 365159
    0x0000793a44000000: 8000000 Bytes
    0x0000793a48c00000: 8000000 Bytes
    0x0000793a49400000: 8000000 Bytes
process 365306
    0x000078fd20000000: 8000000 Bytes
    0x000078fd24c00000: 0 Bytes
    0x000078fd25400000: 0 Bytes
total kernel launches: 1490
pid: 365306
    0x5823e39efa50 (unknown kernel) -> 10
    0x5823e39efb30 (unknown kernel) -> 10
pid: 365159
    0x5de98f9fba50 (_Z27optimized_convolution_part1PdS_i) -> 735
    0x5de98f9fbb30 (_Z27optimized_convolution_part2PdS_i) -> 735
```
The various features are opt-in via command-line arguments passed to the program at launch. For example, running `gpuprobe --memleak` will only attach the uprobes needed for the memleak feature, and will only display/export the relevant metrics.
The memleak utility correlates each call to `cudaFree()` with the associated call to `cudaMalloc()`, allowing it to measure the number of leaked bytes associated with a CUDA virtual address.
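One plausible way to implement this correlation (a sketch under assumptions, not the actual implementation; all names here are hypothetical) is a pair of probes on `cudaMalloc` plus one on `cudaFree`. Since `cudaMalloc(void **devPtr, size_t size)` only fills in `*devPtr` on return, the entry probe stashes its arguments and the exit probe reads the device address back from user memory; whatever is never freed remains in the map as a leak.

```c
// Hypothetical sketch (not GPUprobe's actual source) of the
// cudaMalloc/cudaFree correlation described above.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct pending_malloc {
	u64 dev_ptr_addr; /* user-space address of the devPtr out-param */
	u64 size;         /* requested allocation size in bytes */
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 10240);
	__type(key, u32); /* tid, to pair an entry with its return */
	__type(value, struct pending_malloc);
} pending SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 10240);
	__type(key, u64);   /* CUDA device virtual address */
	__type(value, u64); /* outstanding bytes at that address */
} allocations SEC(".maps");

SEC("uprobe/cudaMalloc")
int BPF_KPROBE(malloc_enter, void **dev_ptr, size_t size)
{
	u32 tid = (u32)bpf_get_current_pid_tgid();
	struct pending_malloc p = {
		.dev_ptr_addr = (u64)dev_ptr,
		.size = (u64)size,
	};

	bpf_map_update_elem(&pending, &tid, &p, BPF_ANY);
	return 0;
}

SEC("uretprobe/cudaMalloc")
int BPF_KRETPROBE(malloc_exit, int ret)
{
	u32 tid = (u32)bpf_get_current_pid_tgid();
	struct pending_malloc *p = bpf_map_lookup_elem(&pending, &tid);
	u64 dev_addr = 0;

	if (!p)
		return 0;
	/* ret == 0 (cudaSuccess) means *devPtr now holds the allocation. */
	if (ret == 0 &&
	    bpf_probe_read_user(&dev_addr, sizeof(dev_addr),
				(void *)p->dev_ptr_addr) == 0)
		bpf_map_update_elem(&allocations, &dev_addr, &p->size,
				    BPF_ANY);
	bpf_map_delete_elem(&pending, &tid);
	return 0;
}

SEC("uprobe/cudaFree")
int BPF_KPROBE(free_enter, void *dev_ptr)
{
	u64 dev_addr = (u64)dev_ptr;

	/* Entries that are never freed stay in the map: those are leaks. */
	bpf_map_delete_elem(&allocations, &dev_addr);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```

A dump of such a map, keyed by device address, would look like the per-process memory maps in the sample output above.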
The cudatrace utility keeps statistics on launched kernels and the number of times each was launched, as a `(func_addr, count)` pair. This data can be thought of, and aggregated, as a histogram of kernel launch frequencies.
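A sketch of how such a histogram could be maintained in eBPF follows; the map and key layout are invented for illustration, not taken from GPUprobe's source. The first argument of `cudaLaunchKernel` is the host-side address of the kernel stub, which is presumably what appears as the raw addresses in the output above.

```c
// Hypothetical sketch (not GPUprobe's actual source): count kernel
// launches per (pid, kernel function address) pair.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct launch_key {
	u32 pid;
	u64 func_addr;
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 10240);
	__type(key, struct launch_key);
	__type(value, u64); /* launch count */
} launches SEC(".maps");

SEC("uprobe/cudaLaunchKernel")
int BPF_KPROBE(trace_launch, const void *func)
{
	struct launch_key key;
	u64 init = 1, *count;

	/* Zero explicitly: padding bytes in a hash-map key must be zeroed. */
	__builtin_memset(&key, 0, sizeof(key));
	key.pid = bpf_get_current_pid_tgid() >> 32;
	key.func_addr = (u64)func;

	count = bpf_map_lookup_elem(&launches, &key);
	if (count)
		__sync_fetch_and_add(count, 1);
	else
		bpf_map_update_elem(&launches, &key, &init, BPF_NOEXIST);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```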
The bandwidth-util feature approximates bandwidth utilization on the bus between host and device as a function of the execution time and size of a `cudaMemcpy()` call. This is computed naively as `throughput = count / (end - start)`. Note that this only plausibly works for host-to-device (H2D) and device-to-host (D2H) copies, as only these calls provide any guarantee of synchronicity.
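As a hypothetical sketch of that computation (not the project's actual source), an entry probe can record the byte count, direction, and a timestamp, and a return probe can measure the elapsed time once the synchronous copy completes:

```c
// Hypothetical sketch (not GPUprobe's actual source): time each
// synchronous cudaMemcpy and report bytes plus duration.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* cudaMemcpyKind values from the CUDA runtime headers. */
#define CUDA_MEMCPY_H2D 1
#define CUDA_MEMCPY_D2H 2

struct memcpy_call {
	u64 bytes;
	u64 start_ns;
	u32 kind;
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 10240);
	__type(key, u32); /* tid, to pair an entry with its return */
	__type(value, struct memcpy_call);
} in_flight SEC(".maps");

SEC("uprobe/cudaMemcpy")
int BPF_KPROBE(memcpy_enter, void *dst, const void *src, size_t count,
	       int kind)
{
	u32 tid = (u32)bpf_get_current_pid_tgid();
	struct memcpy_call call = {};

	/* Only synchronous H2D/D2H copies have a meaningful duration. */
	if (kind != CUDA_MEMCPY_H2D && kind != CUDA_MEMCPY_D2H)
		return 0;

	call.bytes = (u64)count;
	call.start_ns = bpf_ktime_get_ns();
	call.kind = (u32)kind;
	bpf_map_update_elem(&in_flight, &tid, &call, BPF_ANY);
	return 0;
}

SEC("uretprobe/cudaMemcpy")
int BPF_KRETPROBE(memcpy_exit)
{
	u32 tid = (u32)bpf_get_current_pid_tgid();
	struct memcpy_call *call = bpf_map_lookup_elem(&in_flight, &tid);
	u64 delta_ns;

	if (!call)
		return 0;
	delta_ns = bpf_ktime_get_ns() - call->start_ns;
	/* Userspace derives bytes/sec as bytes * 1e9 / delta_ns. */
	bpf_printk("memcpy kind=%u bytes=%llu dur_ns=%llu",
		   call->kind, call->bytes, delta_ns);
	bpf_map_delete_elem(&in_flight, &tid);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```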
The metrics for this feature are not yet exported. Below is sample output from an older iteration that simply wrote the results to stdout.
```
GPUprobe bandwidth_util utility
========================
Traced 1 cudaMemcpy calls
H2D 3045740550.87548 bytes/sec for 0.00263 secs
========================
Traced 2 cudaMemcpy calls
H2D 2981869117.56429 bytes/sec for 0.00268 secs
D2H 3039108386.38160 bytes/sec for 0.00263 secs
========================
```
An eBPF-compatible Linux kernel is required for running GPUprobe, as well as `bpftool`. A `vmlinux.h` file is required for the build process; it can be created by executing the following command from the project root:

```
bpftool btf dump file /sys/kernel/btf/vmlinux format c > src/bpf/vmlinux.h
```
Following that, you should be able to build the project:

```
cargo build
```
Root privileges are required to run the project because it attaches eBPF uprobes:

```
sudo ./gpu_probe # --options
```