Adding HAL profiling API and RenderDoc support for Vulkan. (iree-org#10893)

TLDR: configure with `-DIREE_ENABLE_RENDERDOC_PROFILING=ON`, pass the
`--device_profiling_mode=queue` flag to the IREE tools, and launch the
tools from the RenderDoc UI in order to get a capture (or use
`renderdoccmd capture`):

![image](https://user-images.githubusercontent.com/75337/197648585-b34bd661-cfd1-4fbb-a6f9-2b73bec81b6a.png)

Things are set up to allow for other profiling modes in the future but
how best to integrate those is TBD. We can figure out how to scale this
with other tooling and on other backends but the rough shape of the API
should be compatible with the various backend APIs we target
(D3D/Metal/CUDA/Vulkan/perf/etc). Note that because RenderDoc will also
capture D3D the cmake flag is generic but both the Vulkan and D3D HAL
implementations will need to load it themselves (no real code worth
sharing as D3D naturally only needs the Windows API query path).

Docs have notes that I've verified on Windows. Someone looking to use
this on Android will need to figure that out and can add what they find.

Fixes iree-org#45. Forty five. Wow.
benvanik authored Oct 25, 2022
2 parents c1348da + 2d79c20 commit 3e9602e
Showing 21 changed files with 1,157 additions and 3 deletions.
1 change: 1 addition & 0 deletions CMakeLists.txt
@@ -47,6 +47,7 @@ include(CMakeDependentOption)

option(IREE_ENABLE_RUNTIME_TRACING "Enables instrumented runtime tracing." OFF)
option(IREE_ENABLE_COMPILER_TRACING "Enables instrumented compiler tracing." OFF)
option(IREE_ENABLE_RENDERDOC_PROFILING "Enables profiling HAL devices with the RenderDoc tool." OFF)
option(IREE_ENABLE_THREADING "Builds IREE in with thread library support." ON)
option(IREE_ENABLE_CLANG_TIDY "Builds IREE in with clang tidy enabled on IREE's libraries." OFF)

27 changes: 27 additions & 0 deletions docs/developers/developing_iree/profiling_vulkan_gpu.md
@@ -8,6 +8,33 @@ like Tracy, vendor-specific tools can be used.

(TODO: add some pictures for each tool)

## RenderDoc

Support for [RenderDoc](https://github.com/baldurk/renderdoc) can be enabled by
configuring cmake with `-DIREE_ENABLE_RENDERDOC_PROFILING=ON`. When built in to
IREE the profiling functionality is available for programmatic use via the
`iree_hal_device_profiling_begin` and `iree_hal_device_profiling_end` APIs.

When using one of the standard IREE tools (`iree-run-module`,
`iree-benchmark-module`, etc) the `--device_profiling_mode=queue` flag can be
passed to enable capture around the entire invocation (be careful when
benchmarking as the recordings can be quite large!). The default capture file
name can be specified with `--device_profiling_file=foo.rdc`.

Capturing in the RenderDoc UI can be done by specifying the IREE tool or
embedding application (`iree-run-module`, etc) as the launch executable and
adding all arguments as normal.

Capturing from the command line can be done using `renderdoccmd` with the
specified file appearing (by default) in the executable directory:

```shell
$ renderdoccmd capture tools/iree-run-module --device_profiling_mode=queue --device_profiling_file=foo.rdc ...
$ stat tools/foo.rdc
$ renderdoccmd capture tools/iree-run-module --device_profiling_mode=queue --device_profiling_file=/some/path/foo.rdc ...
$ stat /some/path/foo.rdc
```

## Android GPUs

There are multiple GPU vendors for the Android platforms, each offering their
15 changes: 15 additions & 0 deletions experimental/rocm/rocm_device.c
@@ -302,6 +302,19 @@ static iree_status_t iree_hal_rocm_device_wait_semaphores(
                          "semaphore not implemented");
}

static iree_status_t iree_hal_rocm_device_profiling_begin(
    iree_hal_device_t* device,
    const iree_hal_device_profiling_options_t* options) {
  // Unimplemented (and that's ok).
  return iree_ok_status();
}

static iree_status_t iree_hal_rocm_device_profiling_end(
    iree_hal_device_t* device) {
  // Unimplemented (and that's ok).
  return iree_ok_status();
}

static const iree_hal_device_vtable_t iree_hal_rocm_device_vtable = {
    .destroy = iree_hal_rocm_device_destroy,
    .id = iree_hal_rocm_device_id,
@@ -324,4 +337,6 @@ static const iree_hal_device_vtable_t iree_hal_rocm_device_vtable = {
    .queue_execute = iree_hal_rocm_device_queue_execute,
    .queue_flush = iree_hal_rocm_device_queue_flush,
    .wait_semaphores = iree_hal_rocm_device_wait_semaphores,
    .profiling_begin = iree_hal_rocm_device_profiling_begin,
    .profiling_end = iree_hal_rocm_device_profiling_end,
};
21 changes: 21 additions & 0 deletions runtime/src/iree/hal/device.c
@@ -276,3 +276,24 @@ IREE_API_EXPORT iree_status_t iree_hal_device_wait_semaphores(
  IREE_TRACE_ZONE_END(z0);
  return status;
}

IREE_API_EXPORT iree_status_t iree_hal_device_profiling_begin(
    iree_hal_device_t* device,
    const iree_hal_device_profiling_options_t* options) {
  IREE_ASSERT_ARGUMENT(device);
  IREE_ASSERT_ARGUMENT(options);
  IREE_TRACE_ZONE_BEGIN(z0);
  iree_status_t status =
      _VTABLE_DISPATCH(device, profiling_begin)(device, options);
  IREE_TRACE_ZONE_END(z0);
  return status;
}

IREE_API_EXPORT iree_status_t
iree_hal_device_profiling_end(iree_hal_device_t* device) {
  IREE_ASSERT_ARGUMENT(device);
  IREE_TRACE_ZONE_BEGIN(z0);
  iree_status_t status = _VTABLE_DISPATCH(device, profiling_end)(device);
  IREE_TRACE_ZONE_END(z0);
  return status;
}
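
The two entry points above forward to per-backend function pointers, and the backends in this commit that have no profiling support simply return OK from both hooks. The following is a minimal, self-contained sketch of that pattern. Every `sketch_`-prefixed name is hypothetical and exists only for illustration; the real `_VTABLE_DISPATCH` macro, status type, and vtable layout are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

typedef int sketch_status_t;  // Assumption: 0 means OK in this sketch.

typedef struct sketch_device_t sketch_device_t;

// Simplified vtable holding only the two profiling hooks.
typedef struct sketch_device_vtable_t {
  sketch_status_t (*profiling_begin)(sketch_device_t* device);
  sketch_status_t (*profiling_end)(sketch_device_t* device);
} sketch_device_vtable_t;

struct sketch_device_t {
  const sketch_device_vtable_t* vtable;
  int captures_started;  // Only modified by the counting backend below.
};

// Backend without profiling support: report success and do nothing.
static sketch_status_t sketch_noop_profiling_begin(sketch_device_t* device) {
  (void)device;
  return 0;
}
static sketch_status_t sketch_noop_profiling_end(sketch_device_t* device) {
  (void)device;
  return 0;
}

// Backend with profiling support: counts captures instead of calling a tool.
static sketch_status_t sketch_counting_profiling_begin(
    sketch_device_t* device) {
  ++device->captures_started;
  return 0;
}
static sketch_status_t sketch_counting_profiling_end(sketch_device_t* device) {
  (void)device;
  return 0;
}

static const sketch_device_vtable_t sketch_noop_vtable = {
    .profiling_begin = sketch_noop_profiling_begin,
    .profiling_end = sketch_noop_profiling_end,
};
static const sketch_device_vtable_t sketch_counting_vtable = {
    .profiling_begin = sketch_counting_profiling_begin,
    .profiling_end = sketch_counting_profiling_end,
};

// Public entry points: validate the argument, then forward to the backend.
static sketch_status_t sketch_device_profiling_begin(sketch_device_t* device) {
  assert(device != NULL);
  return device->vtable->profiling_begin(device);
}
static sketch_status_t sketch_device_profiling_end(sketch_device_t* device) {
  assert(device != NULL);
  return device->vtable->profiling_end(device);
}
```

Because every backend fills in both slots, callers never need to check whether profiling is available before calling begin/end; an unsupported backend just succeeds without capturing anything.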
71 changes: 71 additions & 0 deletions runtime/src/iree/hal/device.h
@@ -76,6 +76,39 @@ typedef struct iree_hal_device_info_t {
  iree_string_view_t name;
} iree_hal_device_info_t;

// Defines what information is captured during profiling.
// Not all implementations will support all modes.
enum iree_hal_device_profiling_mode_bits_t {
  IREE_HAL_DEVICE_PROFILING_MODE_NONE = 0u,

  // Capture queue operations such as command buffer submissions and the
  // transfer/dispatch commands within them. This gives a high-level overview
  // of HAL API usage with minimal overhead.
  IREE_HAL_DEVICE_PROFILING_MODE_QUEUE_OPERATIONS = 1u << 0,

  // Capture aggregated dispatch performance counters across all commands within
  // the profiled range.
  IREE_HAL_DEVICE_PROFILING_MODE_DISPATCH_COUNTERS = 1u << 1,

  // Capture detailed executable performance counters correlated to source
  // locations. This can have a significant performance impact and should only
  // be used when investigating the performance of an individual dispatch.
  IREE_HAL_DEVICE_PROFILING_MODE_EXECUTABLE_COUNTERS = 1u << 2,
};
typedef uint32_t iree_hal_device_profiling_mode_t;

// Controls profiling options.
typedef struct iree_hal_device_profiling_options_t {
  // Defines what kind of profiling information is captured.
  iree_hal_device_profiling_mode_t mode;

  // A file system path where profile data will be written if supported by the
  // profiling implementation. Depending on the tool this may be a template
  // path/prefix for a unique per capture name or a full path that will be
  // overwritten each capture.
  const char* file_path;
} iree_hal_device_profiling_options_t;

// A transfer source or destination.
typedef struct iree_hal_transfer_buffer_t {
  // A host-allocated void* buffer.
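
The profiling mode values in the hunk above are single-bit flags stored in a `uint32_t`, so they can be combined with bitwise OR, and the header notes that not every implementation supports every mode. A small sketch of how a caller or backend might check a requested mask follows. The enum values are copied from the hunk; `sketch_profiling_modes_supported` is a hypothetical helper, not part of this commit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// Values copied from the enum added in this commit.
enum iree_hal_device_profiling_mode_bits_t {
  IREE_HAL_DEVICE_PROFILING_MODE_NONE = 0u,
  IREE_HAL_DEVICE_PROFILING_MODE_QUEUE_OPERATIONS = 1u << 0,
  IREE_HAL_DEVICE_PROFILING_MODE_DISPATCH_COUNTERS = 1u << 1,
  IREE_HAL_DEVICE_PROFILING_MODE_EXECUTABLE_COUNTERS = 1u << 2,
};
typedef uint32_t iree_hal_device_profiling_mode_t;

// Hypothetical helper: returns true if every bit set in |requested| is also
// set in |supported|. An empty request is always considered supported.
static bool sketch_profiling_modes_supported(
    iree_hal_device_profiling_mode_t requested,
    iree_hal_device_profiling_mode_t supported) {
  return (requested & ~supported) == 0;
}
```

For example, a backend that only handles queue operations would reject a request for `QUEUE_OPERATIONS | DISPATCH_COUNTERS` under this check, or it could choose to mask off the unsupported bits instead; which behavior the real backends use is not shown in this commit.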
@@ -381,6 +414,39 @@ IREE_API_EXPORT iree_status_t iree_hal_device_wait_semaphores(
    iree_hal_device_t* device, iree_hal_wait_mode_t wait_mode,
    const iree_hal_semaphore_list_t semaphore_list, iree_timeout_t timeout);

// Begins a profile capture on |device| with the given |options|.
// This will use an implementation-defined profiling API to capture all
// supported device operations until iree_hal_device_profiling_end is called.
// If the device or current build configuration does not support profiling
// this method is a no-op. See implementation-specific device creation APIs and
// driver module registration for more information.
//
// WARNING: the device must be idle before calling this method. Behavior is
// undefined if there are any in-flight or pending queue operations or access
// from another thread while profiling is starting/stopping.
//
// WARNING: profiling in any mode can dramatically increase overhead, with some
// modes being significantly more expensive in both host and device time,
// enough to invalidate performance numbers from other mechanisms
// (perf/tracy/etc). When measuring end-to-end performance use only
// IREE_HAL_DEVICE_PROFILING_MODE_QUEUE_OPERATIONS.
//
// Examples of APIs this maps to (where supported):
// - CPU: perf_event_open/close or vendor APIs
// - CUDA: cuProfilerStart/cuProfilerStop
// - Direct3D: PIXBeginCapture/PIXEndCapture
// - Metal: [MTLCaptureManager startCapture/stopCapture]
// - Vulkan: vkAcquireProfilingLockKHR/vkReleaseProfilingLockKHR +
//           RenderDoc StartFrameCapture/EndFrameCapture
IREE_API_EXPORT iree_status_t iree_hal_device_profiling_begin(
    iree_hal_device_t* device,
    const iree_hal_device_profiling_options_t* options);

// Ends a profile previously started with iree_hal_device_profiling_begin.
// The device must be idle before calling this method.
IREE_API_EXPORT iree_status_t
iree_hal_device_profiling_end(iree_hal_device_t* device);

//===----------------------------------------------------------------------===//
// iree_hal_device_t implementation details
//===----------------------------------------------------------------------===//
@@ -468,6 +534,11 @@ typedef struct iree_hal_device_vtable_t {
  iree_status_t(IREE_API_PTR* wait_semaphores)(
      iree_hal_device_t* device, iree_hal_wait_mode_t wait_mode,
      const iree_hal_semaphore_list_t semaphore_list, iree_timeout_t timeout);

  iree_status_t(IREE_API_PTR* profiling_begin)(
      iree_hal_device_t* device,
      const iree_hal_device_profiling_options_t* options);
  iree_status_t(IREE_API_PTR* profiling_end)(iree_hal_device_t* device);
} iree_hal_device_vtable_t;
IREE_HAL_ASSERT_VTABLE_LAYOUT(iree_hal_device_vtable_t);
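
As a rough illustration of how a caller might bracket a unit of work with the two new entry points, here is a self-contained sketch. The types and function bodies below are stand-ins that mirror the declarations in this diff so the example compiles without the IREE headers; in particular the `int` status type, the fake device struct, and every `sketch_`/`SKETCH_` name are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// --- Stand-ins for declarations in this diff (not the real headers) ---
typedef int iree_status_t;  // Assumption: the real status type is not an int.
#define SKETCH_OK 0
#define SKETCH_FAILED 1

enum {
  IREE_HAL_DEVICE_PROFILING_MODE_NONE = 0u,
  IREE_HAL_DEVICE_PROFILING_MODE_QUEUE_OPERATIONS = 1u << 0,
};
typedef uint32_t iree_hal_device_profiling_mode_t;

typedef struct iree_hal_device_profiling_options_t {
  iree_hal_device_profiling_mode_t mode;
  const char* file_path;
} iree_hal_device_profiling_options_t;

// Fake device that only records whether a capture is active.
typedef struct iree_hal_device_t {
  bool capturing;
  int begin_count;
  int end_count;
} iree_hal_device_t;

static iree_status_t iree_hal_device_profiling_begin(
    iree_hal_device_t* device,
    const iree_hal_device_profiling_options_t* options) {
  (void)options;
  device->capturing = true;
  ++device->begin_count;
  return SKETCH_OK;
}

static iree_status_t iree_hal_device_profiling_end(iree_hal_device_t* device) {
  device->capturing = false;
  ++device->end_count;
  return SKETCH_OK;
}

// --- Hypothetical caller-side helper ---
typedef iree_status_t (*sketch_work_fn_t)(iree_hal_device_t* device);

// Brackets |work| with a queue-operations capture written to |file_path|.
static iree_status_t sketch_run_with_profiling(iree_hal_device_t* device,
                                               const char* file_path,
                                               sketch_work_fn_t work) {
  iree_hal_device_profiling_options_t options = {
      .mode = IREE_HAL_DEVICE_PROFILING_MODE_QUEUE_OPERATIONS,
      .file_path = file_path,
  };
  // The device must be idle at this point (see the WARNING in device.h).
  iree_status_t status = iree_hal_device_profiling_begin(device, &options);
  if (status != SKETCH_OK) return status;
  iree_status_t work_status = work(device);
  // End the capture even if the work failed; |work| is expected to have
  // waited for the device to go idle before returning.
  iree_status_t end_status = iree_hal_device_profiling_end(device);
  return work_status != SKETCH_OK ? work_status : end_status;
}

// Example work item: succeeds only if it runs inside an active capture.
static iree_status_t sketch_work_ok(iree_hal_device_t* device) {
  return device->capturing ? SKETCH_OK : SKETCH_FAILED;
}
```

The helper always pairs a successful begin with an end so that a failure in the profiled work does not leave a capture open. Real code would additionally need to release any failed status values, which this `int`-based stand-in glosses over.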

16 changes: 16 additions & 0 deletions runtime/src/iree/hal/drivers/cuda/cuda_device.c
@@ -396,6 +396,20 @@ static iree_status_t iree_hal_cuda_device_wait_semaphores(
                          "semaphore not implemented");
}

static iree_status_t iree_hal_cuda_device_profiling_begin(
    iree_hal_device_t* device,
    const iree_hal_device_profiling_options_t* options) {
  // Unimplemented (and that's ok).
  // We could hook in to CUPTI here or use the much simpler cuProfilerStart API.
  return iree_ok_status();
}

static iree_status_t iree_hal_cuda_device_profiling_end(
    iree_hal_device_t* device) {
  // Unimplemented (and that's ok).
  return iree_ok_status();
}

static const iree_hal_device_vtable_t iree_hal_cuda_device_vtable = {
    .destroy = iree_hal_cuda_device_destroy,
    .id = iree_hal_cuda_device_id,
@@ -418,4 +432,6 @@ static const iree_hal_device_vtable_t iree_hal_cuda_device_vtable = {
    .queue_execute = iree_hal_cuda_device_queue_execute,
    .queue_flush = iree_hal_cuda_device_queue_flush,
    .wait_semaphores = iree_hal_cuda_device_wait_semaphores,
    .profiling_begin = iree_hal_cuda_device_profiling_begin,
    .profiling_end = iree_hal_cuda_device_profiling_end,
};
22 changes: 22 additions & 0 deletions runtime/src/iree/hal/drivers/local_sync/sync_device.c
@@ -373,6 +373,26 @@ static iree_status_t iree_hal_sync_device_wait_semaphores(
                                            semaphore_list, timeout);
}

static iree_status_t iree_hal_sync_device_profiling_begin(
    iree_hal_device_t* device,
    const iree_hal_device_profiling_options_t* options) {
  // Unimplemented (and that's ok).
  // We could hook in to vendor APIs (Intel/ARM/etc) or generic perf infra:
  // https://man7.org/linux/man-pages/man2/perf_event_open.2.html
  // Capturing things like:
  //   PERF_COUNT_HW_CPU_CYCLES / PERF_COUNT_HW_INSTRUCTIONS
  //   PERF_COUNT_HW_CACHE_REFERENCES / PERF_COUNT_HW_CACHE_MISSES
  //   etc
  // TODO(benvanik): shared iree/hal/local/profiling implementation of this.
  return iree_ok_status();
}

static iree_status_t iree_hal_sync_device_profiling_end(
    iree_hal_device_t* device) {
  // Unimplemented (and that's ok).
  return iree_ok_status();
}

static const iree_hal_device_vtable_t iree_hal_sync_device_vtable = {
    .destroy = iree_hal_sync_device_destroy,
    .id = iree_hal_sync_device_id,
@@ -395,4 +415,6 @@ static const iree_hal_device_vtable_t iree_hal_sync_device_vtable = {
    .queue_execute = iree_hal_sync_device_queue_execute,
    .queue_flush = iree_hal_sync_device_queue_flush,
    .wait_semaphores = iree_hal_sync_device_wait_semaphores,
    .profiling_begin = iree_hal_sync_device_profiling_begin,
    .profiling_end = iree_hal_sync_device_profiling_end,
};
22 changes: 22 additions & 0 deletions runtime/src/iree/hal/drivers/local_task/task_device.c
@@ -392,6 +392,26 @@ static iree_status_t iree_hal_task_device_wait_semaphores(
                                            &device->large_block_pool);
}

static iree_status_t iree_hal_task_device_profiling_begin(
    iree_hal_device_t* device,
    const iree_hal_device_profiling_options_t* options) {
  // Unimplemented (and that's ok).
  // We could hook in to vendor APIs (Intel/ARM/etc) or generic perf infra:
  // https://man7.org/linux/man-pages/man2/perf_event_open.2.html
  // Capturing things like:
  //   PERF_COUNT_HW_CPU_CYCLES / PERF_COUNT_HW_INSTRUCTIONS
  //   PERF_COUNT_HW_CACHE_REFERENCES / PERF_COUNT_HW_CACHE_MISSES
  //   etc
  // TODO(benvanik): shared iree/hal/local/profiling implementation of this.
  return iree_ok_status();
}

static iree_status_t iree_hal_task_device_profiling_end(
    iree_hal_device_t* device) {
  // Unimplemented (and that's ok).
  return iree_ok_status();
}

static const iree_hal_device_vtable_t iree_hal_task_device_vtable = {
    .destroy = iree_hal_task_device_destroy,
    .id = iree_hal_task_device_id,
@@ -414,4 +434,6 @@ static const iree_hal_device_vtable_t iree_hal_task_device_vtable = {
    .queue_execute = iree_hal_task_device_queue_execute,
    .queue_flush = iree_hal_task_device_queue_flush,
    .wait_semaphores = iree_hal_task_device_wait_semaphores,
    .profiling_begin = iree_hal_task_device_profiling_begin,
    .profiling_end = iree_hal_task_device_profiling_end,
};
9 changes: 9 additions & 0 deletions runtime/src/iree/hal/drivers/vulkan/CMakeLists.txt
@@ -118,3 +118,12 @@ iree_cc_test(
)

### BAZEL_TO_CMAKE_PRESERVES_ALL_CONTENT_BELOW_THIS_LINE ###

# If renderdoc support is enabled we can make use of it in the device.
# Note that we disable this by default as it introduces a backdoor.
if(IREE_ENABLE_RENDERDOC_PROFILING)
  target_compile_definitions(iree_hal_drivers_vulkan_vulkan
    PUBLIC
      "IREE_HAL_VULKAN_HAVE_RENDERDOC=1"
  )
endif()
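
The CMake block above only adds the `IREE_HAL_VULKAN_HAVE_RENDERDOC=1` compile definition; the Vulkan device code that consumes it is not shown on this page. The sketch below illustrates one common way such a definition is used, compiling the capture path in only when the option is on and falling back to a no-op otherwise. Both functions are hypothetical and do not reflect the actual Vulkan driver source.

```c
#include <assert.h>
#include <stdbool.h>

// Hypothetical helper: reports whether this translation unit was built with
// the RenderDoc compile definition set by the CMake option.
static bool sketch_renderdoc_compiled_in(void) {
#if defined(IREE_HAL_VULKAN_HAVE_RENDERDOC)
  return true;
#else
  return false;
#endif
}

// Hypothetical begin hook: sets |*capture_started| to 1 if a capture would be
// started and to 0 if the build has no RenderDoc support. Always returns 0
// (OK) so that callers behave the same in both configurations.
static int sketch_profiling_begin(int* capture_started) {
#if defined(IREE_HAL_VULKAN_HAVE_RENDERDOC)
  // A real implementation would look up the RenderDoc API and start a capture
  // here; this sketch only records that the path was taken.
  *capture_started = 1;
#else
  *capture_started = 0;
#endif
  return 0;
}
```

Keeping the guarded code behind a default-off definition matches the comment in the CMake block: builds that do not opt in contain no RenderDoc hooks at all.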
