Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

plugin-nvidia: collect nvml process metrics #89

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

victoryeagle77
Copy link

@victoryeagle77 victoryeagle77 commented Jan 22, 2025

[ DONE ]

  • Add process_utilization_metrics function, using the NVML "process_utilization_stats" method.
    Allow to pooling running_compute_processes and running_graphics_processes by process PID and LocalMachine GPU consumers, for the final data feedback.

  • The major_utilization metric is now used globally and by PID for graphics and computing processes.
    Used to evaluate frame buffer memory utilization.

  • The encoder_utilization metric is now used globally and by PID for graphics and computing processes.

  • The decoder_utilization metric is now used globally and by PID for graphics and computing processes.

  • The new added metric sm_utilization is used globally and by PID for graphics and computing processes.
    It refers to the percentage of time that the Streaming Multiprocessors (SMs) of a GPU (SM 3D compute utilization).


  • Add GPU temperature monitoring.

  • Refactoring of "nvml.rs" code file in different files in a nvml directory :

    • "device.rs"
    • "features.rs"
    • "metrics.rs"
    • "probe.rs"
  • Add unit tests in nvml directory and "lib.rs"

@TheElectronWill TheElectronWill changed the title Feature/nvml process metrics plugin-nvidia: collect nvml process metrics Jan 23, 2025
@TheElectronWill
Copy link
Contributor

Please rebase on the master branch of upstream. If the upstream remote (alumet-dev/alumet) is named origin in your git config, do:

git fetch origin
git rebase origin/main

Copy link
Contributor

@AngeCyp AngeCyp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you change the plugin-cgroupv2 in a PR about the NVML plugin ?

@victoryeagle77 victoryeagle77 added the A:existing-plugin area: an existing plugin label Jan 27, 2025
@victoryeagle77 victoryeagle77 force-pushed the feature/nvml_process_metrics branch 3 times, most recently from 52114e5 to d1d4991 Compare January 28, 2025 12:30
Copy link
Contributor

@TheElectronWill TheElectronWill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests need to be removed for now, you can keep them in git stash.

Also, it would be better to make this two separate commits:

  1. add the new measurements
  2. refactor in separate files

The goal is that it's easier to review because we see what has changed in the probe, in the current situation I just see "old file deleted" and "new files added".

But it's not a big deal this time, think about it next time ^^

Copy link
Contributor

@TheElectronWill TheElectronWill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are some changes about the documentation. Unlike some other languages, the Rust convention is to write simple sentences that explain what the function does, not to list the parameters or returned result.

@victoryeagle77 victoryeagle77 force-pushed the feature/nvml_process_metrics branch 2 times, most recently from 81a7c10 to d9fe388 Compare January 30, 2025 13:28
@TheElectronWill
Copy link
Contributor

you need to run cargo fmt to format your code

@victoryeagle77 victoryeagle77 force-pushed the feature/nvml_process_metrics branch 2 times, most recently from 506d7ef to 90a89b3 Compare February 3, 2025 15:25
@TheElectronWill
Copy link
Contributor

Please separate the work on jetson in another PR

@victoryeagle77 victoryeagle77 force-pushed the feature/nvml_process_metrics branch from 90a89b3 to c4ba967 Compare February 4, 2025 15:45
@victoryeagle77 victoryeagle77 force-pushed the feature/nvml_process_metrics branch 4 times, most recently from d9a596a to 2caee29 Compare February 11, 2025 13:58
@victoryeagle77 victoryeagle77 force-pushed the feature/nvml_process_metrics branch from 2caee29 to cea07db Compare February 26, 2025 10:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A:existing-plugin area: an existing plugin
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants