
Assistance with NVIDIA Quadro P4000 Integration in Glances (Docker on TrueNAS Scale) #3096

Open
yeeahnick opened this issue Jan 27, 2025 · 13 comments


@yeeahnick

yeeahnick commented Jan 27, 2025

Hello,

I'm encountering an issue where my NVIDIA Quadro P4000 is not being detected by Glances. I'm using the docker-compose (latest-full) configuration and have enabled NVIDIA GPU support in the application settings while building the app in TrueNAS. This configuration sets the NVIDIA_VISIBLE_DEVICES and NVIDIA_DRIVER_CAPABILITIES variables.

With these settings, I can see the NVIDIA driver listed under the file system pane in Glances, but the GPU does not appear when I access the endpoint:
http://IP:61208/api/4/gpu.
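A quick sanity check, sketched here as a diagnostic: confirm that the two variables mentioned above actually reached the container by querying them from the Glances shell.

```shell
# Print the NVIDIA-related variables if set, or flag them as missing.
# These are the variables the TrueNAS GPU toggle is expected to inject.
for v in NVIDIA_VISIBLE_DEVICES NVIDIA_DRIVER_CAPABILITIES; do
  val=$(printenv "$v" || true)
  if [ -n "$val" ]; then
    echo "$v=$val"
  else
    echo "$v is not set"
  fi
done
```

If either variable is missing inside the container, the compose-level GPU settings never made it through.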

Interestingly, when I navigate to http://IP:61208/api/4/full, I can see several NVIDIA-related entries.

To ensure the GPU is properly assigned in the Docker Compose configuration, I ran the following command in the TrueNAS shell:

midclt call -job app.update glances-custom '{"values": {"resources": {"gpus": {"use_all_gpus": false, "nvidia_gpu_selection": {"PCI_SLOT": {"use_gpu": true, "uuid": "GPU-95943d54-8d67-b91e-00cb-ca3662cfd863"}}}}}}'

Despite this, the GPU still doesn’t show up in the /gpu endpoint.

Does anyone have suggestions or insights on what might be missing or misconfigured? Any help would be greatly appreciated!

Thank you!

@yeeahnick yeeahnick changed the title Assistance with NVIDIA Quadro P4000 Integration in Glances (Docker on TrueNAS) Assistance with NVIDIA Quadro P4000 Integration in Glances (Docker on TrueNAS Scale) Jan 27, 2025
@nicolargo
Owner

Hi @yeeahnick

Can you copy/paste the result of a curl on http://ip:61208/api/4/full?

Thanks.

@yeeahnick
Author

Hi @nicolargo

Thanks for the quick response.

Unfortunately, the curl on /full no longer shows the NVIDIA GPU (same thing under file system in Glances). There was a TrueNAS Scale update (24.10.2) yesterday that included NVIDIA fixes, which I guess made it worse for Glances. To be clear, my GPU is working in other containers running on the same system.

But I can give more information.

When I run "ls /dev | grep nvidia" in the shell of Glances I see the following:

nvidia-caps
nvidia-modeset
nvidia-uvm
nvidia-uvm-tools
nvidia0
nvidiactl

When I run nvidia-smi, nothing is found (it works in other containers on the same system).

When I run "env" in the Glances shell, I see that the NVIDIA capabilities and devices environment variables are set.

When I run "glances | grep -i runtime" in the Glances shell, it just hangs.

I will fiddle with it again tonight to see if I can repopulate the curl /full.

Let me know if I need to provide anything else.

Cheers!

@nicolargo
Owner

In the shell of Glances, can you run the following command:

glances -V

It will display the path to the glances.log file.

then run:

glances -d --stdout gpu --stop-after 3

And copy paste:

  • the glances.log file (relevant lines)
  • the output of the command

Thanks !

@XSvirusSAFE

XSvirusSAFE commented Jan 30, 2025

Having the same issue here. Hope the info within the screenshot can help. [screenshot attached]

@yeeahnick
Author

yeeahnick commented Jan 30, 2025

@nicolargo

[two screenshots attached]

@yeeahnick
Author

yeeahnick commented Jan 30, 2025

> Having the same issue here. Hope the info within the screenshot can help.

You can run this "cat /tmp/glances-root.log" in the Glances shell to view the log file.

@kbirger

kbirger commented Jan 31, 2025

Same exact results here. I have also noticed that inside the container, nvidia-smi reports "not found". Bizarre, because it's there:

/app # which nvidia-smi
/usr/bin/nvidia-smi
/app # ls -l /usr/bin/nvidia-smi
-rwxr-xr-x    1 root     root       1068640 Jan 30 04:54 /usr/bin/nvidia-smi
/app # stat /usr/bin/nvidia-smi
  File: /usr/bin/nvidia-smi
  Size: 1068640         Blocks: 2088       IO Block: 4096   regular file
Device: fc06h/64518d    Inode: 9710205     Links: 1
Access: (0755/-rwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2025-01-30 04:55:14.164896111 +0000
Modify: 2025-01-30 04:54:48.715960042 +0000
Change: 2025-01-30 04:54:48.715960042 +0000
/app # nvidia-smi
sh: nvidia-smi: not found
/app # /usr/bin/nvidia-smi
sh: /usr/bin/nvidia-smi: not found
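One hedged hypothesis for a shell printing "not found" on a binary that clearly exists: the binary requests an ELF interpreter (dynamic loader) that is absent from the image, e.g. a glibc-linked nvidia-smi mounted into a musl-based (Alpine) container, where /lib64/ld-linux-x86-64.so.2 does not exist. The kernel then fails the exec with ENOENT, and the shell misreports it as "not found". A sketch of how to check:

```shell
# Inspect which ELF interpreter a binary requests; if that path is
# missing inside the container, exec fails with ENOENT and the shell
# prints the misleading "not found".
BIN=/bin/ls   # substitute /usr/bin/nvidia-smi inside the container
[ -e "$BIN" ] && echo "binary exists"
# readelf may not be installed in a slim image, hence the guard:
if command -v readelf >/dev/null 2>&1; then
  readelf -l "$BIN" | grep -i interpreter || echo "no interpreter requested"
fi
```

If the requested interpreter path does not exist inside the container, that would explain the symptom above.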

/app # id
uid=0(root) gid=0(root) groups=0(root),1(bin),2(daemon),3(sys),4(adm),6(disk),10(wheel),11(floppy),20(dialout),26(tape),27(video)

from the root log:

2025-01-31 01:57:40,193 -- DEBUG -- NVML Shared Library (libnvidia-ml.so.1) not Found, Nvidia GPU plugin is disabled

However, I've got other containers on the system that use the GPU with no problem.

Please let me know if you want to see any other parts of the log.

@yeeahnick
Author

Same with nvidia-smi:

[screenshot attached]

@nicolargo
Owner

Glances binds directly to the libnvidia-ml.so.1 file. Check that this file is available on your system:

find /usr -name 'libnvidia-ml.so*'

The folder where this file is located should be added to LD_LIBRARY_PATH.

So, long story short, it's more a TrueNAS integration issue than a Glances bug.
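As a quick way to test whether the dynamic loader can actually resolve the library once LD_LIBRARY_PATH is set, a dlopen probe like the following can help (a sketch; /usr/lib64 is one example location, adjust it to wherever the find command above locates libnvidia-ml.so.1). Glances' NVML plugin loads this library through Python bindings, so a plain ctypes load is a reasonable stand-in:

```shell
# Point the loader at the directory that holds libnvidia-ml.so.1,
# then attempt a dlopen via Python's ctypes.
export LD_LIBRARY_PATH=/usr/lib64:${LD_LIBRARY_PATH:-}
python3 - <<'EOF'
import ctypes
try:
    ctypes.CDLL("libnvidia-ml.so.1")
    print("NVML library loaded")
except OSError as err:
    print("NVML library not found:", err)
EOF
```

If this prints "not found" even with the variable set, note that LD_LIBRARY_PATH must be present in the environment of the Glances process itself, not just in an interactive shell.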

@XSvirusSAFE

Please see the attached. [screenshot attached]

@yeeahnick
Author

yeeahnick commented Feb 3, 2025

> Glances binds directly to the libnvidia-ml.so.1 file. Check that this file is available on your system:
>
> find /usr -name 'libnvidia-ml.so*'
>
> The folder where this file is located should be added to LD_LIBRARY_PATH.
>
> So, long story short, it's more a TrueNAS integration issue than a Glances bug.

Hello, and thank you for getting involved with this issue.

Is there something that can be done with the Glances container to fix this? I doubt TrueNAS will look at it, since all my other containers and community apps have working GPU access without doing anything special (Immich, Plex, MKVToolNix, Dashdot, etc.).

I set LD_LIBRARY_PATH to /usr/lib/x86_64-linux-gnu (and also tried /usr/lib64) as an environment variable on the container, but it didn't change anything. I did the same for LD_PRELOAD.

I also tried the alpine-dev tag but got the same results (though I do see more info, like the IP). I also tried the official TrueNAS community app, but that one doesn't support GPUs at all.

Here is the result of that command in the Glances shell:

/app # find /usr -name 'libnvidia-ml.so*'
/usr/lib64/libnvidia-ml.so.1
/usr/lib64/libnvidia-ml.so.550.127.05
/app # 

/app # ls /usr/lib64
libEGL_nvidia.so.0                       libcudadebugger.so.550.127.05            libnvidia-glsi.so.550.127.05             libnvidia-opticalflow.so.550.127.05
libEGL_nvidia.so.550.127.05              libnvcuvid.so.1                          libnvidia-glvkspirv.so.550.127.05        libnvidia-pkcs11-openssl3.so.550.127.05
libGLESv1_CM_nvidia.so.1                 libnvcuvid.so.550.127.05                 libnvidia-gpucomp.so.550.127.05          libnvidia-pkcs11.so.550.127.05
libGLESv1_CM_nvidia.so.550.127.05        libnvidia-allocator.so.1                 libnvidia-ml.so.1                        libnvidia-ptxjitcompiler.so.1
libGLESv2_nvidia.so.2                    libnvidia-allocator.so.550.127.05        libnvidia-ml.so.550.127.05               libnvidia-ptxjitcompiler.so.550.127.05
libGLESv2_nvidia.so.550.127.05           libnvidia-cfg.so.1                       libnvidia-ngx.so.1                       libnvidia-rtcore.so.550.127.05
libGLX_indirect.so.0                     libnvidia-cfg.so.550.127.05              libnvidia-ngx.so.550.127.05              libnvidia-tls.so.550.127.05
libGLX_nvidia.so.0                       libnvidia-eglcore.so.550.127.05          libnvidia-nvvm.so.4                      libnvoptix.so.1
libGLX_nvidia.so.550.127.05              libnvidia-encode.so.1                    libnvidia-nvvm.so.550.127.05             libnvoptix.so.550.127.05
libcuda.so                               libnvidia-encode.so.550.127.05           libnvidia-opencl.so.1                    libvdpau_nvidia.so.1
libcuda.so.1                             libnvidia-fbc.so.1                       libnvidia-opencl.so.550.127.05           libvdpau_nvidia.so.550.127.05
libcuda.so.550.127.05                    libnvidia-fbc.so.550.127.05              libnvidia-opticalflow.so                 xorg
libcudadebugger.so.1                     libnvidia-glcore.so.550.127.05           libnvidia-opticalflow.so.1

Here is the result of that command in the TrueNAS shell:

root@truenas[~]# find /usr -name 'libnvidia-ml.so*'
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.550.127.05
root@truenas[~]# 

root@truenas[~]# nvidia-container-cli list
/dev/nvidiactl
/dev/nvidia-uvm
/dev/nvidia-uvm-tools
/dev/nvidia-modeset
/dev/nvidia0
/usr/bin/nvidia-smi
/usr/bin/nvidia-debugdump
/usr/bin/nvidia-persistenced
/usr/bin/nvidia-cuda-mps-control
/usr/bin/nvidia-cuda-mps-server
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.550.127.05
/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.550.127.05
/usr/lib/x86_64-linux-gnu/libcuda.so.550.127.05
/usr/lib/x86_64-linux-gnu/libcudadebugger.so.550.127.05
/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.550.127.05
/usr/lib/x86_64-linux-gnu/libnvidia-gpucomp.so.550.127.05
/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.550.127.05
/usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.550.127.05
/usr/lib/x86_64-linux-gnu/libnvidia-pkcs11.so.550.127.05
/usr/lib/x86_64-linux-gnu/libnvidia-pkcs11-openssl3.so.550.127.05
/usr/lib/x86_64-linux-gnu/libnvidia-nvvm.so.550.127.05
/usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.550.127.05
/usr/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia.so.550.127.05
/usr/lib/x86_64-linux-gnu/libnvidia-encode.so.550.127.05
/usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.550.127.05
/usr/lib/x86_64-linux-gnu/libnvcuvid.so.550.127.05
/usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.550.127.05
/usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.550.127.05
/usr/lib/x86_64-linux-gnu/libnvidia-tls.so.550.127.05
/usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.550.127.05
/usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.550.127.05
/usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.550.127.05
/usr/lib/x86_64-linux-gnu/libnvoptix.so.550.127.05
/usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.550.127.05
/usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.550.127.05
/usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.550.127.05
/usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.550.127.05
/usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.550.127.05
/lib/firmware/nvidia/550.127.05/gsp_ga10x.bin
/lib/firmware/nvidia/550.127.05/gsp_tu10x.bin

root@truenas[~]# nvidia-container-cli --version
cli-version: 1.17.4
lib-version: 1.17.4
build date: 2025-01-23T10:53+00:00
build revision: f23e5e55ea27b3680aef363436d4bcf7659e0bfc
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections



@kbirger

kbirger commented Feb 3, 2025

> Glances binds directly to the libnvidia-ml.so.1 file. Check that this file is available on your system:
>
> find /usr -name 'libnvidia-ml.so*'
>
> The folder where this file is located should be added to LD_LIBRARY_PATH.
>
> So, long story short, it's more a TrueNAS integration issue than a Glances bug.

I'm not on TrueNAS, actually; I'm on Proxmox, which is Debian-based.

libnvidia-ml.so is found both on my host and in the container.

From the container, ls -la /usr/lib64:

total 219456
drwxr-xr-x    2 root     root          4096 Feb  3 15:28 .
drwxr-xr-x    1 root     root          4096 Feb  3 15:28 ..
lrwxrwxrwx    1 root     root            12 Feb  3 15:28 libcuda.so -> libcuda.so.1
lrwxrwxrwx    1 root     root            21 Feb  3 15:28 libcuda.so.1 -> libcuda.so.550.144.03
-rwxr-xr-x    1 root     root      28712096 Jan 30 04:54 libcuda.so.550.144.03
lrwxrwxrwx    1 root     root            29 Feb  3 15:28 libcudadebugger.so.1 -> libcudadebugger.so.550.144.03
-rwxr-xr-x    1 root     root      10524136 Jan 30 04:54 libcudadebugger.so.550.144.03
lrwxrwxrwx    1 root     root            33 Feb  3 15:28 libnvidia-allocator.so.1 -> libnvidia-allocator.so.550.144.03
-rwxr-xr-x    1 root     root        168808 Jan 30 04:54 libnvidia-allocator.so.550.144.03
lrwxrwxrwx    1 root     root            27 Feb  3 15:28 libnvidia-cfg.so.1 -> libnvidia-cfg.so.550.144.03
-rwxr-xr-x    1 root     root        398968 Jan 30 04:54 libnvidia-cfg.so.550.144.03
-rwxr-xr-x    1 root     root      43659040 Jan 30 04:54 libnvidia-gpucomp.so.550.144.03
lrwxrwxrwx    1 root     root            26 Feb  3 15:28 libnvidia-ml.so.1 -> libnvidia-ml.so.550.144.03
-rwxr-xr-x    1 root     root       2082456 Jan 30 04:54 libnvidia-ml.so.550.144.03
lrwxrwxrwx    1 root     root            28 Feb  3 15:28 libnvidia-nvvm.so.4 -> libnvidia-nvvm.so.550.144.03
-rwxr-xr-x    1 root     root      86842616 Jan 30 04:54 libnvidia-nvvm.so.550.144.03
lrwxrwxrwx    1 root     root            30 Feb  3 15:28 libnvidia-opencl.so.1 -> libnvidia-opencl.so.550.144.03
-rwxr-xr-x    1 root     root      23613128 Jan 30 04:54 libnvidia-opencl.so.550.144.03
-rwxr-xr-x    1 root     root         10176 Jan 30 04:54 libnvidia-pkcs11-openssl3.so.550.144.03
-rwxr-xr-x    1 root     root         10168 Jan 30 04:54 libnvidia-pkcs11.so.550.144.03
lrwxrwxrwx    1 root     root            38 Feb  3 15:28 libnvidia-ptxjitcompiler.so.1 -> libnvidia-ptxjitcompiler.so.550.144.03
-rwxr-xr-x    1 root     root      28674464 Jan 30 04:54 libnvidia-ptxjitcompiler.so.550.144.03
/app #

I assume you meant that the env var must be set to the path inside the container; otherwise it would also be necessary to mount the file from the host.

Setting it to /usr/lib64 doesn't make a difference.

@yeeahnick
Author

yeeahnick commented Feb 8, 2025

@nicolargo

Hi,

Can we change the label from "needs more info" to "needs investigation"?

A few of us have provided a lot of info, and we all see the same results. Both TrueNAS and Proxmox are affected.

Thanks

4 participants