Gpuless deployments #469

Open
sandwichdoge opened this issue Sep 5, 2024 · 3 comments

@sandwichdoge

Hello, I'd like to create a k8s deployment without GPUs; however, my nvidia.com/gpu config doesn't work:

      limits:
        cpu: "2"
        memory: 4Gi
        nvidia.com/gpu: "0"
      requests:
        cpu: "2"
        memory: 4Gi
        nvidia.com/gpu: "0"

I confirmed that requesting 1 or more GPUs with a certain amount of VRAM works, but requesting "0" simply exposes all of the GPUs on that worker node to the running pod(s):

kubectl -n 0cd68651-0852-4cfa-9ebc-b2c42f02f746 exec -it nogpus-776f988c55-9cq84 nvidia-smi
Defaulted container "nogpus" out of: nogpus, init-chown-data (init)
Thu Sep  5 08:13:48 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A40                     Off |   00000000:00:10.0 Off |                  Off |
|  0%   23C    P8             20W /  300W |       1MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Here's my full deployment.yaml file:

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
    hami.io/gpu-scheduler-policy: ""
    hami.io/node-scheduler-policy: ""
  creationTimestamp: "2024-09-05T08:13:15Z"
  generation: 1
  labels:
    app: nogpus
  name: nogpus
  namespace: 0cd68651-0852-4cfa-9ebc-b2c42f02f746
  resourceVersion: "27240742"
  uid: c1b6d40b-ff10-4b51-b288-5b1d2322dcf4
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 1
  selector:
    matchLabels:
      app: nogpus
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: nogpus
    spec:
      containers:
      - env:
        - name: NOTEBOOK_ARGS
          value: --NotebookApp.token='9cd4990a-e599-4a74-8500-e4d42149738b'
        image: registry.fusionflow.cloud/notebook/pytorch-notebook:cuda12-python-3.11-nvdashboard
        imagePullPolicy: IfNotPresent
        name: nogpus
        ports:
        - containerPort: 8888
          protocol: TCP
        resources:
          limits:
            cpu: "2"
            ephemeral-storage: 200Mi
            memory: 4Gi
            nvidia.com/gpu: "0"
          requests:
            cpu: "2"
            ephemeral-storage: 200Mi
            memory: 4Gi
            nvidia.com/gpu: "0"
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /home/jovyan/work
          name: notebook-data
      dnsPolicy: ClusterFirst
      imagePullSecrets:
      - name: nogpus
      initContainers:
      - command:
        - /bin/chown
        - -R
        - 1000:100
        - /home/jovyan/work
        image: mirror.gcr.io/library/busybox:1.31.1
        imagePullPolicy: IfNotPresent
        name: init-chown-data
        resources: {}
        securityContext:
          privileged: true
          runAsUser: 0
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /home/jovyan/work
          name: notebook-data
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - name: notebook-data
        persistentVolumeClaim:
          claimName: notebook-data
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2024-09-05T08:13:27Z"
    lastUpdateTime: "2024-09-05T08:13:27Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  - lastTransitionTime: "2024-09-05T08:13:15Z"
    lastUpdateTime: "2024-09-05T08:13:27Z"
    message: ReplicaSet "nogpus-776f988c55" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  observedGeneration: 1
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1

Any pointers? Thank you.

@archlitchi
Collaborator

Yes, you should add the env 'NVIDIA_VISIBLE_DEVICES=none' to this container. Please refer to issue #464.
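
For example, here is a minimal sketch of the relevant part of the container spec with that variable added (only the NVIDIA_VISIBLE_DEVICES entry is new; the other fields are copied from the deployment above):

      containers:
      - name: nogpus
        image: registry.fusionflow.cloud/notebook/pytorch-notebook:cuda12-python-3.11-nvdashboard
        env:
        # hide all GPUs from this GPU-less container
        - name: NVIDIA_VISIBLE_DEVICES
          value: "none"
        - name: NOTEBOOK_ARGS
          value: --NotebookApp.token='9cd4990a-e599-4a74-8500-e4d42149738b'

With this set, the NVIDIA container runtime should not inject any GPU devices into the pod.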

@sandwichdoge
Author

sandwichdoge commented Sep 5, 2024

@archlitchi Thanks for the reply. If the pod user overrides this env var, will they still be able to see all the GPUs? I'm working in a low-trust environment where the pod user should only be able to use their own allocated VRAM.

I'm aware there's an option to prevent the pod user from overriding env vars:

sudo vi /etc/nvidia-container-runtime/config.toml
# need these lines:
accept-nvidia-visible-devices-as-volume-mounts = true
accept-nvidia-visible-devices-envvar-when-unprivileged = false

However, enabling these lines causes pods with allocated GPUs to crash with this error:

NAME                    READY   STATUS             RESTARTS     AGE
gpus-5bcbc4d55b-zkcsz   0/1     CrashLoopBackOff   1 (2s ago)   5s

kubectl -n 09e5313f-659a-499a-9085-e600df6ea705 logs -f gpus-5bcbc4d55b-zkcsz
Defaulted container "gpus" out of: gpus, init-chown-data (init)
tini: error while loading shared libraries: libcuda.so.1: cannot open shared object file: No such file or directory

@archlitchi
Collaborator

Indeed, if you enable those lines, the device plugin will not work properly, because it needs to set 'NVIDIA_VISIBLE_DEVICES' in order to assign GPUs to pods.
Also, a user can bake that env var directly into the image, which those settings cannot catch.
The best practice is to add a mutating webhook configuration for each pod that injects 'NVIDIA_VISIBLE_DEVICES=none' into each container.
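
For illustration, a hedged sketch of what such a MutatingWebhookConfiguration could look like (the webhook name, service name, namespace, and path below are hypothetical; you still need to deploy a small admission server behind it that returns the JSON patch described in the comments):

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: nvidia-visible-devices-injector    # hypothetical name
webhooks:
- name: env-injector.example.com           # hypothetical webhook name
  admissionReviewVersions: ["v1"]
  sideEffects: None
  failurePolicy: Fail
  rules:
  - apiGroups: [""]
    apiVersions: ["v1"]
    operations: ["CREATE"]
    resources: ["pods"]
  clientConfig:
    service:
      name: env-injector                   # hypothetical Service fronting the webhook server
      namespace: kube-system
      path: /mutate
    caBundle: <base64-encoded CA that signed the webhook server certificate>
  # The webhook server would respond with a JSON patch that appends
  #   {"name": "NVIDIA_VISIBLE_DEVICES", "value": "none"}
  # to each container's env (or, more selectively, only to containers
  # that do not request GPU resources).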
