
[BUG] Dashboard cannot recognize GPU #185

Closed
CaRRotOne opened this issue Sep 27, 2021 · 5 comments

@CaRRotOne

What happened:
The dashboard cannot recognize the GPU. The GPU is an NVIDIA card, and the corresponding NVIDIA device plugin is already installed in the Kubernetes cluster.

What you expected to happen:
The dashboard should display the GPU correctly.

How to reproduce it:

Anything else we need to know?:

Environment:

  • KubeDL version: latest
  • Kubernetes version (use kubectl version): v1.17.17
  • OS (e.g: cat /etc/os-release): ubuntu
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
    (screenshot attached)
@CaRRotOne CaRRotOne changed the title from "[BUG]" to "[BUG] Dashboard cannot recognize GPU" on Sep 27, 2021
@SimonCqk
Collaborator

@CaRRotOne OK, we'll take a look.

Could you please share the cluster's node info, confirm whether the nvidia plugin is working properly, and check whether the node's allocatable resources are being reported as expected?
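
For reference, a quick way to check the allocatable GPU count (a sketch, assuming the standard nvidia.com/gpu resource name and a placeholder node name):

kubectl describe node <node-name> | grep nvidia.com/gpu
# nvidia.com/gpu should appear with the same non-zero count under both Capacity and Allocatable;
# if Allocatable shows 0, the device plugin is not exposing the GPUs to the kubelet.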

@CaRRotOne
Author

CaRRotOne commented Sep 27, 2021

@CaRRotOne OK, we'll take a look.

Could you please share the cluster's node info, confirm whether the nvidia plugin is working properly, and check whether the node's allocatable resources are being reported as expected?

@SimonCqk
The cluster is a Kubernetes cluster set up with Rancher; details are below. In the node info, the GPU Capacity is 4, but the Allocatable GPU count is 0.

kubectl cluster-info
Kubernetes master is running at https://ml.rancher.pudu.cn:9443/k8s/clusters/c-tzdlr
CoreDNS is running at https://ml.rancher.pudu.cn:9443/k8s/clusters/c-tzdlr/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
kubectl get nodes
NAME                  STATUS   ROLES                      AGE     VERSION
dell-poweredge-t640   Ready    controlplane,etcd,worker   2d23h   v1.17.17

The nvidia plugin is working properly:

kubectl get pods -n kube-system | grep nvidia
nvidia-device-plugin-daemonset-sfckg      1/1     Running     0          41m

Node info:

Name:               dell-poweredge-t640
Roles:              controlplane,etcd,worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=dell-poweredge-t640
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/controlplane=true
                    node-role.kubernetes.io/etcd=true
                    node-role.kubernetes.io/worker=true
Annotations:        flannel.alpha.coreos.com/backend-data: {"VtepMAC":"f6:e5:f7:7e:d8:d6"}
                    flannel.alpha.coreos.com/backend-type: vxlan
                    flannel.alpha.coreos.com/kube-subnet-manager: true
                    flannel.alpha.coreos.com/public-ip: 192.168.161.90
                    node.alpha.kubernetes.io/ttl: 0
                    rke.cattle.io/external-ip: 192.168.161.90
                    rke.cattle.io/internal-ip: 192.168.161.90
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Fri, 24 Sep 2021 05:35:25 -0400
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  dell-poweredge-t640
  AcquireTime:     <unset>
  RenewTime:       Mon, 27 Sep 2021 08:20:40 -0400
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Mon, 27 Sep 2021 08:13:04 -0400   Mon, 27 Sep 2021 08:13:04 -0400   FlannelIsUp                  Flannel is running on this node
  MemoryPressure       False   Mon, 27 Sep 2021 08:19:02 -0400   Fri, 24 Sep 2021 05:35:25 -0400   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Mon, 27 Sep 2021 08:19:02 -0400   Fri, 24 Sep 2021 05:35:25 -0400   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Mon, 27 Sep 2021 08:19:02 -0400   Fri, 24 Sep 2021 05:35:25 -0400   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Mon, 27 Sep 2021 08:19:02 -0400   Mon, 27 Sep 2021 08:13:01 -0400   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  192.168.161.90
  Hostname:    dell-poweredge-t640
Capacity:
  cpu:                40
  ephemeral-storage:  459924552Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             131501508Ki
  nvidia.com/gpu:     4
  pods:               110
Allocatable:
  cpu:                40
  ephemeral-storage:  423866466422
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             131399108Ki
  nvidia.com/gpu:     0
  pods:               110
System Info:
  Machine ID:                 a8eb6cac33e701ae867269db5ce80e7f
  System UUID:                4c4c4544-0058-3010-8038-b3c04f4a4633
  Boot ID:                    aa03ac05-95f8-4d85-9c14-48d761375c2d
  Kernel Version:             5.4.0-42-generic
  OS Image:                   Ubuntu 18.04.5 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://20.10.7
  Kubelet Version:            v1.17.17
  Kube-Proxy Version:         v1.17.17
PodCIDR:                      10.42.0.0/24
PodCIDRs:                     10.42.0.0/24
Non-terminated Pods:          (24 in total)
  Namespace                   Name                                                       CPU Requests  CPU Limits   Memory Requests  Memory Limits  AGE
  ---------                   ----                                                       ------------  ----------   ---------------  -------------  ---
  cattle-prometheus           exporter-kube-state-cluster-monitoring-5dd6d5c9fd-qzfq4    100m (0%)     100m (0%)    130Mi (0%)       200Mi (0%)     34h
  cattle-prometheus           exporter-node-cluster-monitoring-r9f79                     100m (0%)     200m (0%)    30Mi (0%)        200Mi (0%)     34h
  cattle-prometheus           grafana-cluster-monitoring-75c5cd5995-m77pz                150m (0%)     300m (0%)    150Mi (0%)       300Mi (0%)     34h
  cattle-prometheus           prometheus-cluster-monitoring-0                            1100m (2%)    1800m (4%)   950Mi (0%)       1350Mi (1%)    34h
  cattle-prometheus           prometheus-operator-monitoring-operator-f9b9567b-hklgl     100m (0%)     200m (0%)    100Mi (0%)       500Mi (0%)     34h
  cattle-system               cattle-cluster-agent-6cc5cdcc54-5sq4j                      0 (0%)        0 (0%)       0 (0%)           0 (0%)         2d22h
  cattle-system               cattle-node-agent-867qh                                    0 (0%)        0 (0%)       0 (0%)           0 (0%)         2d22h
  cattle-system               kube-api-auth-t24lb                                        0 (0%)        0 (0%)       0 (0%)           0 (0%)         3d2h
  ingress-nginx               nginx-ingress-controller-ftjzp                             0 (0%)        0 (0%)       0 (0%)           0 (0%)         3d2h
  istio-system                istio-citadel-66864ff6b8-smnbx                             10m (0%)      0 (0%)       0 (0%)           0 (0%)         34h
  istio-system                istio-galley-5bd9bf8b9c-wc8gg                              10m (0%)      0 (0%)       0 (0%)           0 (0%)         34h
  istio-system                istio-pilot-674bdcbbf9-v2zcl                               600m (1%)     3 (7%)       2176Mi (1%)      5Gi (3%)       34h
  istio-system                istio-policy-6d9f4577db-s96ht                              1100m (2%)    6800m (17%)  1152Mi (0%)      5Gi (3%)       34h
  istio-system                istio-sidecar-injector-9bcfb645-vp22d                      10m (0%)      0 (0%)       0 (0%)           0 (0%)         34h
  istio-system                istio-telemetry-664b6dfd44-df5sq                           1100m (2%)    6800m (17%)  1152Mi (0%)      5Gi (3%)       34h
  istio-system                istio-tracing-cc6c8c677-crd6g                              100m (0%)     500m (1%)    100Mi (0%)       1Gi (0%)       34h
  istio-system                kiali-79c4c46468-pb5dv                                     10m (0%)      0 (0%)       0 (0%)           0 (0%)         34h
  kube-system                 coredns-6b84d75d99-5dvkt                                   100m (0%)     0 (0%)       70Mi (0%)        170Mi (0%)     3d2h
  kube-system                 coredns-autoscaler-5c4b6999d9-qt25w                        20m (0%)      0 (0%)       10Mi (0%)        0 (0%)         3d2h
  kube-system                 kube-flannel-slmh5                                         100m (0%)     100m (0%)    50Mi (0%)        50Mi (0%)      3d2h
  kube-system                 metrics-server-7579449c57-t9jmf                            0 (0%)        0 (0%)       0 (0%)           0 (0%)         3d2h
  kube-system                 nvidia-device-plugin-daemonset-sfckg                       0 (0%)        0 (0%)       0 (0%)           0 (0%)         4h15m
  kubedl-system               kubedl-7f4c55dfc9-8n2pc                                    1024m (2%)    2048m (5%)   1Gi (0%)         2Gi (1%)       150m
  kubedl-system               kubedl-dashboard-787b49c8d7-7lbmg                          1 (2%)        0 (0%)       500Mi (0%)       0 (0%)         150m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests     Limits
  --------           --------     ------
  cpu                6734m (16%)  21848m (54%)
  memory             7594Mi (5%)  21202Mi (16%)
  ephemeral-storage  0 (0%)       0 (0%)
  nvidia.com/gpu     0            0
Events:
  Type    Reason                   Age                    From                             Message
  ----    ------                   ----                   ----                             -------
  Normal  NodeAllocatableEnforced  7m49s                  kubelet, dell-poweredge-t640     Updated Node Allocatable limit across pods
  Normal  NodeHasSufficientMemory  7m48s (x3 over 7m49s)  kubelet, dell-poweredge-t640     Node dell-poweredge-t640 status is now: NodeHasSufficientMemory
  Normal  NodeHasNoDiskPressure    7m48s (x3 over 7m49s)  kubelet, dell-poweredge-t640     Node dell-poweredge-t640 status is now: NodeHasNoDiskPressure
  Normal  NodeHasSufficientPID     7m48s (x3 over 7m49s)  kubelet, dell-poweredge-t640     Node dell-poweredge-t640 status is now: NodeHasSufficientPID
  Normal  NodeNotReady             7m48s                  kubelet, dell-poweredge-t640     Node dell-poweredge-t640 status is now: NodeNotReady
  Normal  NodeReady                7m47s                  kubelet, dell-poweredge-t640     Node dell-poweredge-t640 status is now: NodeReady
  Normal  Starting                 7m46s                  kube-proxy, dell-poweredge-t640  Starting kube-proxy.

@CaRRotOne
Author

Resolved: Docker's default-runtime needs to be set to nvidia.
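
For anyone hitting the same problem, a minimal sketch of that Docker configuration (contents assumed from the standard nvidia-container-runtime setup, not copied from this thread; merge with any existing /etc/docker/daemon.json):

# /etc/docker/daemon.json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}

# Then restart Docker and the device plugin so Allocatable is re-reported
# (daemonset name taken from the pod name shown above):
sudo systemctl restart docker
kubectl -n kube-system rollout restart daemonset nvidia-device-plugin-daemonset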

@SimonCqk
Collaborator

Resolved: Docker's default-runtime needs to be set to nvidia.

Thanks for following up; I'll close this issue then :)

@CaRRotOne
Author

@SimonCqk OK, thanks. I'll keep following KubeDL; hope the product keeps getting better!
