Doc updates

Update README for 1.2.0 release

dualvtable committed Apr 6, 2020
1 parent b144ffc commit 60e8d8a
Showing 1 changed file (README.md) with 21 additions and 12 deletions.
@@ -1,9 +1,7 @@
# NVIDIA GPU Operator

Kubernetes provides access to special hardware resources such as NVIDIA GPUs, NICs, InfiniBand adapters and other devices through the [device plugin framework](https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/). However, configuring and managing nodes with these hardware resources requires configuring multiple software components such as drivers, container runtimes and other libraries, which is difficult and error-prone.
The NVIDIA GPU Operator uses the [operator framework](https://coreos.com/blog/introducing-operator-framework) within Kubernetes to automate the management of all NVIDIA software components needed to provision GPUs. These components include the NVIDIA drivers (to enable CUDA), the Kubernetes device plugin for GPUs, the NVIDIA Container Runtime, automatic node labelling, [DCGM](https://developer.nvidia.com/dcgm) based monitoring and others.
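
As an illustration of what these components enable, here is a minimal sketch of a pod that consumes a GPU once the operator has provisioned a node. The `nvidia.com/gpu` resource name is the one advertised by the NVIDIA device plugin; the sample image is an assumption chosen for illustration:

```sh
# Hypothetical example: schedule a CUDA container on a GPU node.
# The nvidia.com/gpu resource is advertised by the NVIDIA device plugin.
$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: nvidia/samples:vectoradd-cuda10.2
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
```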

## Audience and Use-Cases
The GPU Operator allows administrators of Kubernetes clusters to manage GPU nodes just like CPU nodes in the cluster. Instead of provisioning a special OS image for GPU nodes, administrators can use a standard OS image for both CPU and GPU nodes and then rely on the GPU Operator to provision the required software components for GPUs.
@@ -18,15 +16,15 @@
- Kubernetes v1.13+
- Note that the Kubernetes community supports only the last three minor releases as of v1.17. Older releases may be supported through enterprise distributions of Kubernetes such as Red Hat OpenShift. See the prerequisites for enabling monitoring in Kubernetes releases before v1.16.
- Helm v3 (v3.1.1)
- Docker CE 19.03.z
- Red Hat OpenShift 4.1, 4.2 and 4.3 using Red Hat Enterprise Linux CoreOS (RHCOS) and CRI-O container runtime
- Ubuntu 18.04.z LTS
- Note that the GA release has been validated with the 4.15 LTS kernel. When using the HWE kernel (e.g. v5.3), there are additional prerequisites before deploying the operator.
- The GPU Operator has been validated with the following NVIDIA components (a driver version check sketch follows this list):
- NVIDIA Container Toolkit 1.0.5
- NVIDIA Kubernetes Device Plugin 1.0.0-beta4
- NVIDIA Tesla Driver 440 (current release is 440.64.00; see the driver [release notes](https://docs.nvidia.com/datacenter/tesla/#r440-driver-release-notes))
- NVIDIA DCGM 1.7.2
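
To confirm which driver version is actually running on a GPU node, a quick check (a sketch; assumes shell access to the node or to the operator's driver container):

```sh
# Print the driver version on a GPU node; expected output for this
# release is 440.64.00 (run on the node, or exec into the driver container).
$ nvidia-smi --query-gpu=driver_version --format=csv,noheader
```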


## Getting Started
@@ -39,7 +37,7 @@
```sh
$ helm install --devel --set nfd.enabled=false nvidia/gpu-operator -n test-operator
```
- See notes on [NFD setup](https://github.com/kubernetes-sigs/node-feature-discovery)
- For monitoring in Kubernetes 1.13 and 1.14, enable the kubelet ["KubeletPodResources" feature gate](https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/). From Kubernetes 1.15 onwards, it's enabled by default. After setting the flag, restart the kubelet as sketched below.
```sh
$ echo -e "KUBELET_EXTRA_ARGS=--feature-gates=KubeletPodResources=true" | sudo tee /etc/default/kubelet
```
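
For the new flag to take effect, the kubelet typically needs a restart; a sketch assuming a systemd-managed kubelet (e.g. from a kubeadm install):

```sh
# Restart the kubelet so it picks up the new feature-gate flag
# (assumes kubelet runs as a systemd service, as with kubeadm installs).
$ sudo systemctl daemon-reload
$ sudo systemctl restart kubelet
```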
@@ -184,18 +182,29 @@
```sh
$ curl $prom_server_ip:9090
# Import this GPU metrics dashboard from Grafana https://grafana.com/grafana/dashboards/11578
```
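
To verify that DCGM metrics are flowing before importing the dashboard, Prometheus can be queried directly. A sketch using the standard Prometheus HTTP API; `DCGM_FI_DEV_GPU_UTIL` is the GPU utilization metric reported by DCGM, so adjust the name if your exporter version differs:

```sh
# Query current GPU utilization through the Prometheus HTTP API.
# DCGM_FI_DEV_GPU_UTIL is the per-GPU utilization metric reported by DCGM.
$ curl -s "http://$prom_server_ip:9090/api/v1/query?query=DCGM_FI_DEV_GPU_UTIL"
```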

## Changelog
### v1.2.0
#### New Features
- DCGM is now deployed as part of the GPU Operator on OpenShift 4.3
#### Improvements
#### Fixed Issues
#### Known Limitations
- GPU Operator will fail on nodes already set up with NVIDIA components (driver, runtime, device plugin); a pre-check sketch follows this list. Support for better error handling will be added in a future release.
- The GPU Operator currently does not handle updates to the underlying software components (e.g. drivers) in an automated manner.
- This release of the operator does not support accessing images from private registries, which may be required for air-gapped deployments.
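
To avoid hitting the first limitation above, a minimal pre-check can be run on each candidate GPU node before installing (a sketch; assumes shell access to the node):

```sh
# Pre-check for NVIDIA components that would conflict with the operator.
# Any output here means the node needs cleanup (or exclusion) first.
$ lsmod | grep -i nvidia           # kernel driver already loaded?
$ command -v nvidia-smi            # driver utilities already installed?
$ command -v nvidia-container-cli  # NVIDIA container runtime present?
```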

### v1.0.0
#### New Features
- Added support for Helm v3. Note that installing the GPU Operator using Helm v2 is no longer supported.
- Added support for Red Hat OpenShift 4 (4.1, 4.2 and 4.3) using Red Hat Enterprise Linux CoreOS (RHCOS) and the CRI-O runtime on GPU worker nodes.
- GPU Operator now deploys NVIDIA DCGM for GPU telemetry on Ubuntu 18.04 LTS.

#### Fixed Issues
- The driver container now sets up the required dependencies on the i2c and ipmi_msghandler kernel modules.
- Fixed an issue with the validation steps (for the driver and device plugin) taking considerable time. Node provisioning times are now improved by 5x.
- The SRO custom resource definition is set up as part of the operator.
- Fixed an issue with the cleanup of driver mount files when deleting the operator from the cluster. This previously required a reboot of the node, which is no longer necessary.
#### Known Limitations
- After the removal of the GPU Operator, a restart of the Docker daemon is required, since the default container runtime was set to the NVIDIA runtime. Run the following command:
```sh
$ sudo systemctl restart docker
```
