Merge pull request #101 from NVIDIA/dev2.8
Dev2.8
bmwshop authored Mar 31, 2021
2 parents 3eb6f24 + 9f82e23 commit 4d2edc0
Showing 13 changed files with 688 additions and 192 deletions.
27 changes: 17 additions & 10 deletions README.md
@@ -32,18 +32,18 @@ _For usage and command documentation: `./data-science-stack help` at any time._
_Note: The script is designed to run as the user, and ask for sudo password
when needed. Do not run it with `sudo ...`

On Ubuntu 18.04 or 20.04:
On Ubuntu 18.04, 20.04, or Red Hat Enterprise Linux (RHEL) 8.x:

```bash
git clone github.com/NVIDIA/data-science-stack
git clone https://github.com/NVIDIA/data-science-stack
cd data-science-stack
./data-science-stack setup-system
```

On Red Hat Enterprise Linux (RHEL) Workstation 7.x or 8.x:
On RHEL Workstation 7.x:

```bash
git clone github.com/NVIDIA/data-science-stack
git clone https://github.com/NVIDIA/data-science-stack
cd data-science-stack
./data-science-stack setup-system
# script will stop, manually install driver ... (instructions below)
@@ -140,6 +140,13 @@ From the command line in your environment, or inside the container, the
since the notebooks can depend on functions only available when using
Jupyter's web UI.
### Local Tools
Version 2.7.0 introduced the `install-tools` command (paired with `purge-tools`), which extends the functionality of the stack. Currently, the list includes:
* [jupyter-repo2docker](https://github.com/jupyterhub/repo2docker) Point it to a GitHub repository and it will build a Docker container and launch a Jupyter notebook inside it.
* [NVIDIA GPU Cloud CLI](https://ngc.nvidia.com) This is perhaps the easiest way to interact with NVIDIA assets.
* [Kaggle CLI](https://github.com/Kaggle/kaggle-api) Allows users to sync and manage Kaggle kernels, datasets, etc. locally.
* [AWS CLI](https://github.com/aws/aws-cli) Allows users to remotely manage resources in AWS. The stack supports it via Docker, so make sure you have Docker installed.
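
Because the AWS CLI entry above depends on Docker, a quick pre-flight check before running `install-tools` might look like the sketch below. The helper name `have_cmd` is hypothetical and not part of the stack; it only relies on the POSIX `command -v` builtin.

```bash
# have_cmd is a hypothetical helper, not part of data-science-stack.
# It succeeds (exit status 0) when the named command is found on the PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd docker; then
    echo "docker found"
else
    echo "docker not found; install it before relying on the AWS CLI tool"
fi
```

If the check passes, `./data-science-stack install-tools` can be run as described above, and `purge-tools` removes the tools again.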
### Creating Custom Stacks
Creating custom environments is covered in the
@@ -196,11 +203,11 @@ Then, create an Ubuntu or RHEL VM, open a terminal, and follow OS-specific inst
## Installing the NVIDIA GPU Driver
It is important that updated NVIDIA drivers are installed on the system.
The minimum version of the NVIDIA driver supported is 455.23.04.
The minimum version of the NVIDIA driver supported is 460.39.
More recent drivers may be available, but may not have been tested with the
data science stacks.
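
One way to compare an installed driver version against the 460.39 minimum is sketched below. The helper `version_ge` is hypothetical (it is not part of the stack) and assumes `sort -V` from GNU coreutils is available.

```bash
# MIN_DRIVER is the minimum from the text above; version_ge is a hypothetical helper.
MIN_DRIVER="460.39"

# Succeeds when version $1 is greater than or equal to version $2.
# Relies on `sort -V` (version sort) from GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example with a literal version string; substitute the version reported on your system.
if version_ge "460.56" "$MIN_DRIVER"; then
    echo "driver version is supported"
else
    echo "driver version is older than $MIN_DRIVER"
fi
```

On a system with a working driver, the installed version can usually be read with `nvidia-smi --query-gpu=driver_version --format=csv,noheader` and passed as the first argument.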
### Ubuntu Driver Install
### Ubuntu or RHEL v8.x Driver Install
Driver install for Ubuntu is handled by `data-science-stack setup-system`
so no manual install should be required.
@@ -215,7 +222,7 @@ be removed (this may have side effects, read the warnings) and reinstalled:
# reboot
```
### Red Hat Enterprise Linux Workstation (RHEL) Driver Install
### RHEL v7.x Driver Install
Before attempting to install the driver check that the system does not
have `/usr/bin/nvidia-uninstall` which is left by an old driver .run file.
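
The check described above can be scripted along these lines. This is only a sketch: the helper name `has_old_runfile_driver` is hypothetical, and the optional path argument exists purely so the function can be exercised against other paths.

```bash
# has_old_runfile_driver is a hypothetical helper, not part of data-science-stack.
# It succeeds when the uninstaller left behind by an old .run-file driver exists.
has_old_runfile_driver() {
    [ -e "${1:-/usr/bin/nvidia-uninstall}" ]
}

if has_old_runfile_driver; then
    echo "found /usr/bin/nvidia-uninstall; remove the old driver first"
else
    echo "no leftover .run-file uninstaller found"
fi
```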
@@ -297,8 +304,8 @@ Download and install the driver:
```bash
# Check for the latest before using - https://www.nvidia.com/Download/index.aspx
wget http://us.download.nvidia.com/XFree86/Linux-x86_64/455.23.04/NVIDIA-Linux-x86_64-455.23.04.run
sudo sh ./NVIDIA-Linux-x86_64-455.23.04.run
wget https://us.download.nvidia.com/XFree86/Linux-x86_64/460.56/NVIDIA-Linux-x86_64-460.56.run
sudo sh ./NVIDIA-Linux-x86_64-460.56.run
```
> **Note**: In some cases the following prompts will occur:
@@ -559,7 +566,7 @@ script will let you know how to remove the old driver.
### How much disk space is needed?
About 30GB free should be enough. A lot of space is needed during
About 50GB free should be enough. A lot of space is needed during
environment/container creation since Conda has a package cache.
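
A rough way to check for 50GB of free space before starting is sketched below. The helper `enough_space_kb` is hypothetical, and checking the filesystem that holds `$HOME` is an assumption; Conda or Docker may store data elsewhere on your system.

```bash
# enough_space_kb is a hypothetical helper: $1 = available KiB, $2 = required GiB.
enough_space_kb() {
    [ "$1" -ge $(( $2 * 1024 * 1024 )) ]
}

# df -Pk prints POSIX-format output in 1024-byte blocks; column 4 is "Available".
# Checking $HOME is an assumption about where the package cache lives.
avail_kb=$(df -Pk "${HOME:-.}" | awk 'NR == 2 { print $4 }')

if enough_space_kb "$avail_kb" 50; then
    echo "at least 50GB free"
else
    echo "less than 50GB free"
fi
```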
### The script is failing after it cannot reach URLs or download files
7 changes: 7 additions & 0 deletions benchmarks/image_classification/pl/Dockerfile
@@ -0,0 +1,7 @@
FROM nvcr.io/nvidia/pytorch:21.02-py3

RUN pip install pytorch-lightning==1.2.2
RUN pip install torchmetrics

RUN git clone https://github.com/PyTorchLightning/pytorch-lightning.git
COPY test.sh /
9 changes: 9 additions & 0 deletions benchmarks/image_classification/pl/Readme.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
# Image Classification Speed Test

This example is based on the
[PyTorch Lightning](https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pl_examples/domain_templates/computer_vision_fine_tuning.py)
example code for image classification with transfer learning. The provided `run.sh` script builds a Docker container and runs the test. You can edit the
script to set the batch size, the number of epochs, the number of GPUs (set to 0 for a CPU test), as well as the number of cores / workers.

On a GPU, this test takes just a
few minutes to run, depending on the model. It will likely take quite a bit longer on CPU.
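
When choosing the number of workers, one option is to derive it from the CPU count instead of hard-coding it. The snippet below is only a sketch and not what `run.sh` does; it assumes `getconf _NPROCESSORS_ONLN` is available and falls back to 1 otherwise.

```bash
# Hypothetical sketch: derive the dataloader worker count from the online CPU count.
# getconf _NPROCESSORS_ONLN is widely available on Linux; fall back to 1 if it fails.
WORKERS=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
echo "WORKERS=${WORKERS}"
```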
30 changes: 30 additions & 0 deletions benchmarks/image_classification/pl/run.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
#!/bin/sh

# the name for the transient docker image
IMG=foo


# use 128 for 16G cards
# batch size 16 takes less than 5 GB of GPU mem
BATCH_SIZE=128
RUN_EPOCHS=15
# set to 0 for a CPU only test
# on a multi-gpu machine, setting to > 1 works
# but the test has overhead so the results are not representative
GPUS=1

# this setting should likely match the number of cores in the system
# this is the number of cores to use for the dataloader
WORKERS=16

echo `date` building docker image
docker build -t ${IMG} -f Dockerfile .

echo `date` launching...

# the idea is to do two runs and time only the second:
# the first (shorter, 1 epoch) run downloads and prepares / caches the data,
# and we don't care about timing that
docker run --rm --ipc=host ${IMG} /test.sh ${BATCH_SIZE} ${RUN_EPOCHS} ${GPUS} ${WORKERS}

# docker rmi ${IMG}
echo `date` all done
20 changes: 20 additions & 0 deletions benchmarks/image_classification/pl/test.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,20 @@
#!/bin/bash


if [ $# -lt 4 ]
then
echo "use: test.sh <batch_size> <epochs> <gpus> <num_workers>"
exit 1
fi

BATCH="$1"
EPOCHS="$2"
GPUS="$3"
WORKERS="$4"

echo "warmup run starting"
python /workspace/pytorch-lightning/pl_examples/domain_templates/computer_vision_fine_tuning.py --epochs 1 --batch-size ${BATCH} --gpus ${GPUS} --num_workers ${WORKERS}

echo "running timed test"
time python /workspace/pytorch-lightning/pl_examples/domain_templates/computer_vision_fine_tuning.py --epochs ${EPOCHS} --batch-size ${BATCH} --gpus ${GPUS} --num_workers ${WORKERS}

10 changes: 10 additions & 0 deletions benchmarks/rapids/knn/DigitRecognizer/Readme.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
# kNN Speed Test

This example is based on Chris Deotte's [Kaggle notebook](https://www.kaggle.com/cdeotte/rapids-gpu-knn-mnist-0-97/notebook), where a GPU-accelerated kNN classifier
is used in the Kaggle MNIST competition. The GPU / CPU speedup will depend on your hardware, but we routinely see 100+x performance improvements. On a GPU,
the 100x inference (cell 13) takes less than a minute to run. To run the same test on CPU, select the number of cores (cell 14) and then run the test in cell 15.

These massive speedups are a game changer when it comes to rapid experimentation, model architecture selection, and hyperparameter optimization.

To run this example, launch the data science stack container or conda environment and run the notebook. Alternatively, you could use one of the RAPIDS containers.

313 changes: 313 additions & 0 deletions benchmarks/rapids/knn/DigitRecognizer/rapids-gpu-knn-mnist-0-97.ipynb

Large diffs are not rendered by default.

Binary file not shown.
Binary file not shown.