Merge pull request #101 from NVIDIA/dev2.8
Dev2.8
Showing 13 changed files with 688 additions and 192 deletions.
@@ -0,0 +1,7 @@
FROM nvcr.io/nvidia/pytorch:21.02-py3

RUN pip install pytorch-lightning==1.2.2
RUN pip install torchmetrics

RUN git clone https://github.com/PyTorchLightning/pytorch-lightning.git
COPY test.sh /
@@ -0,0 +1,9 @@
# Image Classification Speed Test

This example is based on a
[PyTorch Lightning](https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pl_examples/domain_templates/computer_vision_fine_tuning.py)
image classification with transfer learning example. The provided `run.sh` script builds a Docker container for you and runs the test. You can edit the
script to specify the batch size, the number of epochs, the number of GPUs (set to 0 for a CPU test), as well as the number of cores / workers.

On a GPU, this test takes just a few minutes to run, depending on the model. It will likely take quite a bit longer on a CPU.
@@ -0,0 +1,30 @@
#!/bin/sh

# the name for the transient docker image
IMG=foo

# use 128 for 16G cards
# batch size 16 takes less than 5 GB of GPU memory
BATCH_SIZE=128
RUN_EPOCHS=15
# set to 0 for a CPU-only test
# on a multi-GPU machine, setting this to > 1 works,
# but the test has overhead, so the results are not representative
GPUS=1

# this setting should likely match the number of cores in the system;
# it is the number of workers used by the dataloader
WORKERS=16

echo `date` building docker image
docker build -t ${IMG} -f Dockerfile .

echo `date` launching...

# the idea is to do two runs and time only the second:
# the first (shorter, 1-epoch) run downloads and prepares / caches the data, so we don't time it
docker run --rm --ipc=host ${IMG} /test.sh ${BATCH_SIZE} ${RUN_EPOCHS} ${GPUS} ${WORKERS}

# docker rmi ${IMG}
echo `date` all done
@@ -0,0 +1,20 @@
#!/bin/bash

if [ $# -lt 4 ]
then
    echo "use: test.sh <batch_size> <epochs> <gpus> <num_workers>"
    exit 1
fi

BATCH="$1"
EPOCHS="$2"
GPUS="$3"
WORKERS="$4"

echo "warmup run starting"
python /workspace/pytorch-lightning/pl_examples/domain_templates/computer_vision_fine_tuning.py --epochs 1 --batch-size ${BATCH} --gpus ${GPUS} --num_workers ${WORKERS}

echo "running timed test"
time python /workspace/pytorch-lightning/pl_examples/domain_templates/computer_vision_fine_tuning.py --epochs ${EPOCHS} --batch-size ${BATCH} --gpus ${GPUS} --num_workers ${WORKERS}
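The warmup-then-timed pattern that `test.sh` implements (an untimed first run to download and cache the data, then a timed run) can be sketched generically in Python. The `run_benchmark` helper and the toy workload below are illustrative stand-ins, not part of the repository; in the real test the workload is the Lightning fine-tuning script.

```python
import time

def run_benchmark(workload, warmup_runs=1, timed_runs=1):
    """Time a workload after discarding warmup runs that populate caches."""
    for _ in range(warmup_runs):
        workload()  # untimed: fetches / caches data, warms up the pipeline
    start = time.perf_counter()
    for _ in range(timed_runs):
        workload()
    return (time.perf_counter() - start) / timed_runs

# stand-in workload for illustration only
elapsed = run_benchmark(lambda: sum(i * i for i in range(100_000)))
print(f"mean timed run: {elapsed:.4f}s")
```

Timing only the second run keeps one-time costs (dataset download, data preparation) out of the measurement, which is why the results are comparable across machines with different network speeds.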
@@ -0,0 +1,10 @@
# kNN Speed Test

This example is based on Chris Deotte's [Kaggle notebook](https://www.kaggle.com/cdeotte/rapids-gpu-knn-mnist-0-97/notebook), in which a GPU-accelerated kNN classifier
is used in the Kaggle MNIST competition. The GPU / CPU speedup will depend on your hardware, but we routinely see 100x+ performance improvements. On a GPU,
the 100x inference (cell 13) takes less than a minute to run. To run the same test on a CPU, select the number of cores (cell 14) and then run the test in cell 15.

These massive speedups are a game changer for rapid experimentation, model architecture selection, and hyperparameter optimization.

To run this example, simply launch the data science stack container or conda environment and run the notebook. Alternatively, you can use one of the RAPIDS containers.
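For readers unfamiliar with the algorithm being accelerated, here is a minimal CPU-only sketch of brute-force kNN classification in plain NumPy. This is an illustration, not the cuML implementation the notebook uses; RAPIDS cuML exposes the same algorithm on the GPU through a scikit-learn-style `KNeighborsClassifier` API.

```python
import numpy as np

def knn_predict(train_x, train_y, test_x, k=3):
    """Brute-force kNN: Euclidean distances, majority vote over the k nearest."""
    # squared distances between every test and train sample, shape (n_test, n_train)
    d2 = ((test_x[:, None, :] - train_x[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]   # indices of the k nearest neighbours
    votes = train_y[nearest]                  # their class labels
    # majority vote per test sample
    return np.array([np.bincount(row).argmax() for row in votes])

# toy two-class problem: points near 0 are class 0, points near 5 are class 1
train_x = np.array([[0.0], [0.5], [1.0], [5.0], [5.5], [6.0]])
train_y = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(train_x, train_y, np.array([[0.2], [5.8]])))  # → [0 1]
```

The pairwise distance matrix is what makes brute-force kNN expensive (it grows with n_test × n_train), and also what makes it so GPU-friendly: the whole computation is dense, parallel linear algebra.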
313 changes: 313 additions & 0 deletions — benchmarks/rapids/knn/DigitRecognizer/rapids-gpu-knn-mnist-0-97.ipynb
Large diffs are not rendered by default.
Binary files not shown.