Merge branch 'v21.9-integration-merge-master' into 'v21.9-integration'

zehuanw2 · zehuanw2 · commit bdaa975f04a1 · 2021-09-12T22:56:37.000-07:00
Merge origin/master

See merge request dl/hugectr/hugectr!453
diff --git a/sparse_operation_kit/documents/tutorials/DenseDemo/benchmark.ipynb b/sparse_operation_kit/documents/tutorials/DenseDemo/benchmark.ipynb
@@ -0,0 +1,312 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cdfec37b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Copyright 2021 NVIDIA Corporation. All Rights Reserved.\n",
+    "#\n",
+    "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
+    "# you may not use this file except in compliance with the License.\n",
+    "# You may obtain a copy of the License at\n",
+    "#\n",
+    "#     http://www.apache.org/licenses/LICENSE-2.0\n",
+    "#\n",
+    "# Unless required by applicable law or agreed to in writing, software\n",
+    "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
+    "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
+    "# See the License for the specific language governing permissions and\n",
+    "# limitations under the License.\n",
+    "# =============================================================================="
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a14466a2",
+   "metadata": {},
+   "source": [
+    "<img src=\"http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png\" style=\"width: 90px; float: right;\">\n",
+    "\n",
+    "# TensorFlow Embedding Plugin Benchmark\n",
+    "\n",
+    "In this notebook, we will benchmark the performance of the Merlin Sparse Operation Kit (SOK) TensorFlow embedding plugin. We will compare it with an equivalent TensorFlow implementation.\n",
+    "\n",
+    "## Requirement\n",
+    "\n",
+    "This notebook is designed to run with the Merlin Tensorflow docker image nvcr.io/nvidia/merlin/merlin-tensorflow-training:0.6, which can be obtained from the NVIDIA GPU cloud [Merlin page](https://ngc.nvidia.com/catalog/containers/nvidia:merlin:merlin-tensorflow-training).\n",
+    "\n",
+    "```\n",
+    "git clone https://github.com/NVIDIA/HugeCTR\n",
+    "cd HugeCTR\n",
+    "docker run --rm -it --net=host --gpus=all -v $PWD:/workspace nvcr.io/nvidia/merlin/merlin-tensorflow-training:0.6 bash\n",
+    "```\n",
+    "\n",
+    "Then from within the container, start the Jupyter notebook server with:\n",
+    "\n",
+    "```\n",
+    "jupyter notebook --ip 0.0.0.0 --allow-root\n",
+    "```\n",
+    "\n",
+    "## Pre-requisite\n",
+    "\n",
+    "We first make sure TensorFlow v2.5 is installed, then compile SOK with default support for NVIDIA Ampere generation GPUs.\n",
+    "In the sequence below, replace `-DSM=80` with:\n",
+    "- `-DSM=70` for Volta,\n",
+    "- `-DSM=75` for Turing.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "bcdadcf6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!pip install tensorflow-gpu==2.5.0\n",
+    "!rm -r /workspace/sparse_operation_kit/build\n",
+    "!cd /workspace/sparse_operation_kit && mkdir -p build && cd build &&  cmake -DSM=80 .. && make -j && make install\n",
+    "!pip install cupy-cuda114\n",
+    "\n",
+    "import tensorflow\n",
+    "tensorflow.__version__\n",
+    "\n",
+    "import cupy\n",
+    "cupy.__version__"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6ae9582a",
+   "metadata": {},
+   "source": [
+    "## Dataset\n",
+    "\n",
+    "Next, we generate some synthetic dataset for this test."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1a37f99b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "CMD = \"\"\"python3 gen_data.py \\\n",
+    "    --global_batch_size=65536 \\\n",
+    "    --slot_num=100 \\\n",
+    "    --nnz_per_slot=10 \\\n",
+    "    --iter_num=30 \n",
+    "    \"\"\"\n",
+    "!$CMD"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d069dbc8",
+   "metadata": {},
+   "source": [
+    "We will next split the same dataset into 8 parts, which is more optimal for multi-GPU training."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1c0befee",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "CMD = \"\"\"python3 split_data.py \\\n",
+    "    --filename=\"./data.file\" \\\n",
+    "    --split_num=8 \\\n",
+    "    --save_prefix=\"./data_\"\n",
+    "    \"\"\"\n",
+    "!$CMD"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0ec12a8c",
+   "metadata": {},
+   "source": [
+    "## Benchmarking TensorFlow model\n",
+    "\n",
+    "We will first benchmark a TensorFlow model on 1 GPU."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "17b6bb78",
+   "metadata": {
+    "scrolled": false
+   },
+   "outputs": [],
+   "source": [
+    "CMD=\"\"\"python3 run_tf.py \\\n",
+    "    --data_filename=\"./data.file\" \\\n",
+    "    --global_batch_size=65536 \\\n",
+    "    --vocabulary_size=8192 \\\n",
+    "    --slot_num=100 \\\n",
+    "    --nnz_per_slot=10 \\\n",
+    "    --num_dense_layers=6 \\\n",
+    "    --embedding_vec_size=4 \\\n",
+    "    --stop_at_iter=30\n",
+    "    \"\"\"\n",
+    "!$CMD"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e4707a4f",
+   "metadata": {},
+   "source": [
+    "## Benchmarking SOK TensorFlow embedding plugin model\n",
+    "\n",
+    "We will next benchmark an equivalent model, but with the SOK TensorFlow embedding plugin, also on 1 GPU."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "193c1b43",
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [],
+   "source": [
+    "CMD=\"\"\"mpiexec -n 1 --allow-run-as-root \\\n",
+    "    python3 run_sok_MultiWorker_mpi.py \\\n",
+    "    --data_filename=\"./data.file\" \\\n",
+    "    --global_batch_size=65536 \\\n",
+    "    --max_vocabulary_size_per_gpu=8192 \\\n",
+    "    --slot_num=100 \\\n",
+    "    --nnz_per_slot=10 \\\n",
+    "    --num_dense_layers=6 \\\n",
+    "    --embedding_vec_size=4 \\\n",
+    "    --data_splited=0 \\\n",
+    "    --optimizer=\"adam\"\n",
+    "    \"\"\"\n",
+    "!$CMD\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "18edb6c9",
+   "metadata": {},
+   "source": [
+    "## Benchmarking SOK multi-GPU\n",
+    "\n",
+    "We will next benchmark the same model, but with the SOK TensorFlow embedding plugin on multiple GPUs.\n",
+    "\n",
+    "For a DGX Station A100 with 4 GPUs:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3a144b31",
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [],
+   "source": [
+    "CMD=\"\"\"mpiexec -n 4 --allow-run-as-root \\\n",
+    "    python3 run_sok_MultiWorker_mpi.py \\\n",
+    "    --data_filename=\"./data_\" \\\n",
+    "    --global_batch_size=65536 \\\n",
+    "    --max_vocabulary_size_per_gpu=8192 \\\n",
+    "    --slot_num=100 \\\n",
+    "    --nnz_per_slot=10 \\\n",
+    "    --num_dense_layers=6 \\\n",
+    "    --embedding_vec_size=4 \\\n",
+    "    --data_splited=1 \\\n",
+    "    --optimizer=\"adam\"\n",
+    "    \"\"\"\n",
+    "!$CMD"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f19c669e",
+   "metadata": {},
+   "source": [
+    "For the NVIDIA DGX A100 with 8 GPUs:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7652a63b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "CMD=\"\"\"mpiexec -n 8 --allow-run-as-root \\\n",
+    "    python3 run_sok_MultiWorker_mpi.py \\\n",
+    "    --data_filename=\"./data_\" \\\n",
+    "    --global_batch_size=65536 \\\n",
+    "    --max_vocabulary_size_per_gpu=8192 \\\n",
+    "    --slot_num=100 \\\n",
+    "    --nnz_per_slot=10 \\\n",
+    "    --num_dense_layers=6 \\\n",
+    "    --embedding_vec_size=4 \\\n",
+    "    --data_splited=1 \\\n",
+    "    --optimizer=\"adam\"\n",
+    "    --dgx_a100\n",
+    "    \"\"\"\n",
+    "!$CMD"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "19abb355",
+   "metadata": {
+    "scrolled": true
+   },
+   "source": [
+    "## Performance numbers\n",
+    "\n",
+    "On an NVIDIA DGX-Station A100 80GB.\n",
+    "\n",
+    "\n",
+    "| Model\\Averate iteration time                | 1 GPU (ms)  | 4 GPUs (ms) |\n",
+    "|----------------------|--------|--------|\n",
+    "| TensorFlow 2.5       | 1831.1 | N/A      |\n",
+    "| SOK embedding plugin | 233.1  | 77.6 |\n",
+    "\n",
+    "Table 1. Iteration time (ms) on an NVIDIA DGX-Station A100 80GB.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "44dd56d6",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.10"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}