Commit bdaa975

Committed Sep 13, 2021
Merge branch 'v21.9-integration-merge-master' into 'v21.9-integration'
Merge origin/master. See merge request dl/hugectr/hugectr!453.
2 parents: 3206995 + fd234b0

File tree

1 file changed: +312, −0 lines

@@ -0,0 +1,312 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "cdfec37b",
"metadata": {},
"outputs": [],
"source": [
"# Copyright 2021 NVIDIA Corporation. All Rights Reserved.\n",
"#\n",
"# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"#\n",
"# http://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License.\n",
"# =============================================================================="
]
},
{
"cell_type": "markdown",
"id": "a14466a2",
"metadata": {},
"source": [
"<img src=\"http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png\" style=\"width: 90px; float: right;\">\n",
"\n",
"# TensorFlow Embedding Plugin Benchmark\n",
"\n",
"In this notebook, we benchmark the performance of the Merlin Sparse Operation Kit (SOK) TensorFlow embedding plugin and compare it with an equivalent native TensorFlow implementation.\n",
"\n",
"## Requirements\n",
"\n",
"This notebook is designed to run with the Merlin TensorFlow Docker image nvcr.io/nvidia/merlin/merlin-tensorflow-training:0.6, which can be obtained from the NVIDIA GPU Cloud [Merlin page](https://ngc.nvidia.com/catalog/containers/nvidia:merlin:merlin-tensorflow-training).\n",
"\n",
"```\n",
"git clone https://github.com/NVIDIA/HugeCTR\n",
"cd HugeCTR\n",
"docker run --rm -it --net=host --gpus=all -v $PWD:/workspace nvcr.io/nvidia/merlin/merlin-tensorflow-training:0.6 bash\n",
"```\n",
"\n",
"Then, from within the container, start the Jupyter notebook server with:\n",
"\n",
"```\n",
"jupyter notebook --ip 0.0.0.0 --allow-root\n",
"```\n",
"\n",
"## Prerequisites\n",
"\n",
"We first make sure TensorFlow v2.5 is installed, then compile SOK with default support for NVIDIA Ampere-generation GPUs.\n",
"In the sequence below, replace `-DSM=80` with:\n",
"- `-DSM=70` for Volta,\n",
"- `-DSM=75` for Turing.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bcdadcf6",
"metadata": {},
"outputs": [],
"source": [
"!pip install tensorflow-gpu==2.5.0\n",
"!rm -rf /workspace/sparse_operation_kit/build\n",
"!cd /workspace/sparse_operation_kit && mkdir -p build && cd build && cmake -DSM=80 .. && make -j && make install\n",
"!pip install cupy-cuda114\n",
"\n",
"import tensorflow\n",
"print(tensorflow.__version__)\n",
"\n",
"import cupy\n",
"print(cupy.__version__)"
]
},
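{
"cell_type": "markdown",
"id": "sok-check-md",
"metadata": {},
"source": [
"As an optional sanity check (an addition, not part of the original benchmark), we can confirm the build above produced an importable package before running anything else. The import name `sparse_operation_kit` is assumed here from the package's conventional usage.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sok-check-code",
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check (assumed import name): fail fast if the SOK\n",
"# build or install step above did not complete.\n",
"import sparse_operation_kit as sok\n",
"print(sok.__file__)"
]
},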
{
"cell_type": "markdown",
"id": "6ae9582a",
"metadata": {},
"source": [
"## Dataset\n",
"\n",
"Next, we generate a synthetic dataset for this test."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1a37f99b",
"metadata": {},
"outputs": [],
"source": [
"CMD = \"\"\"python3 gen_data.py \\\n",
"    --global_batch_size=65536 \\\n",
"    --slot_num=100 \\\n",
"    --nnz_per_slot=10 \\\n",
"    --iter_num=30\n",
"    \"\"\"\n",
"!$CMD"
]
},
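{
"cell_type": "markdown",
"id": "synthetic-shape-md",
"metadata": {},
"source": [
"To make the parameters above concrete, the sketch below builds one random batch with the same shape, `global_batch_size x slot_num x nnz_per_slot`. This is an illustration only; it does not reproduce the actual on-disk format written by `gen_data.py`.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "synthetic-shape-code",
"metadata": {},
"outputs": [],
"source": [
"# Illustrative only: the shape of one batch of sparse inputs, matching the\n",
"# gen_data.py parameters above (not the script's real file format).\n",
"import numpy as np\n",
"\n",
"global_batch_size, slot_num, nnz_per_slot = 65536, 100, 10\n",
"keys = np.random.randint(0, 8192, size=(global_batch_size, slot_num, nnz_per_slot), dtype=np.int64)\n",
"labels = np.random.randint(0, 2, size=(global_batch_size, 1)).astype(np.float32)\n",
"print(keys.shape, labels.shape)"
]
},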
{
"cell_type": "markdown",
"id": "d069dbc8",
"metadata": {},
"source": [
"Next, we split the same dataset into 8 parts, which is better suited to multi-GPU training."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1c0befee",
"metadata": {},
"outputs": [],
"source": [
"CMD = \"\"\"python3 split_data.py \\\n",
"    --filename=\"./data.file\" \\\n",
"    --split_num=8 \\\n",
"    --save_prefix=\"./data_\"\n",
"    \"\"\"\n",
"!$CMD"
]
},
{
"cell_type": "markdown",
"id": "0ec12a8c",
"metadata": {},
"source": [
"## Benchmarking the TensorFlow model\n",
"\n",
"We will first benchmark a native TensorFlow model on 1 GPU."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "17b6bb78",
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"CMD = \"\"\"python3 run_tf.py \\\n",
"    --data_filename=\"./data.file\" \\\n",
"    --global_batch_size=65536 \\\n",
"    --vocabulary_size=8192 \\\n",
"    --slot_num=100 \\\n",
"    --nnz_per_slot=10 \\\n",
"    --num_dense_layers=6 \\\n",
"    --embedding_vec_size=4 \\\n",
"    --stop_at_iter=30\n",
"    \"\"\"\n",
"!$CMD"
]
},
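{
"cell_type": "markdown",
"id": "tf-lookup-md",
"metadata": {},
"source": [
"For intuition, the core operation being timed on the TensorFlow side can be sketched as a plain embedding lookup followed by per-slot pooling. This is a minimal sketch under the parameters above (with a smaller batch), not the code in `run_tf.py`.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "tf-lookup-code",
"metadata": {},
"outputs": [],
"source": [
"# Minimal sketch (not run_tf.py): dense embedding lookup + per-slot sum pooling.\n",
"import tensorflow as tf\n",
"\n",
"vocabulary_size, embedding_vec_size = 8192, 4\n",
"batch, slot_num, nnz_per_slot = 1024, 100, 10\n",
"table = tf.Variable(tf.random.normal([vocabulary_size, embedding_vec_size]))\n",
"indices = tf.random.uniform([batch, slot_num, nnz_per_slot], maxval=vocabulary_size, dtype=tf.int64)\n",
"vectors = tf.nn.embedding_lookup(table, indices)  # [batch, slot_num, nnz_per_slot, vec]\n",
"pooled = tf.reduce_sum(vectors, axis=2)           # [batch, slot_num, vec]\n",
"print(pooled.shape)"
]
},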
{
"cell_type": "markdown",
"id": "e4707a4f",
"metadata": {},
"source": [
"## Benchmarking SOK TensorFlow embedding plugin model\n",
"\n",
"We will next benchmark an equivalent model, but with the SOK TensorFlow embedding plugin, also on 1 GPU."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "193c1b43",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"CMD = \"\"\"mpiexec -n 1 --allow-run-as-root \\\n",
"    python3 run_sok_MultiWorker_mpi.py \\\n",
"    --data_filename=\"./data.file\" \\\n",
"    --global_batch_size=65536 \\\n",
"    --max_vocabulary_size_per_gpu=8192 \\\n",
"    --slot_num=100 \\\n",
"    --nnz_per_slot=10 \\\n",
"    --num_dense_layers=6 \\\n",
"    --embedding_vec_size=4 \\\n",
"    --data_splited=0 \\\n",
"    --optimizer=\"adam\"\n",
"    \"\"\"\n",
"!$CMD\n"
]
},
{
"cell_type": "markdown",
"id": "18edb6c9",
"metadata": {},
"source": [
"## Benchmarking SOK multi-GPU\n",
"\n",
"We will next benchmark the same model, but with the SOK TensorFlow embedding plugin on multiple GPUs.\n",
"\n",
"For a DGX Station A100 with 4 GPUs:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3a144b31",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"CMD = \"\"\"mpiexec -n 4 --allow-run-as-root \\\n",
"    python3 run_sok_MultiWorker_mpi.py \\\n",
"    --data_filename=\"./data_\" \\\n",
"    --global_batch_size=65536 \\\n",
"    --max_vocabulary_size_per_gpu=8192 \\\n",
"    --slot_num=100 \\\n",
"    --nnz_per_slot=10 \\\n",
"    --num_dense_layers=6 \\\n",
"    --embedding_vec_size=4 \\\n",
"    --data_splited=1 \\\n",
"    --optimizer=\"adam\"\n",
"    \"\"\"\n",
"!$CMD"
]
},
{
"cell_type": "markdown",
"id": "f19c669e",
"metadata": {},
"source": [
"For the NVIDIA DGX A100 with 8 GPUs:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7652a63b",
"metadata": {},
"outputs": [],
"source": [
"CMD = \"\"\"mpiexec -n 8 --allow-run-as-root \\\n",
"    python3 run_sok_MultiWorker_mpi.py \\\n",
"    --data_filename=\"./data_\" \\\n",
"    --global_batch_size=65536 \\\n",
"    --max_vocabulary_size_per_gpu=8192 \\\n",
"    --slot_num=100 \\\n",
"    --nnz_per_slot=10 \\\n",
"    --num_dense_layers=6 \\\n",
"    --embedding_vec_size=4 \\\n",
"    --data_splited=1 \\\n",
"    --optimizer=\"adam\" \\\n",
"    --dgx_a100\n",
"    \"\"\"\n",
"!$CMD"
]
},
{
"cell_type": "markdown",
"id": "19abb355",
"metadata": {
"scrolled": true
},
"source": [
"## Performance numbers\n",
"\n",
"| Model | 1 GPU (ms) | 4 GPUs (ms) |\n",
"|----------------------|--------|--------|\n",
"| TensorFlow 2.5 | 1831.1 | N/A |\n",
"| SOK embedding plugin | 233.1 | 77.6 |\n",
"\n",
"Table 1. Average iteration time (ms) on an NVIDIA DGX Station A100 80GB.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "44dd56d6",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
