Added async example; small fixes
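
The asynchronous pattern this commit adds to `performance.ipynb` can be tried without FPGA hardware. The sketch below is only an emulation under stated assumptions: `fake_accl_copy` is a hypothetical stand-in that just sleeps, and a standard-library `Future` replaces the PyACCL handle, so the wait is `result()` rather than the `wait()` used in the notebook. It illustrates that the elapsed time tracks the longer of the two activities rather than their sum.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_accl_copy(duration_s):
    """Hypothetical stand-in for a non-blocking ACCL call; it only sleeps."""
    time.sleep(duration_s)
    return duration_s


def overlap_computation(executor, call_s, host_s):
    """Return wall-clock seconds for an emulated call overlapped with host work."""
    start = time.perf_counter()
    handle = executor.submit(fake_accl_copy, call_s)  # returns a Future immediately
    time.sleep(host_s)                                # emulated host-side work
    handle.result()                                   # block until the emulated call is done
    return time.perf_counter() - start


with ThreadPoolExecutor(max_workers=1) as executor:
    # Emulated call (0.2 s) outlasts the host work (0.1 s): host work is hidden.
    hidden = overlap_computation(executor, call_s=0.2, host_s=0.1)
    # Emulated call (0.05 s) is shorter than the host work: host work dominates.
    exposed = overlap_computation(executor, call_s=0.05, host_s=0.1)

print(f"call longer than host work:  {hidden:.3f} s")
print(f"call shorter than host work: {exposed:.3f} s")
```

On real hardware the durations depend on buffer size and placement, so the crossover point has to be measured rather than assumed.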

Xilinx · Aug 3, 2022 · df1ddd0
1 parent dbc0fa0
commit df1ddd0

Showing 2 changed files with 38 additions and 6 deletions.
diff --git a/src/pyaccl/notebooks/communicators.ipynb b/src/pyaccl/notebooks/communicators.ipynb
@@ -201,7 +201,7 @@
   },
   "vscode": {
    "interpreter": {
-    "hash": "916dbcbb3f70747c44a77c7bcd40155683ae19c65e1c03b4aa3499c5328201f1"
+    "hash": "e7370f93d1d0cde622a1f8e1c04877d8463912d04d973331ad4851f04de6915a"
    }
   }
  },

diff --git a/src/pyaccl/notebooks/performance.ipynb b/src/pyaccl/notebooks/performance.ipynb
@@ -9,15 +9,16 @@
     "\n",
     "There are several factors influencing the duration of an ACCL API call:\n",
     "* the complexity of a call - a copy will be faster than an all-reduce for example\n",
-    "* the size (in bytes) of communicated buffers\n",
-    "* memory contention between sending and receiving processes. ACCL can be configured in specific ways to minimize this contention, as we will see\n",
+    "* the size (in bytes) of communicated buffers and their location in the memory hierarchy\n",
+    "* memory contention between sending and receiving processes. ACCL can be configured in specific ways to minimize this contention\n",
+    "* use of blocking or non-blocking variants of the API calls\n",
     "* network performance, which in itself might depend on the size of buffers i.e. very small buffers typically lead to low utilization of Ethernet bandwidth\n",
     "\n",
     "Factors which should not influence runtime are:\n",
     "* data type - API calls on buffers of the same byte size should take the same amount of time, even if the buffers themselves differ in datatype and number of elements \n",
     "* use of compression - ACCL is designed to perform compression at network rate\n",
     "\n",
-    "One thing to note here is that every ACCL primitive or collective assumes your source and destination buffers are in host memory, onless otherwise specified with the `from_fpga` and `to_fpga` optional arguments that most API calls take. As such, before the operation is initiated, the source data is moved to the FPGA device memory, and after it completes, the resulting data is moved back to host memory. These copies have a performance overhead which typically depends on the size of copied buffers."
+    "Let's initialize a few ACCL instances and explore two performance-related aspects of the API."
    ]
   },
   {
@@ -68,8 +69,11 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Host vs. FPGA buffers\n",
-    "The location of source data for an ACCL API call, as well as the location of produced data, can be very important to performance. Let's start by profiling the execution of the copy, the simplest primitive. We will measure across a range of buffer sizes. Feel free to change the `timeit` parameters."
+    "## Host vs. FPGA buffers\n",
+    "\n",
+    "Every ACCL primitive or collective assumes your source and destination buffers are in host memory, unless otherwise specified with the `from_fpga` and `to_fpga` optional arguments that most PyACCL calls take. As such, before the operation is initiated, the source data is moved to the FPGA device memory, and after it completes, the resulting data is moved back to host memory. These copies have a performance overhead which typically depends on the size of copied buffers. \n",
+    "\n",
+    "Let's start by profiling the execution of the copy, the simplest primitive. We will measure across a range of buffer sizes. Feel free to change the `timeit` parameters."
    ]
   },
   {
@@ -119,6 +123,34 @@
     "%timeit -r 4 -n 10 accl_instances[0].copy(op0_buf_fp16[0], res_buf_fp16[0], RXBUF_SIZE/2, from_fpga=True, to_fpga=True)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Asynchronous calls\n",
+    "Some PyACCL calls take the optional `run_async` argument. If it is set to `True`, the function call immediately returns a handle to a Python future object, which can be waited on to determine whether the processing has actually finished. This enables the program to continue processing on the host while the ACCL call is being executed in the FPGA.\n",
+    "\n",
+    "We can experiment with this feature by emulating host-side work with calls to `time.sleep()`. As long as the call to ACCL takes longer than the call to `sleep()`, the sleep will be completely hidden behind the ACCL call."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import time\n",
+    "\n",
+    "def overlap_computation(count):\n",
+    "    handle = accl_instances[0].copy(op0_buf[0], res_buf[0], count, from_fpga=True, to_fpga=True, run_async=True)\n",
+    "    time.sleep(0.1)\n",
+    "    handle.wait()\n",
+    "\n",
+    "%timeit -r 4 -n 10 overlap_computation(1)\n",
+    "%timeit -r 4 -n 10 overlap_computation(1024/4)\n",
+    "%timeit -r 4 -n 10 overlap_computation(RXBUF_SIZE/4)"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},