Added compression notebook

Xilinx · Jul 26, 2022 · 6e27fd2
1 parent 73e62cb
commit 6e27fd2
Showing 4 changed files with 191 additions and 2 deletions.
diff --git a/src/pyaccl/accl.py b/src/pyaccl/accl.py
@@ -126,6 +126,17 @@ def __init__(self, uncompressed_elem_bytes, compressed_elem_bytes, elem_ratio_lo
         #address where stored in exchange memory
         self.exchmem_addr = None
 
+    def __str__(self):
+        description = f'Arithmetic Config at address {self.exchmem_addr}\n'
+        description += f'Uncompressed dtype B/element: {self.uncompressed_elem_bytes}\n'
+        description += f'Compressed dtype B/element: {self.compressed_elem_bytes}\n'
+        description += f'Ratio of number of compressed to uncompressed elements: {2**self.elem_ratio_log}\n'
+        description += f'Perform arithmetic on compressed dtype: {"True" if self.arith_is_compressed else "False"}\n'
+        description += f'Compressor ID: {self.compressor_tdest}\n'
+        description += f'Decompressor ID: {self.decompressor_tdest}\n'
+        description += f'Reduction function IDs: {self.arith_tdest}\n'
+        return description
+
     @property
     def addr(self):
         assert self.exchmem_addr is not None
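
For illustration, the sketch below shows roughly what the new `__str__` renders for an FP32/FP16 configuration. `ArithConfigSketch` is a hypothetical stand-in, not the pyaccl class: its field names are taken from the hunk above, while the plugin IDs (0, 1 and [9]) are made-up placeholder values.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ArithConfigSketch:
    """Hypothetical stand-in mirroring the fields printed by __str__ in the diff."""
    uncompressed_elem_bytes: int
    compressed_elem_bytes: int
    elem_ratio_log: int
    compressor_tdest: int
    decompressor_tdest: int
    arith_is_compressed: bool
    arith_tdest: List[int] = field(default_factory=list)
    exchmem_addr: Optional[int] = None  # unset until stored in exchange memory

    def __str__(self) -> str:
        lines = [
            f"Arithmetic Config at address {self.exchmem_addr}",
            f"Uncompressed dtype B/element: {self.uncompressed_elem_bytes}",
            f"Compressed dtype B/element: {self.compressed_elem_bytes}",
            "Ratio of number of compressed to uncompressed elements: "
            f"{2 ** self.elem_ratio_log}",
            f"Perform arithmetic on compressed dtype: {bool(self.arith_is_compressed)}",
            f"Compressor ID: {self.compressor_tdest}",
            f"Decompressor ID: {self.decompressor_tdest}",
            f"Reduction function IDs: {self.arith_tdest}",
        ]
        return "\n".join(lines) + "\n"


# FP32 -> FP16: 4 and 2 bytes per element, elementwise conversion (ratio 2**0 = 1),
# arithmetic on the compressed dtype. All three IDs below are placeholders.
fp32_fp16 = ArithConfigSketch(4, 2, 0, compressor_tdest=0, decompressor_tdest=1,
                              arith_is_compressed=True, arith_tdest=[9])
rendered = str(fp32_fp16)
print(rendered)
```

Storing the ratio as a base-2 logarithm means an elementwise conversion is encoded as `elem_ratio_log = 0`, which the method prints as a ratio of 1.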

diff --git a/src/pyaccl/notebooks/collectives.ipynb b/src/pyaccl/notebooks/collectives.ipynb
@@ -4,7 +4,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# ACCL Collectives (Emulator/Simulator)\n",
+    "# ACCL Collectives\n",
     "In a system of more than one ACCL-enabled FPGAs, we can execute MPI-like collectives (scatter, gather, broadcast, reductions, etc). This notebook illustrates how to initialize the ACCL instances and run collectives. Usually, each ACCL instance runs in a separate process on a distinct compute node in a network, but for purposes of demonstration, we utilize multithreading in a single process to create and operate multiple ACCL instances"
    ]
   },

diff --git a/src/pyaccl/notebooks/compression.ipynb b/src/pyaccl/notebooks/compression.ipynb
@@ -0,0 +1,178 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# ACCL Compression Support\n",
+    "In general, ACCL is datatype-agnostic as most of its function involves data movement without any actual interaction with the data values themselves. However, there are two exceptions to this rule:\n",
+    "* Elementwise operations (e.g. SUM) performed by ACCL instances on buffers during reduction-type collectives \n",
+    "* Datatype conversions when the source and destination of a transfer are of different data types. In this scenario we say that the lower-precision buffer is compressed.\n",
+    "\n",
+    "To support these elementwise operations and conversions, ACCL must be configured with a reduction plugin and conversion plugins respectively. Each of these plugins is a free-running Vitis kernel. Reduction plugins take two operand AXI Streams and produce one result AXI Stream, and may implement multiple functions internally, selected by an operation ID provided as side-band to the operands on the TDEST signal of AXI Stream. Conversion plugins take one operand AXI Stream as input and produce a result AXI Stream by applying an arbitrary conversion function specified by a function ID on the operand's TDEST. \n",
+    "\n",
+    "Example reduction and conversion plugins are provided in the ACCL repo. The example reduction plugin supports five data types: FP16/32/64 and INT32/64. The example compression plugin converts between floating-point single-precision (FP32) and half-precision (FP16). Together, these plugins enable six datatype configurations - let's see what they are:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from pyaccl.accl import ACCL_DEFAULT_ARITH_CONFIG\n",
+    "\n",
+    "for key in ACCL_DEFAULT_ARITH_CONFIG:\n",
+    "    print(f\"Uncompressed dtype: {key[0]}\\nCompressed dtype: {key[1]}\\n{str(ACCL_DEFAULT_ARITH_CONFIG[key])}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Five of these configurations are homogeneous, i.e. operate on buffers of identical data types. One is heterogeneous and can operate on combinations of FP32 and FP16 buffers, e.g. source buffers of a primitive can be FP32 and results FP16 or vice-versa, by utilizing the conversion plugin. \n",
+    "\n",
+    "The key points of the ACCL datatype configuration are:\n",
+    "* bytes per element for the compressed and uncompressed datatype. In the case of homogeneous configurations, these are the same datatype.\n",
+    "* ratio of compressed elements to uncompressed elements, i.e. how many uncompressed buffer elements are consumed in the conversion process to produce one compressed element. For elementwise conversion e.g. FP32 to FP16, this ratio is 1. For block floating point formats, this ratio could be higher.\n",
+    "* whether arithmetic should be performed on the compressed data - for higher throughput - or uncompressed data - for higher precision. ACCL determines the order of conversions required to meet this specification for each primitive and collective. \n",
+    "* function IDs to be provided to the plugins when performing compression, decompression, and reduction.\n",
+    "\n",
+    "Notice that in the ACCL default FP32/FP16 compression configuration, arithmetic is performed on the lower-precision FP16 datatype. Let's initialize two ACCL instances and see how the FP16 compression feature works."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "RUN_ON_HARDWARE = True\n",
+    "XCLBIN = \"axis3x.xclbin\"\n",
+    "\n",
+    "from pyaccl import accl\n",
+    "\n",
+    "if RUN_ON_HARDWARE:\n",
+    "    accl0 = accl(2, 0, xclbin=XCLBIN, cclo_idx=0)\n",
+    "    accl1 = accl(2, 1, xclbin=XCLBIN, cclo_idx=1)\n",
+    "else:\n",
+    "    accl0 = accl(2, 0, sim_mode=True)\n",
+    "    accl1 = accl(2, 1, sim_mode=True)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Operating on buffers of different data types\n",
+    "\n",
+    "Let's do a reduction between two NumPy FP32 buffers, with the result stored in an FP16 buffer. First we'll allocate these buffers using the optional `dtype` argument to `allocate()`, fill them with high-precision data, then perform the local reduction. Since ACCL by default performs arithmetic in FP16 in this mixed-precision scenario, the sum-combine is equivalent to the following sequence of operations:\n",
+    "1. convert `op0` and `op1` to FP16\n",
+    "2. perform the sum in FP16\n",
+    "3. store the result in `res`"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "from pyaccl import ACCLReduceFunctions\n",
+    "\n",
+    "op0 = accl0.allocate((10,), dtype=np.float32)\n",
+    "op0[:] = [np.pi*i for i in range(10)]\n",
+    "op1 = accl0.allocate((10,), dtype=np.float32)\n",
+    "op1[:] = [1.1*i for i in range(10)]\n",
+    "res = accl0.allocate((10,), dtype=np.float16)\n",
+    "\n",
+    "accl0.combine(10, ACCLReduceFunctions.SUM, op0, op1, res)\n",
+    "\n",
+    "print(op0+op1)\n",
+    "print((op0+op1).astype(np.float16))\n",
+    "print((op0.astype(np.float16)+op1.astype(np.float16)).astype(np.float16))\n",
+    "print(res)\n",
+    "np.sum(np.abs(np.subtract(op0+op1, res)))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Notice how the result differs slightly depending on whether we perform the sum in FP32 or FP16. The ACCL result is also slightly different from the NumPy result due to differences in the underlying floating point ALUs on the FPGA and CPU respectively."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Compressing data over the wire\n",
+    "\n",
+    "In addition to local conversions, users can specify FP16 compression for traffic across the backend (typically Ethernet) link between ACCL instances even when all buffers are FP32. This feature reduces network traffic and latency, but as expected, there is a loss of precision of data during transport. Let's compress data for a simple send-receive pair. We need to utilize the `compress_dtype` optional argument for both `send()` and `recv()`. Please note that the compression settings must match for the receive operation to identify the received buffer."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "src = accl0.allocate((10,))\n",
+    "dst = accl1.allocate((10,))\n",
+    "src[:] = [1.1111*i for i in range(10)]\n",
+    "dst[:] = [0.0 for i in range(10)]\n",
+    "\n",
+    "accl0.send(src, len(src), 1, tag=0, compress_dtype=np.dtype('float16'))\n",
+    "accl1.recv(dst, len(dst), 0, tag=0, compress_dtype=np.dtype('float16'))\n",
+    "\n",
+    "print(src)\n",
+    "print(dst)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## De-Initialize ACCL instances\n",
+    "The `deinit()` function clears all internal data structures in the ACCL instance."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "accl0.deinit()\n",
+    "accl1.deinit()"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3.8.10 64-bit",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.10"
+  },
+  "vscode": {
+   "interpreter": {
+    "hash": "916dbcbb3f70747c44a77c7bcd40155683ae19c65e1c03b4aa3499c5328201f1"
+   }
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
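
The two experiments in the notebook above can be approximated on a CPU without ACCL or an FPGA. The sketch below is an emulation under stated assumptions, not ACCL's implementation: it rounds values with the `f` (IEEE 754 single precision) and `e` (half precision) formats of Python's `struct` module, and the helper names `to_fp32` and `to_fp16` are hypothetical. It compares summing in FP32 before compressing against compressing before summing in FP16, then round-trips a buffer through FP16 the way a compressed send/receive pair would.

```python
import math
import struct


def to_fp32(x: float) -> float:
    """Round to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_fp16(x: float) -> float:
    """Round to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Same operand values as the notebook's combine() example.
op0 = [to_fp32(math.pi * i) for i in range(10)]
op1 = [to_fp32(1.1 * i) for i in range(10)]

# Arithmetic on the uncompressed dtype: sum in FP32, then convert the result.
sum_then_compress = [to_fp16(to_fp32(a + b)) for a, b in zip(op0, op1)]
# Arithmetic on the compressed dtype: convert both operands, then sum in FP16.
compress_then_sum = [to_fp16(to_fp16(a) + to_fp16(b)) for a, b in zip(op0, op1)]

max_gap = max(abs(x - y) for x, y in zip(sum_then_compress, compress_then_sum))
print("largest gap between the two orderings:", max_gap)

# Same payload values as the notebook's send()/recv() example.
src = [to_fp32(1.1111 * i) for i in range(10)]
dst = [to_fp16(v) for v in src]  # FP32 -> FP16 on send, FP16 -> FP32 on receive

fp32_bytes = struct.calcsize("<10f")  # payload size without compression
fp16_bytes = struct.calcsize("<10e")  # payload size with FP16 compression
print(f"payload: {fp32_bytes} B uncompressed, {fp16_bytes} B compressed")
print("largest round-trip error:", max(abs(s - d) for s, d in zip(src, dst)))
```

In this emulation, FP16 compression halves the payload at the cost of a relative rounding error of at most 2^-11 per element for values in FP16's normal range. Results on real hardware may differ further, as the notebook notes.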
diff --git a/src/pyaccl/notebooks/primitives.ipynb b/src/pyaccl/notebooks/primitives.ipynb
@@ -4,7 +4,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# ACCL Primitives (Emulator/Simulator)\n",
+    "# ACCL Primitives\n",
     "ACCL primitives are a set of simple operations that an ACCL instance can execute and assemble into larger operations such as collectives. The primitives are:\n",
     "* Copy - a simple DMA operation from a local source buffer to a local destination buffer\n",
     "* Combine - applying a binary elementwise operator to two source buffers and placing the result in the destination buffer\n",