+ + +The performance of Python applications that use TACO can be measured using
+Python's built-in time.perf_counter
function with minimal changes to the
+applications. As an example, we can benchmark the performance of the
+scientific computing application shown here as
+follows:
import pytaco as pt
+from pytaco import compressed, dense
+import numpy as np
+import time
+
+csr = pt.format([dense, compressed])
+dv = pt.format([dense])
+
+A = pt.read("pwtk.mtx", csr)
+x = pt.from_array(np.random.uniform(size=A.shape[1]))
+z = pt.from_array(np.random.uniform(size=A.shape[0]))
+y = pt.tensor([A.shape[0]], dv)
+
+i, j = pt.get_index_vars(2)
+y[i] = A[i, j] * x[j] + z[i]
+
+# Tell TACO to generate code to perform the SpMV computation
+y.compile()
+
+# Benchmark the actual SpMV computation
+start = time.perf_counter()
+y.compute()
+end = time.perf_counter()
+
+print("Execution time: {0} seconds".format(end - start))
+In order to accurately measure TACO's computational performance, only the
+time it takes to actually perform a computation should be measured. The time
+it takes to generate code under the hood for performing that computation should
+not be measured, since this overhead can be quite variable but can often be
+amortized in practice. By default though, TACO will only generate and compile
+code it needs for performing a computation immediately before it has to
+actually perform the computation. As the example above demonstrates, by
+manually calling the result tensor's compile
method, we can tell TACO to
+generate code needed for performing the computation before benchmarking starts,
+letting us measure only the performance of the computation itself.
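For more stable results, the timed computation can also be repeated and the measurements aggregated. Below is a minimal sketch of this approach; the run count of 10 and the choice of the minimum are arbitrary, and it assumes that y has been defined and compiled as above and that repeated calls to compute simply re-execute the generated kernel:

import time

times = []
for _ in range(10):
    start = time.perf_counter()
    y.compute()
    end = time.perf_counter()
    times.append(end - start)

# The minimum is often the most reproducible statistic; the mean is also common.
print("Best of {0} runs: {1} seconds".format(len(times), min(times)))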
Warning
+pytaco.evaluate
and pytaco.einsum
should not be used to benchmark
+TACO's computational performance, since timing those functions will
+include the time it takes to generate code for performing the computation.
The time it takes to construct the initial operand tensors should also not be
+measured, since again this overhead can often be amortized in practice. By
+default, pytaco.read
and functions for converting NumPy arrays and SciPy
+matrices to TACO tensors return fully constructed tensors. If you add nonzero
+elements to an operand tensor by invoking its insert
method though, then
+pack
must also be explicitly invoked before any benchmarking is done:
import pytaco as pt
+from pytaco import compressed, dense
+import numpy as np
+import random
+import time
+
+csr = pt.format([dense, compressed])
+dv = pt.format([dense])
+
+A = pt.read("pwtk.mtx", csr)
+x = pt.tensor([A.shape[1]], dv)
+z = pt.tensor([A.shape[0]], dv)
+y = pt.tensor([A.shape[0]], dv)
+
+# Insert random values into x and z and pack them into dense arrays
+for k in range(A.shape[1]):
+ x.insert([k], random.random())
+x.pack()
+for k in range(A.shape[0]):
+ z.insert([k], random.random())
+z.pack()
+
+i, j = pt.get_index_vars(2)
+y[i] = A[i, j] * x[j] + z[i]
+
+y.compile()
+
+start = time.perf_counter()
+y.compute()
+end = time.perf_counter()
+
+print("Execution time: {0} seconds".format(end - start))
TACO does, however, avoid regenerating code for the same computation as long as the computation is redefined with the same index variables and with the same operand and result tensors. Thus, if your application executes the same computation many times in a loop and if the computation is executed on sufficiently large data sets, TACO will naturally amortize the overhead associated with generating code for performing the computation. In such scenarios, it is acceptable to include the initial code generation overhead in the performance measurement:
+import pytaco as pt
+from pytaco import compressed, dense
+import numpy as np
import random
import time
+
+csr = pt.format([dense, compressed])
+dv = pt.format([dense])
+
+A = pt.read("pwtk.mtx", csr)
+x = pt.tensor([A.shape[1]], dv)
+z = pt.tensor([A.shape[0]], dv)
+y = pt.tensor([A.shape[0]], dv)
+
+for k in range(A.shape[1]):
+ x.insert([k], random.random())
+x.pack()
+for k in range(A.shape[0]):
+ z.insert([k], random.random())
+z.pack()
+
+i, j = pt.get_index_vars(2)
+
+# Benchmark the iterative SpMV computation, including overhead for
+# generating code in the first iteration to perform the computation
+start = time.perf_counter()
+for k in range(1000):
+ y[i] = A[i, j] * x[j] + z[i]
+ y.evaluate()
+ x[i] = y[i]
+ x.evaluate()
+end = time.perf_counter()
+
+print("Execution time: {0} seconds".format(end - start))
+Warning
+In order to avoid regenerating code for performing a computation, the
+computation must be redefined with the exact same index variable objects
+and also with the exact same tensor objects for operands and result. In
+the example above, every loop iteration redefines the computation of y
+and x
using the same tensor and index variable objects constructed outside
+the loop, so TACO will only generate code to compute y
and x
in the
+first iteration. If the index variables were constructed inside the loop
+though, TACO would regenerate code to compute y
and x
in every loop
+iteration, and the compilation overhead would not be amortized.
Note
+As a rough rule of thumb, if a computation takes on the order of seconds or +more in total to perform across all invocations with identical operands and +result (and is always redefined with identical index variables), then it is +acceptable to include the overhead associated with generating code for +performing the computation in performance measurements.
+Tensor algebra computations can be expressed in TACO with tensor index +notation, which at a high level describes how each element in the output tensor +can be computed from elements in the input tensors. As an example, matrix +addition can be expressed in index notation as
+A(i,j) = B(i,j) + C(i,j)
+where A
, B
, and C
denote order-2 tensors (i.e. matrices) while i
and
+j
are index variables that represent abstract indices into the corresponding
+dimensions of the tensors. In words, the example above essentially states that,
+for every i
and j
, the element in the i
-th row and j
-th column of
+A
should be assigned the sum of the corresponding elements in B
and C
.
+Similarly, element-wise multiplication of three order-3 tensors can be
+expressed in index notation as follows
A(i,j,k) = B(i,j,k) * C(i,j,k) * D(i,j,k)
+The syntax shown above corresponds to exactly what you would have to write in +C++ with TACO to define tensor algebra computations. Note, however, that prior +to defining a tensor algebra computation, all index variables have to be +declared. This can be done as shown below:
+IndexVar i, j, k; // Declare index variables for previous example
+In both of the previous examples, all of the index variables are used to index +into both the output and the inputs. However, it is possible for an index +variable to be used to index into the inputs only, in which case the index +variable is reduced (summed) over. For instance, the following example
+y(i) = A(i,j) * x(j)
can be rewritten with the summation more explicit as $y(i) = \sum_{j} A(i,j) \cdot x(j)$ and demonstrates how matrix-vector multiplication can be expressed in index notation.
+Note that, in TACO, reductions are assumed to be over the smallest +subexpression that captures all uses of the corresponding reduction variable. +For instance, the following computation
+y(i) = A(i,j) * x(j) + z(i)
+can be rewritten with the summation more explicit as
$y(i) = \left(\sum_{j} A(i,j) \cdot x(j)\right) + z(i)$
+whereas the following computation
+y(i) = A(i,j) * x(j) + z(j)
+can be rewritten with the summation more explicit as
$y(i) = \sum_{j} \left(A(i,j) \cdot x(j) + z(j)\right)$
+Once a tensor algebra computation has been defined (and all of the inputs have
+been initialized), you can simply invoke the
+output tensor's evaluate
method to perform the actual computation:
A.evaluate(); // Perform the computation defined previously for output tensor A
+Under the hood, when you invoke the evaluate
method, TACO first invokes the
+output tensor's compile
method to generate kernels that assemble the output
indices (if the tensor contains any sparse dimensions) and that perform the
+actual computation. TACO would then call the two generated kernels by invoking
+the output tensor's assemble
and compute
methods. You can manually invoke
+these methods instead of calling evaluate
as demonstrated below:
A.compile(); // Generate output assembly and compute kernels
+A.assemble(); // Invoke the output assembly kernel to assemble the output indices
+A.compute(); // Invoke the compute kernel to perform the actual computation
+This can be useful if you want to perform the same computation multiple times,
+in which case it suffices to invoke compile
once before the first time the
+computation is performed.
It is also possible to compute on tensors without having to explicitly invoke
+compile
, assemble
, or compute
. Once you attempt to modify or view the
+output of a computation, TACO would automatically invoke those methods if
+necessary in order to compute the values in the output tensor. If the input to
+a computation is itself the output of another computation, then TACO would also
+automatically ensure that the latter computation is fully executed first.
When using the TACO C++ library, the typical usage is to declare your input
+taco::Tensor
structures, then add data to these structures using the insert
+method. This is wasteful if the data is already loaded into memory in a
+compatible format; TACO can use this data directly without copying it. Below
+are some usage examples for common situations where a user may want to do this.
A two-dimensional CSR matrix can be created using three arrays:
+rowptr
(array of int
): list of indices in colidx
representing starts of rowscolidx
(array of int
): list of column indices of non-zero valuesvals
(array of T
for Tensor<T>
): list of non-zero values corresponding to columns in colidx
The taco::makeCSR<T>
function takes these arrays and creates a
+taco::Tensor<T>
. The following example constructs a 5x10 matrix populated
+with a few values.
int *rowptr = new int[6]{0, 2, 4, 4, 4, 7};
+int *colidx = new int[7]{3, 5, 0, 7, 7, 8, 9};
+double *values = new double[7]{0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7};
+Tensor<double> A = makeCSR("A", {5, 10}, rowptr, colidx, values);
+Similarly, a two-dimensional CSC matrix can be created from the appropriate
+arrays using the taco::makeCSC<T>
function. This example constructs the same
+5x10 matrix from the CSR example above, but in CSC format.
int *colptr = new int[11]{0, 1, 1, 1, 2, 2, 3, 3, 5, 6, 7};
+int *rowidx = new int[7]{1, 0, 0, 1, 4, 4, 4};
+double *values = new double[7]{0.3, 0.1, 0.2, 0.4, 0.5, 0.6, 0.7};
+Tensor<double> B = makeCSC("B", {5, 10}, colptr, rowidx, values);
+For single-dimension dense vectors, you can use an array of values (of type T
+for a Tensor<T>
). There is no helper function for this (like makeCSR
or
+makeCSC
), but it can be done. This example constructs a 1x10 dense vector.
// Create an array of double values.
+double *x_values = new double[10];
+for (int i = 0; i < 10; i++) {
+ x_values[i] = i;
+}
+
+// Create the Tensor and set its storage to our array of values.
+Tensor<double> x({10}, Dense);
+Array x_array = makeArray<double>(x_values, 10);
+TensorStorage x_storage = x.getStorage();
+x_storage.setValues(x_array);
+x.setStorage(x_storage);
+
+ Matricized tensor times Khatri-Rao product (MTTKRP) is a bottleneck operation +in various algorithms - such as Alternating Least Squares - for computing +sparse tensor factorizations like the Canonical Polyadic Decomposition. +Mathematically, mode-1 MTTKRP (for three-dimensional tensors) can be expressed +as
$A = B_{(1)} \left( D \odot C \right)$
where $A$, $C$, and $D$ are typically dense matrices, $B$ is a three-dimensional tensor (matricized along the first mode), and $\odot$ denotes the Khatri-Rao product. This operation can also be expressed in index notation as
$A_{ij} = \sum_{k} \sum_{l} B_{ikl} \cdot D_{lj} \cdot C_{kj}$
+You can use the TACO C++ API to easily and efficiently compute the MTTKRP, as +shown here: +
// On Linux and MacOS, you can compile and run this program like so:
+// g++ -std=c++11 -O3 -DNDEBUG -DTACO -I ../../include -L../../build/lib mttkrp.cpp -o mttkrp -ltaco
+// LD_LIBRARY_PATH=../../build/lib ./mttkrp
+#include <random>
+#include "taco.h"
+using namespace taco;
+int main(int argc, char* argv[]) {
+ std::default_random_engine gen(0);
+ std::uniform_real_distribution<double> unif(0.0, 1.0);
+ // Predeclare the storage formats that the inputs and output will be stored as.
+ // To define a format, you must specify whether each dimension is dense or
+ // sparse and (optionally) the order in which dimensions should be stored. The
+ // formats declared below correspond to compressed sparse fiber (csf) and
+ // row-major dense (rm).
+ Format csf({Sparse,Sparse,Sparse});
+ Format rm({Dense,Dense});
+
+ // Load a sparse order-3 tensor from file (stored in the FROSTT format) and
+ // store it as a compressed sparse fiber tensor. The tensor in this example
  // can be downloaded from: http://frostt.io/tensors/nell-2/
+ Tensor<double> B = read("nell-2.tns", csf);
+ // Generate a random dense matrix and store it in row-major (dense) format.
+ // Matrices correspond to order-2 tensors in taco.
+ Tensor<double> C({B.getDimension(1), 25}, rm);
+ for (int i = 0; i < C.getDimension(0); ++i) {
+ for (int j = 0; j < C.getDimension(1); ++j) {
+ C.insert({i,j}, unif(gen));
+ }
+ }
+ C.pack();
+
+
+ // Generate another random dense matrix and store it in row-major format.
+ Tensor<double> D({B.getDimension(2), 25}, rm);
+ for (int i = 0; i < D.getDimension(0); ++i) {
+ for (int j = 0; j < D.getDimension(1); ++j) {
+ D.insert({i,j}, unif(gen));
+ }
+ }
+ D.pack();
+
+ // Declare the output matrix to be a dense matrix with 25 columns and the same
+ // number of rows as the number of slices along the first dimension of input
+ // tensor B, to be also stored as a row-major dense matrix.
+ Tensor<double> A({B.getDimension(0), 25}, rm);
+
+
+ // Define the MTTKRP computation using index notation.
+ IndexVar i, j, k, l;
+ A(i,j) = B(i,k,l) * D(l,j) * C(k,j);
+ // At this point, we have defined how entries in the output matrix should be
+ // computed from entries in the input tensor and matrices but have not actually
+ // performed the computation yet. To do so, we must first tell taco to generate
+ // code that can be executed to compute the MTTKRP operation.
+ A.compile();
+ // We can now call the functions taco generated to assemble the indices of the
+ // output matrix and then actually compute the MTTKRP.
+ A.assemble();
+ A.compute();
+ // Write the output of the computation to file (stored in the FROSTT format).
+ write("A.tns", A);
+}
+You can also use the TACO Python API to perform the same computation, as +demonstrated here:
+import pytaco as pt
+import numpy as np
+from pytaco import compressed, dense
+
+# Define formats for storing the sparse tensor and dense matrices
+csf = pt.format([compressed, compressed, compressed])
+rm = pt.format([dense, dense])
+
+# Load a sparse three-dimensional tensor from file (stored in the FROSTT
+# format) and store it as a compressed sparse fiber tensor. The tensor in this
# example can be downloaded from: http://frostt.io/tensors/nell-2/
B = pt.read("nell-2.tns", csf)
+
+# Generate two random matrices using NumPy and pass them into TACO
+C = pt.from_array(np.random.uniform(size=(B.shape[1], 25)))
+D = pt.from_array(np.random.uniform(size=(B.shape[2], 25)))
+
+# Declare the result to be a dense matrix
+A = pt.tensor([B.shape[0], 25], rm)
+
+# Declare index vars
i, j, k, l = pt.get_index_vars(4)
+
+# Define the MTTKRP computation
+A[i, j] = B[i, k, l] * D[l, j] * C[k, j]
+
+# Perform the MTTKRP computation and write the result to file
+pt.write("A.tns", A)
+When you run the above Python program, TACO will generate code under the hood +that efficiently performs the computation in one shot. This lets TACO avoid +materializing the intermediate Khatri-Rao product, thus reducing the amount of +memory accesses and speeding up the computation.
+ +TACO is a library for performing sparse and +dense linear algebra and tensor algebra computations. The computations can +range from relatively simple ones like sparse matrix-vector multiplication to +more complex ones like matricized tensor times Khatri-Rao product. All these +computations can be performed on any mix of dense and sparse tensors. Under the +hood, TACO automatically generates efficient code to perform these +computations.
+The sidebar to the left links to documentation for the TACO C++ and Python +APIs as well as some examples demonstrating how TACO can be used in +real-world applications.
+Questions and bug reports can be submitted here.
+ +Sampled dense-dense matrix product (SDDMM) is a bottleneck operation in many +factor analysis algorithms used in machine learning, including Alternating +Least Squares and Latent Dirichlet Allocation. Mathematically, the operation +can be expressed as
$A = B \circ (C D)$
where $A$ and $B$ are sparse matrices, $C$ and $D$ are dense matrices, and $\circ$ denotes component-wise multiplication. This operation can also be expressed in index notation as
$A_{ij} = B_{ij} \cdot \sum_{k} C_{ik} \cdot D_{kj}$
+You can use the TACO C++ API to easily and efficiently compute the SDDMM, as +shown here:
+// On Linux and MacOS, you can compile and run this program like so:
+// g++ -std=c++11 -O3 -DNDEBUG -DTACO -I ../../include -L../../build/lib sddmm.cpp -o sddmm -ltaco
+// LD_LIBRARY_PATH=../../build/lib ./sddmm
+#include <random>
+#include "taco.h"
+using namespace taco;
+int main(int argc, char* argv[]) {
+ std::default_random_engine gen(0);
+ std::uniform_real_distribution<double> unif(0.0, 1.0);
+ // Predeclare the storage formats that the inputs and output will be stored as.
+ // To define a format, you must specify whether each dimension is dense or sparse
+ // and (optionally) the order in which dimensions should be stored. The formats
+ // declared below correspond to doubly compressed sparse row (dcsr), row-major
+ // dense (rm), and column-major dense (dm).
+ Format dcsr({Sparse,Sparse});
+ Format rm({Dense,Dense});
+ Format cm({Dense,Dense}, {1,0});
+
+ // Load a sparse matrix from file (stored in the Matrix Market format) and
+ // store it as a doubly compressed sparse row matrix. Matrices correspond to
  // order-2 tensors in taco. The matrix in this example can be downloaded from:
+ // https://www.cise.ufl.edu/research/sparse/MM/Williams/webbase-1M.tar.gz
+ Tensor<double> B = read("webbase-1M.mtx", dcsr);
+ // Generate a random dense matrix and store it in row-major (dense) format.
+ Tensor<double> C({B.getDimension(0), 1000}, rm);
+ for (int i = 0; i < C.getDimension(0); ++i) {
+ for (int j = 0; j < C.getDimension(1); ++j) {
+ C.insert({i,j}, unif(gen));
+ }
+ }
+ C.pack();
+
+ // Generate another random dense matrix and store it in column-major format.
+ Tensor<double> D({1000, B.getDimension(1)}, cm);
+ for (int i = 0; i < D.getDimension(0); ++i) {
+ for (int j = 0; j < D.getDimension(1); ++j) {
+ D.insert({i,j}, unif(gen));
+ }
+ }
+ D.pack();
+
+ // Declare the output matrix to be a sparse matrix with the same dimensions as
+ // input matrix B, to be also stored as a doubly compressed sparse row matrix.
+ Tensor<double> A(B.getDimensions(), dcsr);
+
+ // Define the SDDMM computation using index notation.
+ IndexVar i, j, k;
+ A(i,j) = B(i,j) * C(i,k) * D(k,j);
+
+ // At this point, we have defined how entries in the output matrix should be
+ // computed from entries in the input matrices but have not actually performed
+ // the computation yet. To do so, we must first tell taco to generate code that
+ // can be executed to compute the SDDMM operation.
+ A.compile();
+ // We can now call the functions taco generated to assemble the indices of the
+ // output matrix and then actually compute the SDDMM.
+ A.assemble();
+ A.compute();
+ // Write the output of the computation to file (stored in the Matrix Market format).
+ write("A.mtx", A);
+}
+You can also use the TACO Python API to perform the same computation, as +demonstrated here:
+import pytaco as pt
+from pytaco import dense, compressed
+import numpy as np
+
+# Define formats that the inputs and output will be stored as. To define a
+# format, you must specify whether each dimension is dense or sparse and
+# (optionally) the order in which dimensions should be stored. The formats
+# declared below correspond to doubly compressed sparse row (dcsr), row-major
+# dense (rm), and column-major dense (dm).
+dcsr = pt.format([compressed, compressed])
+rm = pt.format([dense, dense])
+cm = pt.format([dense, dense], [1, 0])
+
# The matrix in this example can be downloaded from:
+# https://www.cise.ufl.edu/research/sparse/MM/Williams/webbase-1M.tar.gz
+B = pt.read("webbase-1M.mtx", dcsr)
+
+# Generate two random matrices using NumPy and pass them into TACO
C = pt.from_array(np.random.uniform(size=(B.shape[0], 1000)))
D = pt.from_array(np.random.uniform(size=(1000, B.shape[1])), out_format=cm)
+
+# Declare the result to be a doubly compressed sparse row matrix
+A = pt.tensor(B.shape, dcsr)
+
+# Declare index vars
+i, j, k = pt.get_index_vars(3)
+
+# Define the SDDMM computation
+A[i, j] = B[i, j] * C[i, k] * D[k, j]
+
+# Perform the SDDMM computation and write the result to file
+pt.write("A.mtx", A)
+When you run the above Python program, TACO will generate code under the hood +that efficiently performs the computation in one shot. This lets TACO only +compute elements of the intermediate dense matrix product that are actually +needed to compute the result, thus reducing the asymptotic complexity of the +computation.
This section describes various strategies for improving the performance of applications that use TACO to perform linear and tensor algebra computations.
+TACO supports storing tensors in a wide range of formats, including many +commonly used ones like dense arrays, compressed sparse row (CSR), and +compressed sparse fiber (CSF). Using the right formats to store a sparse +computation's operands and result can not only reduce the amount of memory +needed to perform the computation but also improve its performance. In +particular, by selecting formats that accurately describe the sparsity and +structure of the operands, TACO can generate code under the hood that exploits +these properties of the data to avoid redundantly computing with zero elements +and thus speed up a computation.
+As previously explained, TACO uses a +novel scheme that describes different tensor storage formats by specifying +whether each dimension is sparse or dense. A dense dimension indicates to TACO +that most if not all slices of the tensor along that dimension contain at least +one nonzero element. So if every element in a matrix is nonzero, we can make +that explicit by storing the matrix in a format where both dimensions are +dense, which indicates that every row is nonempty and that every column in each +row stores a nonzero element:
+pytaco.format([pytaco.dense, pytaco.dense]) # a.k.a. a dense array
+A sparse dimension, on the other hand, indicates to TACO that most slices of +the tensor along that dimension contain only zeros. So if relatively few rows +of a matrix is nonempty and if relatively few columns in each nonempty row +store nonzero elements, we can also make that explicit by storing the matrix in +a format where both dimensions are sparse:
+pytaco.format([pytaco.compressed, pytaco.compressed]) # a.k.a. a DCSR matrix
+Tip
+Storing a tensor dimension as a sparse dimension incurs overhead that is +proportional to the number of nonempty slices along that dimension, so only +do so if most slices are actually empty. Otherwise, it is more appropriate +to store the dimension as a dense dimension.
+It is easy to define custom formats for storing tensors with complex +sparsity structures. For example, let's say we have a three-dimensional +tensor that has no empty slice along the dimension, and let's +say that each row in a slice is either entirely empty (i.e., +for all and some fixed , ) or entirely full (i.e., for all and some fixed , ). Following the same scheme +as before, we can define a tensor format that stores dimension 2 (i.e., the + dimension) as a dense dimension, stores dimension 0 (i.e., the +dimension) of each slice along dimension 2 as a sparse dimension, and stores +dimension 1 (i.e., the dimension) of each nonempty row as a dense +dimension also:
+pytaco.format([pytaco.dense, pytaco.compressed, pytaco.dense], [2, 0, 1])
+Using the format above, we can then efficiently store without explicitly +storing any zero element.
+As a rough rule of thumb, using formats that accurately describe the sparsity +and structure of the operands lets TACO minimize memory traffic incurred to +load tensors from memory as well as minimize redundant work done to perform a +computation, which boosts performance. This is particularly the case when only +one operand is sparse and the computation does not involve adding elements of +multiple operands. This is not a hard and fast rule though. In +particular, computing with multiple sparse operands might prevent TACO from +applying some optimizations like parallelization +that might otherwise be possible if some of those operands were stored in dense +formats. Depending on how sparse your data actually is, this may or may not +negatively impact performance.
+The most reliable way to determine what are the best formats for storing +tensors in your application is to just try out many different formats and see +what works. Fortunately, as the examples above demonstrate, this is simple to +do with TACO.
+By default, TACO performs all computations using a single thread. The maximum
+number of threads that TACO may use to perform computations can be adjusted by
+calling the pytaco.set_num_threads
function. The example below, for
+instance, tells TACO that up to four threads may be used to execute any
+subsequent computation in parallel if possible:
pytaco.set_num_threads(4)
+In general, the maximum number of threads for performing computations should +not be set greater than the number of available processor cores. And depending +on the specific computation and characteristics of the data, setting the +maximum number of threads to be less than the number of processor cores may +actually yield better performance. As the example above demonstrates, TACO +makes it easy to try out different numbers of threads and see what works best +for your application.
+Note
+Setting the maximum number of available threads to be greater than one does +not guarantee that all computations will be executed in parallel. In +particular, TACO will not execute a computation in parallel if
+If TACO does not seem to be executing a computation in parallel, using +different formats to store the operands and result may help.
+By default, when performing computations in parallel, TACO will assign the same
+number of coordinates along a particular dimension to be processed by each
+thread. For instance, when adding two 1000-by-1000 matrices using two threads,
+TACO will have each thread compute exactly 500 rows of the result. This would
+be inefficient though if, for instance, all the nonzeros are stored in the
+first 500 rows of the operands, since one thread would end up doing all the
+work while the other thread does nothing. In cases like this, an alternative
+parallelization strategy can be specified by calling the
+pytaco.set_parallel_schedule
function:
pt.set_parallel_schedule("dynamic")
+In contrast to the default parallelization strategy, the dynamic strategy will +have each thread first compute just one row of the result. Whenever a thread +finishes computing a row, TACO will assign another row for that thread to +compute, and this process is repeated until all 1000 rows have been computed. +In this way, work is guaranteed to be evenly distributed between the two +threads regardless of the sparsity structures of the operands.
+Using a dynamic strategy for parallel execution will incur some overhead +though, since work is assigned to threads at runtime. This overhead can be +reduced by increasing the chunk size, which is the amount of additional work +that is assigned to a thread whenever it completes its previously assigned +work. The example below, for instance, tells TACO to assign ten additional +rows of the result (instead of just one) for a thread to compute whenever it +has completed the previous ten rows:
+pt.set_parallel_schedule("dynamic", 10)
+Since dynamic parallelization strategies incur additional overhead, whether or +not using them improves the performance of a computation will depend on how +evenly spread out the nonzero elements in the tensor operands are. If each +matrix contains roughly the same number of nonzeros in every row, for instance, +then using a dynamic strategy will likely not more evenly distribute work +between threads. In that case, the default static schedule would likely yield +better performance.
+TACO supports efficiently computing complicated tensor algebra expressions
+involving many discrete operations in a single shot. Let's say, for instance,
+that we would like to (element-wise) add two vectors b
and c
and compute
+the cosine of each element in the sum. We can, of course, simply compute the
+addition and the cosine of the sum in separate statements:
t[i] = b[i] + c[i]
+a[i] = pt.cos(t[i])
+The program above will first invoke TACO to add b
and c
, store the result
+into a temporary vector t
, and then invoke TACO again to compute the cosine
+of every element in t
. Performing the computation this way though not only
+requires additional memory for storing t
but also requires accessing the
+memory subsystem to first write t
to memory and then load t
back from
+memory, which is inefficient if the vectors are large and cannot be stored in
+cache. Instead, we can compute the addition and the cosine of the sum in a
+single statement:
a[i] = pt.cos(b[i] + c[i])
+For the program above, TACO will automatically generate code that, for every
+i
, immediately computes the cosine of b[i] + c[i]
as soon as the sum is
+computed. TACO thus avoids storing the sum of b
and c
in a temporary
+vector, thereby increasing the performance of the computation.
Fusing computations can improve performance if it does not require intermediate
+results to be recomputed multiple times, as is the case with the previous
+example. Let's say, however, that we would like to multiply a matrix B
by a
+vector c
and then multiply another matrix A
by the result of the first
+multiplication. As before, we can express both operations in a single
+statement:
y[i] = A[i,j] * B[j,k] * x[k]
+In this case though, computing both operations in one shot would require that
+the multiplication of B
and x
be redundantly recomputed for every
+(non-empty) row of A
, thus reducing performance. By contrast, computing the
+two matrix-vector multiplications in separate statement ensures that the result
+of the first matrix-vector multiplication does not have to be redundantly
+computed, thereby minimizing the amount of work needed to perform the
+computation:
t[j] = B[j,k] * c[k]
+y[i] = A[i,j] * t[j]
+
+ Tensor algebra computations can be expressed in TACO using tensor index +notation, which at a high level describes how each element in the result tensor +can be computed from elements in the operand tensors. As an example, matrix +addition can be expressed in index notation as
$A_{ij} = B_{ij} + C_{ij}$
where $A$, $B$, and $C$ denote two-dimensional tensors (i.e., matrices) while $i$ and $j$ are index variables that represent abstract indices into the corresponding dimensions of the tensors. In plain English, the example above essentially states that, for every $i$ and $j$, the element in the $i$-th row and $j$-th column of $A$ should be assigned the sum of the corresponding elements in $B$ and $C$. Similarly, element-wise multiplication of three tensors can be expressed in index notation as
$A_{ijk} = B_{ijk} \cdot C_{ijk} \cdot D_{ijk}$
+To define the same computation using the TACO Python API, we can write very +similar code, with the main difference being that we also have to explicitly +declare the index variables beforehand:
+i, j, k = pytaco.index_var(), pytaco.index_var(), pytaco.index_var()
+A[i,j,k] = B[i,j,k] * C[i,j,k] * D[i,j,k]
+This can also be rewritten more compactly as
+i, j, k = pytaco.get_index_vars(3)
+A[i,j,k] = B[i,j,k] * C[i,j,k] * D[i,j,k]
+Note
+Accesses to scalars also require the square brackets notation. Since
+scalars are equivalent to tensors with zero dimension, None
must be
+explicitly specified as indices to indicate that no index variable is
+needed to access a scalar. As an example, the following expresses the
+addition of two scalars beta
and gamma
:
alpha[None] = beta[None] + gamma[None]
+Warning
+TACO currently does not support computations that have a tensor as both an +operand and the result, such as the following:
+a[i] = a[i] * b[i]
+Such computations can be rewritten using explicit temporaries as follows:
+t[i] = a[i] * b[i]
+a[i] = t[i]
+Warning
+TACO currently does not support using the same index variable to index into
+multiple dimensions of the same tensor operand (e.g., A[i,i]
).
In all of the previous examples, all the index variables are used to index into +both the result and the operands of a computation. It is also possible for +an index variable to be used to index into the operands only, in which case the +dimension indexed by that index variable is reduced (summed) over. For +instance, the computation
++ +
$y_i = A_{ij} \cdot x_j$
++ +
$y_i = \sum_{j} A_{ij} \cdot x_j$
+i, j = pytaco.get_index_vars(2)
+
+y[i] = A[i,j] * x[j]
+y[i] = pytaco.sum(j, A[i,j] * x[j])
+Reductions that are not explicitly expressed are assumed to be over the +smallest subexpression that captures all uses of the corresponding reduction +variable. For instance, the computation
++ +
$y_i = A_{ij} \cdot x_j + z_i$
++ +
$y_i = \left(\sum_{j} A_{ij} \cdot x_j\right) + z_i$
++ +
$y_i = A_{ij} \cdot x_j + z_j$
++ +
$y_i = \sum_{j} \left(A_{ij} \cdot x_j + z_j\right)$
+dimensions. The following example, for instance, broadcasts the vector c
+along the row dimension of matrix B
, adding c
to each row of B
:
A[i, j] = B[i, j] + c[j]
+However, TACO does not support NumPy-style broadcasting of dimensions that have +a size of one. For example, the following is not allowed:
+A = pt.tensor([3,3])
+B = pt.tensor([3,3])
+C = pt.tensor([3,1])
+i, j = pt.get_index_vars(2)
+
+A[i, j] = B[i, j] + C[i, j] # ERROR!!
+Computations that transpose tensors can be expressed by rearranging the order
+in which index variables are used to access tensor operands. The following
+example, for instance, adds matrix B
to the transpose of matrix C
and
+stores the result in matrix A
:
A = pt.tensor([3,3], pt.format([dense, dense]))
+B = pt.tensor([3,3], pt.format([dense, dense]))
+C = pt.tensor([3,3], pt.format([dense, dense]))
+i, j = pt.get_index_vars(2)
+
+A[i,j] = B[i,j] + C[j,i]
+Note, however, that sparse dimensions of tensor operands impose dependencies on
+the order in which they can be accessed, based on the order in which they are
+stored in the operand formats. This means, for instance, that if B
is a CSR
+matrix, then B[i,j]
requires that the dimension indexed by i
be accessed
+before the dimension indexed by j
. TACO does not support any computation
+where these constraints form a cyclic dependency. So the same computation from
+before is not supported for CSR matrices, since the access of B
requires that
+i
be accessed before j
but the access of C
requires that j
be accessed
+before i
:
A = pt.tensor([3,3], pt.format([dense, compressed]))
+B = pt.tensor([3,3], pt.format([dense, compressed]))
+C = pt.tensor([3,3], pt.format([dense, compressed]))
+i, j = pt.get_index_vars(2)
+
+A[i,j] = B[i,j] + C[j,i] # ERROR!!
+As an alternative, you can first explicitly transpose C
by invoking its
+transpose
method, storing the result in a temporary, and then perform the
+addition with the already-transposed temporary:
A = pt.tensor([3,3], pt.format([dense, compressed]))
+B = pt.tensor([3,3], pt.format([dense, compressed]))
+C = pt.tensor([3,3], pt.format([dense, compressed]))
+i, j = pt.get_index_vars(2)
+
+Ct = C.transpose([1, 0]) # Ct is also stored in the CSR format
+A[i,j] = B[i,j] + Ct[i,j]
+Similarly, the following computation is not supported for the same reason that
+the access of B
, which is stored in row-major order, requires i
to be
+accessed before j
but the access of C
, which is stored in column-major
+order, requires j
to be accessed before i
:
A = pt.tensor([3,3], pt.format([dense, compressed]))
+B = pt.tensor([3,3], pt.format([dense, compressed]))
+C = pt.tensor([3,3], pt.format([dense, compressed], [1, 0]))
+i, j = pt.get_index_vars(2)
+
+A[i,j] = B[i,j] + C[i,j] # ERROR!!
+We can again perform the same computation by invoking transpose
, this time to
+repack C
into the same CSR format as A
and B
before computing the
+addition:
A = pt.tensor([3,3], pt.format([dense, compressed]))
+B = pt.tensor([3,3], pt.format([dense, compressed]))
+C = pt.tensor([3,3], pt.format([dense, compressed], [1, 0]))
+i, j = pt.get_index_vars(2)
+
+Cp = C.transpose([0, 1], pt.format([dense, compressed])) # Store a copy of C in the CSR format
+A[i,j] = B[i,j] + Cp[i,j]
+Once a tensor algebra computation has been defined, you can simply invoke the
+result tensor's evaluate
method to perform the actual computation:
A.evaluate()
+Under the hood, TACO will first invoke the result tensor's compile
+method to generate code that performs the computation. TACO will then perform
+the actual computation by first invoking assemble
to compute the sparsity
+structure of the result and subsequently invoking compute
to compute the
+values of the result's nonzero elements. Of course, you can also manually
+invoke these methods in order to more precisely control when each step happens:
A.compile()
+A.assemble()
+A.compute()
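Manually invoking these methods is useful, for instance, when the same computation must be re-run after the operands' values change: compile can be invoked once and the generated kernel then re-executed directly. A minimal sketch, assuming re-invoking compute re-runs the compiled kernel:

A.compile()   # generate code for the computation once
A.assemble()  # assemble the sparsity structure of the result
A.compute()   # compute the values of the result

# ... after the operands' values (but not sparsity structures) change,
# the already-compiled kernel can simply be re-run:
A.compute()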
+If you define a computation and then access the result without first manually
+invoking evaluate
or compile
/assemble
/compute
, TACO will automatically
+invoke the computation immediately before the result is accessed. In the
+following example, for instance, TACO will automatically generate code to
+compute the vector addition and then also actually perform the computation
+right before a[0]
is printed:
a[i] = b[i] + c[i]
+print(a[0])
+
+ pytaco.tensor
objects, which represent mathematical tensors, form the core of
the TACO Python library. You can declare a new tensor by specifying the
+sizes of each dimension, the format
+that will be used to store the tensor, and the
+datatype of the tensor's nonzero
+elements:
# Import the TACO Python library
+import pytaco as pt
+from pytaco import dense, compressed
+
+# Declare a new tensor of double-precision floats with dimensions
+# 512 x 64 x 2048, stored as a dense-sparse-sparse tensor
+A = pt.tensor([512, 64, 2048], pt.format([dense, compressed, compressed]), pt.float64)
+The datatype can be omitted, in which case TACO will default to using
+pt.float32
to store the tensor's nonzero elements:
# Declare the same tensor as before
+A = pt.tensor([512, 64, 2048], pt.format([dense, compressed, compressed]))
+Instead of specifying a format that is tied to the number of dimensions that a +tensor has, we can simply specify whether all dimensions are dense or sparse:
+# Declare a tensor where all dimensions are dense
+A = pt.tensor([512, 64, 2048], dense)
+
+# Declare a tensor where all dimensions are sparse
+B = pt.tensor([512, 64, 2048], compressed)
+Scalars, which correspond to tensors that have zero dimension, can be declared +and initialized with an arbitrary value as demonstrated below:
+# Declare a scalar
alpha = pt.tensor(42.0)
Conceptually, you can think of a tensor as a tree where each level (excluding the root) corresponds to a dimension of the tensor. Each path from the root to a leaf node represents the coordinates of a tensor element and its corresponding value. Which dimension of the tensor each level of the tree corresponds to is determined by the order in which tensor dimensions are stored.
+TACO uses a novel scheme that can describe different storage formats for a +tensor by specifying the order in which tensor dimensions are stored and +whether each dimension is sparse or dense. A sparse (compressed) dimension +stores only the subset of the dimension that contains non-zero values, using +index arrays that are found in the compressed sparse row (CSR) matrix format. +A dense dimension, on the other hand, conceptually stores both zeros and +non-zeros. This scheme is flexibile enough to express many commonly-used +tensor storage formats:
+import pytaco as pt
+from pytaco import dense, compressed
+
+dm = pt.format([dense, dense]) # (Row-major) dense matrix format
+csr = pt.format([dense, compressed]) # Compressed sparse row matrix format
+csc = pt.format([dense, compressed], [1, 0]) # Compressed sparse column matrix format
+dcsc = pt.format([compressed, compressed], [1, 0]) # Doubly compressed sparse column matrix format
+csf = pt.format([compressed, compressed, compressed]) # Compressed sparse fiber tensor format
+As demonstrated above, you can define a new tensor storage format by creating a
+pytaco.format
object. This requires specifying whether each tensor dimension
+is dense or sparse as well as (optionally) the order in which dimensions should
+be stored. TACO also predefines some common tensor formats (including
+pt.csr
and pt.csc
) that you can use out of the box.
Tensors can be made by using python indexing syntax. For example, one may write
+the following: You can initialize a tensor by calling its insert
method to
+add a nonzero element to the tensor. The insert
method takes two arguments:
+a list specifying the coordinates of the nonzero element to be added and the
+value to be inserted at that coordinate:
# Declare a sparse tensor
+A = pt.tensor([512, 64, 2048], compressed)
+
+# Set A(0, 1, 0) = 42.0
+A.insert([0, 1, 0], 42.0)
+If multiple elements are inserted at the same coordinates, they are summed +together:
+# Declare a sparse tensor
+A = pt.tensor([512, 64, 2048], compressed)
+
+# Set A(0, 1, 0) = 42.0 + 24.0 = 66.0
+A.insert([0, 1, 0], 42.0)
+A.insert([0, 1, 0], 24.0)
+The insert
method adds the inserted nonzero element to a temporary buffer.
+Before a tensor can actually be used in a computation though, the pack
method
+must be invoked to pack the tensor into the storage format that was specified
+when the tensor was first declared. TACO will automatically do this
+immediately before the tensor is used in a computation. You can also manually
+invoke pack
though if you need full control over when exactly that is done:
A.pack()
+You can then iterate over the nonzero elements of the tensor as follows:
+for coordinates, val in A:
+ print(val)
+Rather than manually constructing a tensor, you can load tensors directly from
+file by invoking the pytaco.read
function:
# Load a dense-sparse-sparse tensor from file "A.tns"
+A = pt.read("A.tns", pt.format([dense, compressed, compressed]))
+By default, pytaco.read
returns a tensor that has already been packed into
+the specified storage format. You can optionally pass a Boolean flag as an
+argument to indicate whether the returned tensor should be packed or not:
# Load an unpacked tensor from file "A.tns"
+A = pt.read("A.tns", format([dense, compressed, compressed]), false)
+The loaded tensor will then remain unpacked until the pack
method is manually
+invoked or a computation that uses the tensor is performed.
You can also write a tensor directly to file by invoking the pytaco.write
+function:
# Write tensor A to file "A.tns"
+pt.write("A.tns", A)
+TACO supports loading tensors from and storing tensors to the following file +formats:
+ +Tensors can also be initialized with either NumPy arrays or SciPy sparse (CSR +or CSC) matrices:
+import pytaco as pt
+import numpy as np
+import scipy.sparse
+
+# Assuming SciPy matrix is stored in CSR
+sparse_matrix = scipy.sparse.load_npz('sparse_matrix.npz')
+
+# Cast the matrix as a TACO tensor (also stored in CSR)
+taco_tensor = pt.from_sp_csr(sparse_matrix)
+
+# We can also load a NumPy array
+np_array = np.load('arr.npy')
+
+# And initialize a TACO tensor from this array
+dense_tensor = pt.from_array(np_array)
+We can also export TACO tensors to either NumPy arrays or SciPy sparse +matrices:
+# Convert the tensor to a SciPy CSR matrix
+sparse_matrix = taco_tensor.to_sp_csr()
+
+# Convert the tensor to a NumPy array
+np_array = dense_tensor.to_array()
+
+ The scheduling language enables users to specify and compose transformations to +further optimize the code generated by TACO.
+Consider the following SpMV computation and associated code, which we will +transform below: +
Format csr({Dense,Sparse});
+Tensor<double> A("A", {512, 64}, csr);
+Tensor<double> x("x", {64}, {Dense});
+Tensor<double> y("y", {512}, {Dense});
+
+IndexVar i("i"), j("j");
+Access matrix = A(i, j);
+y(i) = matrix * x(j);
+IndexStmt stmt = y.getAssignment().concretize();
+for (int32_t i = 0; i < A1_dimension; i++) {
+ for (int32_t jA = A2_pos[i]; jA < A2_pos[(i + 1)]; jA++) {
+ int32_t j = A2_crd[jA];
+ y_vals[i] = y_vals[i] + A_vals[jA] * x_vals[j];
+ }
+}
+The pos(i, ipos, access)
transformation takes in an index variable i
that
+iterates over the coordinate space of access
and replaces it with a derived
+index variable ipos
that iterates over the same iteration range, but with
+respect to the the position space.
Since the pos
transformation is not valid for dense level formats, for the
+SpMV example, the following would result in an error:
+
stmt = stmt.pos(i, IndexVar("ipos"), matrix);
+We could instead have: +
stmt = stmt.pos(j, IndexVar("jpos"), matrix);
+for (int32_t i = 0; i < A1_dimension; i++) {
+ for (int32_t jposA = A2_pos[i]; jposA < A2_pos[(i + 1)]; jposA++) {
+ if (jposA < A2_pos[i] || jposA >= A2_pos[(i + 1)])
+ continue;
+
+ int32_t j = A2_crd[jposA];
+ y_vals[i] = y_vals[i] + A_vals[jposA] * x_vals[j];
+ }
+}
+The fuse(i, j, f)
transformation takes in two index variables i
and j
,
+where j
is directly nested under i
, and collapses them into a fused index
+variable f
that iterates over the product of the coordinates i
and j
.
fuse
helps facilitate other transformations, such as iterating over the
+position space of several index variables, as in this SpMV example:
+
IndexVar f("f");
+stmt = stmt.fuse(i, j, f);
+stmt = stmt.pos(f, IndexVar("fpos"), matrix);
+for (int32_t fposA = 0; fposA < A2_pos[A1_dimension]; fposA++) {
+ if (fposA >= A2_pos[A1_dimension])
+ continue;
+
+ int32_t f = A2_crd[fposA];
+ while (fposA == A2_pos[(i_pos + 1)]) {
+ i_pos++;
+ i = i_pos;
+ }
+ y_vals[i] = y_vals[i] + A_vals[fposA] * x_vals[f];
+}
+The split(i, i0, i1, splitFactor)
transformation splits (strip-mines) an
+index variable i
into two nested index variables i0
and i1
. The size of
+the inner index variable i1
is then held constant at splitFactor
, which
+must be a positive integer.
For the SpMV example, we could have: +
stmt = stmt.split(i, IndexVar("i0"), IndexVar("i1"), 16);
+for (int32_t i0 = 0; i0 < ((A1_dimension + 15) / 16); i0++) {
+ for (int32_t i1 = 0; i1 < 16; i1++) {
+ int32_t i = i0 * 16 + i1;
+ if (i >= A1_dimension)
+ continue;
+
+ for (int32_t jA = A2_pos[i]; jA < A2_pos[(i + 1)]; jA++) {
+ int32_t j = A2_crd[jA];
+ y_vals[i] = y_vals[i] + A_vals[jA] * x_vals[j];
+ }
+ }
+}
+
transformation, which is described in
+more detail here, leverages
+scratchpad memories and reorders computations to increase cache locality.
Given a subexpression expr
to precompute, an index variable i
to precompute
+over, and an index variable iw
(which can be the same or different as i
) to
+precompute with, the precomputed results are stored in the tensor variable
+workspace
.
For the SpMV example, if rhs
is the right hand side of the original
+statement, we could have:
+
TensorVar workspace("workspace", Type(Float64, {Dimension(64)}), taco::dense);
+stmt = stmt.precompute(rhs, j, j, workspace);
+for (int32_t i = 0; i < A1_dimension; i++) {
+ double* restrict workspace = 0;
+ workspace = (double*)malloc(sizeof(double) * 64);
+ for (int32_t pworkspace = 0; pworkspace < 64; pworkspace++) {
+ workspace[pworkspace] = 0.0;
+ }
+ for (int32_t jA = A2_pos[i]; jA < A2_pos[(i + 1)]; jA++) {
+ int32_t j = A2_crd[jA];
+ workspace[j] = A_vals[jA] * x_vals[j];
+ }
+ for (int32_t j = 0; j < 64; j++) {
+ y_vals[i] = y_vals[i] + workspace[j];
+ }
+ free(workspace);
+ }
+The reorder(vars)
transformation takes in a new ordering for a set of index
+variables in the expression that are directly nested in the iteration order.
For the SpMV example, we could have: +
stmt = stmt.reorder({j, i});
+for (int32_t jA = A2_pos[iA]; jA < A2_pos[(iA + 1)]; jA++) {
+ int32_t j = A2_crd[jA];
+ for (int32_t i = 0; i < A1_dimension; i++) {
+ y_vals[i] = y_vals[i] + A_vals[jA] * x_vals[j];
+ }
+ }
+The bound(i, ibound, bound, bound_type)
transformation replaces an index
+variable i
with an index variable ibound
that obeys a compile-time
+constraint on its iteration space, incorporating knowledge about the size or
+structured sparsity pattern of the corresponding input. The meaning of bound
+depends on the bound_type
.
For the SpMV example, we could have +
stmt = stmt.bound(i, IndexVar("ibound"), 100, BoundType::MaxExact);
+for (int32_t ibound = 0; ibound < 100; ibound++) {
+ for (int32_t jA = A2_pos[ibound]; jA < A2_pos[(ibound + 1)]; jA++) {
+ int32_t j = A2_crd[jA];
+ y_vals[ibound] = y_vals[ibound] + A_vals[jA] * x_vals[j];
+ }
+}
+The unroll(i, unrollFactor)
transformation unrolls the loop corresponding to
+an index variable i
by unrollFactor
number of iterations, where
+unrollFactor
is a positive integer.
For the SpMV example, we could have +
stmt = stmt.split(i, i0, i1, 32);
+stmt = stmt.unroll(i0, 4);
+if ((((A1_dimension + 31) / 32) * 32 + 32) + (((A1_dimension + 31) / 32) * 32 + 32) >= A1_dimension) {
+ for (int32_t i0 = 0; i0 < ((A1_dimension + 31) / 32); i0++) {
+ for (int32_t i1 = 0; i1 < 32; i1++) {
+ int32_t i = i0 * 32 + i1;
+ if (i >= A1_dimension)
+ continue;
+
+ for (int32_t jA = A2_pos[i]; jA < A2_pos[(i + 1)]; jA++) {
+ int32_t j = A2_crd[jA];
+ y_vals[i] = y_vals[i] + A_vals[jA] * x_vals[j];
+ }
+ }
+ }
+}
+else {
+ #pragma unroll 4
+ for (int32_t i0 = 0; i0 < ((A1_dimension + 31) / 32); i0++) {
+ for (int32_t i1 = 0; i1 < 32; i1++) {
+ int32_t i = i0 * 32 + i1;
+ for (int32_t jA = A2_pos[i]; jA < A2_pos[(i + 1)]; jA++) {
+ int32_t j = A2_crd[jA];
+ y_vals[i] = y_vals[i] + A_vals[jA] * x_vals[j];
+ }
+ }
+ }
+}
+The parallelize(i, parallel_unit, output_race_strategy)
transformation tags
+an index variable i
for parallel execution on hardware type parallel_unit
.
+Data races are handled by an output_race_strategy
. Since the other
+transformations expect serial code, parallelize
must come last in a series of
+transformations.
For the SpMV example, we could have +
stmt = stmt.parallelize(i, ParallelUnit::CPUThread, OutputRaceStrategy::NoRaces);
+#pragma omp parallel for schedule(runtime)
+for (int32_t i = 0; i < A1_dimension; i++) {
+ for (int32_t jA = A2_pos[i]; jA < A2_pos[(i + 1)]; jA++) {
+ int32_t j = A2_crd[jA];
+ y_vals[i] = y_vals[i] + A_vals[jA] * x_vals[j];
+ }
+}
+
+ Sparse matrix-vector multiplication (SpMV) is a bottleneck computation in many +scientific and engineering computations. Mathematically, SpMV can be expressed +as
++ +
$y = \alpha A x + \beta z$
++ +
$y_i = \alpha \left(\sum_{j} A_{ij} \cdot x_j\right) + \beta \cdot z_i$
+// On Linux and MacOS, you can compile and run this program like so:
+// g++ -std=c++11 -O3 -DNDEBUG -DTACO -I ../../include -L../../build/lib spmv.cpp -o spmv -ltaco
+// LD_LIBRARY_PATH=../../build/lib ./spmv
+#include <random>
+#include "taco.h"
+using namespace taco;
+int main(int argc, char* argv[]) {
+ std::default_random_engine gen(0);
+ std::uniform_real_distribution<double> unif(0.0, 1.0);
+ // Predeclare the storage formats that the inputs and output will be stored as.
+ // To define a format, you must specify whether each dimension is dense or sparse
+ // and (optionally) the order in which dimensions should be stored. The formats
+ // declared below correspond to compressed sparse row (csr) and dense vector (dv).
+ Format csr({Dense,Sparse});
+ Format dv({Dense});
+
+ // Load a sparse matrix from file (stored in the Matrix Market format) and
+ // store it as a compressed sparse row matrix. Matrices correspond to order-2
+ // tensors in taco. The matrix in this example can be downloaded from:
+ // https://www.cise.ufl.edu/research/sparse/MM/Boeing/pwtk.tar.gz
+ Tensor<double> A = read("pwtk.mtx", csr);
+
+ // Generate a random dense vector and store it in the dense vector format.
+ // Vectors correspond to order-1 tensors in taco.
+ Tensor<double> x({A.getDimension(1)}, dv);
+ for (int i = 0; i < x.getDimension(0); ++i) {
+ x.insert({i}, unif(gen));
+ }
+ x.pack();
+
  // Generate another random dense vector and store it in the dense vector format.
+ Tensor<double> z({A.getDimension(0)}, dv);
+ for (int i = 0; i < z.getDimension(0); ++i) {
+ z.insert({i}, unif(gen));
+ }
+ z.pack();
+
  // Declare and initialize the scaling factors in the SpMV computation.
+ // Scalars correspond to order-0 tensors in taco.
+ Tensor<double> alpha(42.0);
+ Tensor<double> beta(33.0);
+
  // Declare the output to be a dense vector whose length matches the number
  // of rows of input matrix A, stored in the dense vector format.
+ Tensor<double> y({A.getDimension(0)}, dv);
+ // Define the SpMV computation using index notation.
+ IndexVar i, j;
+ y(i) = alpha() * (A(i,j) * x(j)) + beta() * z(i);
+ // At this point, we have defined how entries in the output vector should be
  // computed from entries in the input matrix and vectors but have not actually
+ // performed the computation yet. To do so, we must first tell taco to generate
+ // code that can be executed to compute the SpMV operation.
+ y.compile();
+ // We can now call the functions taco generated to assemble the indices of the
+ // output vector and then actually compute the SpMV.
+ y.assemble();
+ y.compute();
+ // Write the output of the computation to file (stored in the FROSTT format).
+ write("y.tns", y);
+}
+You can also use the TACO Python API to perform the same computation, as +demonstrated here:
+import pytaco as pt
+from pytaco import compressed, dense
+import numpy as np
+
+# Define formats for storing the sparse matrix and dense vectors
+csr = pt.format([dense, compressed])
+dv = pt.format([dense])
+
# Load a sparse matrix (stored in the Matrix Market format) and store it
+# as a CSR matrix. The matrix in this example can be downloaded from:
+# https://www.cise.ufl.edu/research/sparse/MM/Boeing/pwtk.tar.gz
+A = pt.read("pwtk.mtx", csr)
+
+# Generate two random vectors using NumPy and pass them into TACO
+x = pt.from_array(np.random.uniform(size=A.shape[1]))
+z = pt.from_array(np.random.uniform(size=A.shape[0]))
+
+# Declare the result to be a dense vector
+y = pt.tensor([A.shape[0]], dv)
+
+# Declare index vars
+i, j = pt.get_index_vars(2)
+
+# Define the SpMV computation
+y[i] = A[i, j] * x[j] + z[i]
+
+# Perform the SpMV computation and write the result to file
+pt.write("y.tns", y)
+When you run the above Python program, TACO will generate code under the hood +that efficiently performs the computation in one shot. This lets TACO avoid +materializing the intermediate matrix-vector product, thus reducing the amount +of memory accesses and speeding up the computation.
+ +' + escapeHtml(summary) +'
' + noResultsText + '
'); + } +} + +function doSearch () { + var query = document.getElementById('mkdocs-search-query').value; + if (query.length > min_search_length) { + if (!window.Worker) { + displayResults(search(query)); + } else { + searchWorker.postMessage({query: query}); + } + } else { + // Clear results for short queries + displayResults([]); + } +} + +function initSearch () { + var search_input = document.getElementById('mkdocs-search-query'); + if (search_input) { + search_input.addEventListener("keyup", doSearch); + } + var term = getSearchTermFromLocation(); + if (term) { + search_input.value = term; + doSearch(); + } +} + +function onWorkerMessage (e) { + if (e.data.allowSearch) { + initSearch(); + } else if (e.data.results) { + var results = e.data.results; + displayResults(results); + } else if (e.data.config) { + min_search_length = e.data.config.min_search_length-1; + } +} + +if (!window.Worker) { + console.log('Web Worker API not supported'); + // load index in main thread + $.getScript(joinUrl(base_url, "search/worker.js")).done(function () { + console.log('Loaded worker'); + init(); + window.postMessage = function (msg) { + onWorkerMessage({data: msg}); + }; + }).fail(function (jqxhr, settings, exception) { + console.error('Could not load worker.js'); + }); +} else { + // Wrap search in a web worker + var searchWorker = new Worker(joinUrl(base_url, "search/worker.js")); + searchWorker.postMessage({init: true}); + searchWorker.onmessage = onWorkerMessage; +} diff --git a/search/search_index.json b/search/search_index.json new file mode 100644 index 0000000..9147272 --- /dev/null +++ b/search/search_index.json @@ -0,0 +1 @@ +{"config":{"indexing":"full","lang":["en"],"min_search_length":3,"prebuild_index":false,"separator":"[\\s\\-]+"},"docs":[{"location":"index.html","text":"TACO is a library for performing sparse and dense linear algebra and tensor algebra computations. The computations can range from relatively simple ones like sparse matrix-vector multiplication to more complex ones like matricized tensor times Khatri-Rao product. All these computations can be performed on any mix of dense and sparse tensors. Under the hood, TACO automatically generates efficient code to perform these computations. The sidebar to the left links to documentation for the TACO C++ and Python APIs as well as some examples demonstrating how TACO can be used in real-world applications. System Requirements A C compiler that supports C99, such as GCC or Clang Support for OpenMP is also required if parallel execution is desired Python 3 with NumPy and SciPy (for the Python API) Getting Help Questions and bug reports can be submitted here .","title":"Home"},{"location":"index.html#system-requirements","text":"A C compiler that supports C99, such as GCC or Clang Support for OpenMP is also required if parallel execution is desired Python 3 with NumPy and SciPy (for the Python API)","title":"System Requirements"},{"location":"index.html#getting-help","text":"Questions and bug reports can be submitted here .","title":"Getting Help"},{"location":"benchmarking.html","text":"Guide to Benchmarking The performance of Python applications that use TACO can be measured using Python's built-in time.perf_counter function with minimal changes to the applications. 
As an example, we can benchmark the performance of the scientific computing application shown here as follows: import pytaco as pt from pytaco import compressed, dense import numpy as np import time csr = pt.format([dense, compressed]) dv = pt.format([dense]) A = pt.read(\"pwtk.mtx\", csr) x = pt.from_array(np.random.uniform(size=A.shape[1])) z = pt.from_array(np.random.uniform(size=A.shape[0])) y = pt.tensor([A.shape[0]], dv) i, j = pt.get_index_vars(2) y[i] = A[i, j] * x[j] + z[i] # Tell TACO to generate code to perform the SpMV computation y.compile() # Benchmark the actual SpMV computation start = time.perf_counter() y.compute() end = time.perf_counter() print(\"Execution time: {0} seconds\".format(end - start)) In order to accurately measure TACO's computational performance, only the time it takes to actually perform a computation should be measured. The time it takes to generate code under the hood for performing that computation should not be measured , since this overhead can be quite variable but can often be amortized in practice. By default though, TACO will only generate and compile code it needs for performing a computation immediately before it has to actually perform the computation. As the example above demonstrates, by manually calling the result tensor's compile method, we can tell TACO to generate code needed for performing the computation before benchmarking starts, letting us measure only the performance of the computation itself. Warning pytaco.evaluate and pytaco.einsum should not be used to benchmark TACO's computational performance, since timing those functions will include the time it takes to generate code for performing the computation. The time it takes to construct the initial operand tensors should also not be measured , since again this overhead can often be amortized in practice. By default, pytaco.read and functions for converting NumPy arrays and SciPy matrices to TACO tensors return fully constructed tensors. If you add nonzero elements to an operand tensor by invoking its insert method though, then pack must also be explicitly invoked before any benchmarking is done: import pytaco as pt from pytaco import compressed, dense import numpy as np import random import time csr = pt.format([dense, compressed]) dv = pt.format([dense]) A = pt.read(\"pwtk.mtx\", csr) x = pt.tensor([A.shape[1]], dv) z = pt.tensor([A.shape[0]], dv) y = pt.tensor([A.shape[0]], dv) # Insert random values into x and z and pack them into dense arrays for k in range(A.shape[1]): x.insert([k], random.random()) x.pack() for k in range(A.shape[0]): z.insert([k], random.random()) z.pack() i, j = pt.get_index_vars(2) y[i] = A[i, j] * x[j] + z[i] y.compile() start = time.perf_counter() y.compute() end = time.perf_counter() print(\"Execution time: {0} seconds\".format(end - start)) TACO avoids regenerating code for performing the same computation though as long as the computation is redefined with the same index variables and with the same operand and result tensors. Thus, if your application executes the same computation many times in a loop and if the computation is executed on sufficiently large data sets, TACO will naturally amortize the overhead associated with generating code for performing the computation. 
In such scenarios, it is acceptable to include the initial code generation overhead in the performance measurement: import pytaco as pt from pytaco import compressed, dense import numpy as np import time csr = pt.format([dense, compressed]) dv = pt.format([dense]) A = pt.read(\"pwtk.mtx\", csr) x = pt.tensor([A.shape[1]], dv) z = pt.tensor([A.shape[0]], dv) y = pt.tensor([A.shape[0]], dv) for k in range(A.shape[1]): x.insert([k], random.random()) x.pack() for k in range(A.shape[0]): z.insert([k], random.random()) z.pack() i, j = pt.get_index_vars(2) # Benchmark the iterative SpMV computation, including overhead for # generating code in the first iteration to perform the computation start = time.perf_counter() for k in range(1000): y[i] = A[i, j] * x[j] + z[i] y.evaluate() x[i] = y[i] x.evaluate() end = time.perf_counter() print(\"Execution time: {0} seconds\".format(end - start)) Warning In order to avoid regenerating code for performing a computation, the computation must be redefined with the exact same index variable objects and also with the exact same tensor objects for operands and result. In the example above, every loop iteration redefines the computation of y and x using the same tensor and index variable objects costructed outside the loop, so TACO will only generate code to compute y and x in the first iteration. If the index variables were constructed inside the loop though, TACO would regenerate code to compute y and x in every loop iteration, and the compilation overhead would not be amortized. Note As a rough rule of thumb, if a computation takes on the order of seconds or more in total to perform across all invocations with identical operands and result (and is always redefined with identical index variables), then it is acceptable to include the overhead associated with generating code for performing the computation in performance measurements.","title":"Guide to Benchmarking"},{"location":"benchmarking.html#guide-to-benchmarking","text":"The performance of Python applications that use TACO can be measured using Python's built-in time.perf_counter function with minimal changes to the applications. As an example, we can benchmark the performance of the scientific computing application shown here as follows: import pytaco as pt from pytaco import compressed, dense import numpy as np import time csr = pt.format([dense, compressed]) dv = pt.format([dense]) A = pt.read(\"pwtk.mtx\", csr) x = pt.from_array(np.random.uniform(size=A.shape[1])) z = pt.from_array(np.random.uniform(size=A.shape[0])) y = pt.tensor([A.shape[0]], dv) i, j = pt.get_index_vars(2) y[i] = A[i, j] * x[j] + z[i] # Tell TACO to generate code to perform the SpMV computation y.compile() # Benchmark the actual SpMV computation start = time.perf_counter() y.compute() end = time.perf_counter() print(\"Execution time: {0} seconds\".format(end - start)) In order to accurately measure TACO's computational performance, only the time it takes to actually perform a computation should be measured. The time it takes to generate code under the hood for performing that computation should not be measured , since this overhead can be quite variable but can often be amortized in practice. By default though, TACO will only generate and compile code it needs for performing a computation immediately before it has to actually perform the computation. 
As the example above demonstrates, by manually calling the result tensor's compile method, we can tell TACO to generate code needed for performing the computation before benchmarking starts, letting us measure only the performance of the computation itself. Warning pytaco.evaluate and pytaco.einsum should not be used to benchmark TACO's computational performance, since timing those functions will include the time it takes to generate code for performing the computation. The time it takes to construct the initial operand tensors should also not be measured , since again this overhead can often be amortized in practice. By default, pytaco.read and functions for converting NumPy arrays and SciPy matrices to TACO tensors return fully constructed tensors. If you add nonzero elements to an operand tensor by invoking its insert method though, then pack must also be explicitly invoked before any benchmarking is done: import pytaco as pt from pytaco import compressed, dense import numpy as np import random import time csr = pt.format([dense, compressed]) dv = pt.format([dense]) A = pt.read(\"pwtk.mtx\", csr) x = pt.tensor([A.shape[1]], dv) z = pt.tensor([A.shape[0]], dv) y = pt.tensor([A.shape[0]], dv) # Insert random values into x and z and pack them into dense arrays for k in range(A.shape[1]): x.insert([k], random.random()) x.pack() for k in range(A.shape[0]): z.insert([k], random.random()) z.pack() i, j = pt.get_index_vars(2) y[i] = A[i, j] * x[j] + z[i] y.compile() start = time.perf_counter() y.compute() end = time.perf_counter() print(\"Execution time: {0} seconds\".format(end - start)) TACO avoids regenerating code for performing the same computation though as long as the computation is redefined with the same index variables and with the same operand and result tensors. Thus, if your application executes the same computation many times in a loop and if the computation is executed on sufficiently large data sets, TACO will naturally amortize the overhead associated with generating code for performing the computation. In such scenarios, it is acceptable to include the initial code generation overhead in the performance measurement: import pytaco as pt from pytaco import compressed, dense import numpy as np import time csr = pt.format([dense, compressed]) dv = pt.format([dense]) A = pt.read(\"pwtk.mtx\", csr) x = pt.tensor([A.shape[1]], dv) z = pt.tensor([A.shape[0]], dv) y = pt.tensor([A.shape[0]], dv) for k in range(A.shape[1]): x.insert([k], random.random()) x.pack() for k in range(A.shape[0]): z.insert([k], random.random()) z.pack() i, j = pt.get_index_vars(2) # Benchmark the iterative SpMV computation, including overhead for # generating code in the first iteration to perform the computation start = time.perf_counter() for k in range(1000): y[i] = A[i, j] * x[j] + z[i] y.evaluate() x[i] = y[i] x.evaluate() end = time.perf_counter() print(\"Execution time: {0} seconds\".format(end - start)) Warning In order to avoid regenerating code for performing a computation, the computation must be redefined with the exact same index variable objects and also with the exact same tensor objects for operands and result. In the example above, every loop iteration redefines the computation of y and x using the same tensor and index variable objects costructed outside the loop, so TACO will only generate code to compute y and x in the first iteration. 
If the index variables were constructed inside the loop though, TACO would regenerate code to compute y and x in every loop iteration, and the compilation overhead would not be amortized. Note As a rough rule of thumb, if a computation takes on the order of seconds or more in total to perform across all invocations with identical operands and result (and is always redefined with identical index variables), then it is acceptable to include the overhead associated with generating code for performing the computation in performance measurements.","title":"Guide to Benchmarking"},{"location":"computations.html","text":"Computing on Tensors Specifying Tensor Algebra Computations Tensor algebra computations can be expressed in TACO with tensor index notation, which at a high level describes how each element in the output tensor can be computed from elements in the input tensors. As an example, matrix addition can be expressed in index notation as A(i,j) = B(i,j) + C(i,j) where A , B , and C denote order-2 tensors (i.e. matrices) while i and j are index variables that represent abstract indices into the corresponding dimensions of the tensors. In words, the example above essentially states that, for every i and j , the element in the i -th row and j -th column of the A should be assigned the sum of the corresponding elements in B and C . Similarly, element-wise multiplication of three order-3 tensors can be expressed in index notation as follows A(i,j,k) = B(i,j,k) * C(i,j,k) * D(i,j,k) The syntax shown above corresponds to exactly what you would have to write in C++ with TACO to define tensor algebra computations. Note, however, that prior to defining a tensor algebra computation, all index variables have to be declared. This can be done as shown below: IndexVar i, j, k; // Declare index variables for previous example Expressing Reductions In both of the previous examples, all of the index variables are used to index into both the output and the inputs. However, it is possible for an index variable to be used to index into the inputs only, in which case the index variable is reduced (summed) over. For instance, the following example y(i) = A(i,j) * x(j) can be rewritten with the summation more explicit as y(i) = \\sum_{j} A(i,j) \\cdot x(j) and demonstrates how matrix-vector multiplication can be expressed in index notation. Note that, in TACO, reductions are assumed to be over the smallest subexpression that captures all uses of the corresponding reduction variable. For instance, the following computation y(i) = A(i,j) * x(j) + z(i) can be rewritten with the summation more explicit as y(i) = \\big(\\sum_{j} A(i,j) \\cdot x(j)\\big) + z(i), whereas the following computation y(i) = A(i,j) * x(j) + z(j) can be rewritten with the summation more explicit as y(i) = \\sum_{j} \\big(A(i,j) \\cdot x(j) + z(i)\\big). Performing the Computation Once a tensor algebra computation has been defined (and all of the inputs have been initialized ), you can simply invoke the output tensor's evaluate method to perform the actual computation: A.evaluate(); // Perform the computation defined previously for output tensor A Under the hood, when you invoke the evaluate method, TACO first invokes the output tensor's compile method to generate kernels that assembles the output indices (if the tensor contains any sparse dimensions) and that performs the actual computation. TACO would then call the two generated kernels by invoking the output tensor's assemble and compute methods. 
You can manually invoke these methods instead of calling evaluate as demonstrated below: A.compile(); // Generate output assembly and compute kernels A.assemble(); // Invoke the output assembly kernel to assemble the output indices A.compute(); // Invoke the compute kernel to perform the actual computation This can be useful if you want to perform the same computation multiple times, in which case it suffices to invoke compile once before the first time the computation is performed. Lazy Execution It is also possible to compute on tensors without having to explicitly invoke compile , assemble , or compute . Once you attempt to modify or view the output of a computation, TACO would automatically invoke those methods if necessary in order to compute the values in the output tensor. If the input to a computation is itself the output of another computation, then TACO would also automatically ensure that the latter computation is fully executed first.","title":"Computing on Tensors"},{"location":"computations.html#computing-on-tensors","text":"","title":"Computing on Tensors"},{"location":"computations.html#specifying-tensor-algebra-computations","text":"Tensor algebra computations can be expressed in TACO with tensor index notation, which at a high level describes how each element in the output tensor can be computed from elements in the input tensors. As an example, matrix addition can be expressed in index notation as A(i,j) = B(i,j) + C(i,j) where A , B , and C denote order-2 tensors (i.e. matrices) while i and j are index variables that represent abstract indices into the corresponding dimensions of the tensors. In words, the example above essentially states that, for every i and j , the element in the i -th row and j -th column of the A should be assigned the sum of the corresponding elements in B and C . Similarly, element-wise multiplication of three order-3 tensors can be expressed in index notation as follows A(i,j,k) = B(i,j,k) * C(i,j,k) * D(i,j,k) The syntax shown above corresponds to exactly what you would have to write in C++ with TACO to define tensor algebra computations. Note, however, that prior to defining a tensor algebra computation, all index variables have to be declared. This can be done as shown below: IndexVar i, j, k; // Declare index variables for previous example","title":"Specifying Tensor Algebra Computations"},{"location":"computations.html#expressing-reductions","text":"In both of the previous examples, all of the index variables are used to index into both the output and the inputs. However, it is possible for an index variable to be used to index into the inputs only, in which case the index variable is reduced (summed) over. For instance, the following example y(i) = A(i,j) * x(j) can be rewritten with the summation more explicit as y(i) = \\sum_{j} A(i,j) \\cdot x(j) and demonstrates how matrix-vector multiplication can be expressed in index notation. Note that, in TACO, reductions are assumed to be over the smallest subexpression that captures all uses of the corresponding reduction variable. 
For instance, the following computation y(i) = A(i,j) * x(j) + z(i) can be rewritten with the summation more explicit as y(i) = \\big(\\sum_{j} A(i,j) \\cdot x(j)\\big) + z(i), whereas the following computation y(i) = A(i,j) * x(j) + z(j) can be rewritten with the summation more explicit as y(i) = \\sum_{j} \\big(A(i,j) \\cdot x(j) + z(i)\\big).","title":"Expressing Reductions"},{"location":"computations.html#performing-the-computation","text":"Once a tensor algebra computation has been defined (and all of the inputs have been initialized ), you can simply invoke the output tensor's evaluate method to perform the actual computation: A.evaluate(); // Perform the computation defined previously for output tensor A Under the hood, when you invoke the evaluate method, TACO first invokes the output tensor's compile method to generate kernels that assembles the output indices (if the tensor contains any sparse dimensions) and that performs the actual computation. TACO would then call the two generated kernels by invoking the output tensor's assemble and compute methods. You can manually invoke these methods instead of calling evaluate as demonstrated below: A.compile(); // Generate output assembly and compute kernels A.assemble(); // Invoke the output assembly kernel to assemble the output indices A.compute(); // Invoke the compute kernel to perform the actual computation This can be useful if you want to perform the same computation multiple times, in which case it suffices to invoke compile once before the first time the computation is performed.","title":"Performing the Computation"},{"location":"computations.html#lazy-execution","text":"It is also possible to compute on tensors without having to explicitly invoke compile , assemble , or compute . Once you attempt to modify or view the output of a computation, TACO would automatically invoke those methods if necessary in order to compute the values in the output tensor. If the input to a computation is itself the output of another computation, then TACO would also automatically ensure that the latter computation is fully executed first.","title":"Lazy Execution"},{"location":"controlling_memory.html","text":"Controlling Memory When using the TACO C++ library, the typical usage is to declare your input taco::Tensor structures, then add data to these structures using the insert method. This is wasteful if the data is already loaded into memory in a compatible format; TACO can use this data directly without copying it. Below are some usage examples for common situations where a user may want to do this. CSR Matrix A two-dimensional CSR matrix can be created using three arrays: rowptr (array of int ): list of indices in colidx representing starts of rows colidx (array of int ): list of column indices of non-zero values vals (array of T for Tensortaco::Tensor
taco::Tensor objects, which correspond to mathematical tensors, form the core of the TACO C++ API. You can declare a new tensor by specifying its name, a vector containing the size of each dimension of the tensor, and the storage format that will be used to store the tensor:
// Declare a new tensor "A" of double-precision floats with dimensions
// 512 x 64 x 2048, stored as a dense-sparse-sparse tensor
Tensor<double> A("A", {512,64,2048}, Format({Dense,Sparse,Sparse}));
The name of the tensor can be omitted, in which case TACO will assign an arbitrary name to the tensor:

// Declare another tensor with the same dimensions and storage format as before
Tensor<double> A({512,64,2048}, Format({Dense,Sparse,Sparse}));
Scalars, which are treated as order-0 tensors, can be declared and initialized with an arbitrary value as demonstrated below:

Tensor<double> alpha(42.0); // Declare a scalar tensor initialized to 42.0
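A scalar tensor can then be used in index expressions by accessing it with an empty index list; a minimal sketch (the vector tensors x and y are hypothetical and assumed to be declared and initialized elsewhere):

// Scale a vector: y(i) = alpha * x(i), where alpha is the order-0 tensor above
IndexVar i;
y(i) = alpha() * x(i);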
Defining Tensor Formats

Conceptually, you can think of a tensor as a tree with each level (excluding the root) corresponding to a dimension of the tensor. Each path from the root to a leaf node represents a tensor coordinate and its corresponding value. Which dimension each level of the tree corresponds to is determined by the order in which the dimensions of the tensor are stored.
TACO uses a novel scheme that can describe different storage formats for any tensor by specifying the order in which tensor dimensions are stored and whether each dimension is sparse or dense. A sparse dimension stores only the subset of the dimension that contains non-zero values and is conceptually similar to the index arrays used in the compressed sparse row (CSR) matrix format, while a dense dimension stores both zeros and non-zeros. As demonstrated below, this scheme is flexible enough to express many commonly used matrix storage formats.
You can define a new tensor storage format by creating a taco::Format object. The constructor for taco::Format takes as arguments a vector specifying the type of each dimension and (optionally) a vector specifying the order in which dimensions are to be stored, following the above scheme:
Format dm({Dense,Dense});            // (Row-major) dense matrix
Format csr({Dense,Sparse});          // Compressed sparse row matrix
Format csc({Dense,Sparse}, {1,0});   // Compressed sparse column matrix
Format dcsc({Sparse,Sparse}, {1,0}); // Doubly compressed sparse column matrix
Alternatively, you can define a tensor format that contains only sparse or dense dimensions as follows:

Format csf(Sparse); // Compressed sparse fiber tensor
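As a brief illustration (the tensor name and dimensions here are made up, and this assumes such a single-mode-type format can be applied to a tensor of any order, as the preceding paragraph suggests):

// Declare an order-3 tensor stored in the all-sparse CSF format defined above
Tensor<double> B("B", {1024,1024,1024}, csf);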
You can initialize a taco::Tensor by calling the insert method to add a non-zero component to the tensor. The insert method takes two arguments: a vector specifying the coordinate of the non-zero component to be added and the value to be inserted at that coordinate:
A.insert({128,32,1024}, 42.0); // A(128,32,1024) = 42.0
The insert method adds the inserted non-zeros to a temporary buffer. Before a tensor can actually be used in a computation, though, you must invoke the pack method to compress the tensor into the storage format that was specified when the tensor was first declared:
A.pack(); // Construct dense-sparse-sparse tensor containing inserted non-zeros
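Putting the pieces together, the typical declare-insert-pack flow looks as follows (a minimal sketch; the matrix name, dimensions, and values are illustrative):

// Declare a sparse matrix in CSR format
Tensor<double> M("M", {4,4}, Format({Dense,Sparse}));

// Buffer a few non-zero entries...
M.insert({0,1}, 2.0);
M.insert({2,3}, 5.0);
M.insert({3,0}, 1.5);

// ...then compress them into the CSR storage declared above; the tensor
// must be packed before it can be used in a computation
M.pack();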
Rather than manually invoking insert and pack to initialize a tensor, you can load tensors directly from a file by calling taco::read, as demonstrated below:
// Load a dense-sparse-sparse tensor from file A.tns
+A = read("A.tns", Format({Dense, Sparse, Sparse}));
By default, taco::read returns a packed tensor. You can optionally pass a Boolean flag as an argument to indicate whether the returned tensor should be packed or not:
// Load an unpacked tensor from file A.tns
+A = read("A.tns", Format({Dense, Sparse, Sparse}), false);
Currently, TACO supports loading from the following matrix and tensor file formats:

- The Matrix Market (Coordinate) format (.mtx)
- The Rutherford-Boeing format (.rb)
- The FROSTT format (.tns)

You can also write a (packed) tensor directly to file by calling taco::write, as demonstrated below:
write("A.tns", A); // Write tensor A to file A.tns
taco::write supports the same set of matrix and tensor file formats as taco::read.
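Since taco::read and taco::write understand the same formats, one convenient use is converting a tensor from one file format to another; a sketch (the file names are hypothetical):

// Read a matrix stored in the Matrix Market format...
Tensor<double> B = read("B.mtx", Format({Dense,Sparse}));

// ...and write it back out in the FROSTT format
write("B.tns", B);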
The linked Jupyter notebooks provide an interactive introduction to the TACO Python library, including how to initialize tensors, define mode formats, and perform computations. There are three notebooks, which differ mainly in the final extended example: SpMV (useful for scientific computing), SDDMM (machine learning), and MTTKRP (data analytics).
These notebooks are hosted online and do not require any installation; i.e., they can be run without having TACO, Jupyter, or even Python installed locally. However, they may take a minute or two to build.
If, on the other hand, you would like to run the notebooks on your computer, please do the following:
+