The DBM is a drop-in replacement for DBCSR written in C. For the time being, only the features required by DBT are implemented.
The DBM uses coordinate lists as its internal storage format. An existing block is represented by the following data structure:
typedef struct {
  int row; // zero based
  int col;
  int offset;
  float norm; // negative value means invalid
} dbm_block_t;
The norm is cached for performance reasons. A negative value indicates that the norm is invalid and needs to be recomputed.
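As an illustration, lazy recomputation of the cached norm could look like the following sketch; the helper name is hypothetical, and treating the norm as the squared Frobenius norm of the block is an assumption:

// Hypothetical helper: returns the cached norm, recomputing it on demand.
// Assumes the norm is the squared Frobenius norm of the block's data.
static float get_block_norm(dbm_block_t *block, const double *shard_data,
                            const int block_size) {
  if (block->norm < 0.0f) { // negative value marks the cache as invalid
    double sum = 0.0;
    for (int i = 0; i < block_size; i++) {
      const double x = shard_data[block->offset + i];
      sum += x * x;
    }
    block->norm = (float)sum;
  }
  return block->norm;
}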
To allow for efficient OpenMP parallelism, the blocks are sharded via round-robin:
const int ishard = row % matrix->nshards;
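Because a block's shard is a pure function of its row, each shard can be owned by a single OpenMP thread and modified without locks. A minimal sketch of this pattern, where the shards array and process_shard are hypothetical names:

#pragma omp parallel for schedule(dynamic)
for (int ishard = 0; ishard < matrix->nshards; ishard++) {
  // Each shard is visited by exactly one thread, so no locking is needed.
  process_shard(&matrix->shards[ishard]); // hypothetical per-shard work
}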
The communication scheme in dbm_multiply_comm.c is decoupled from the local multiplication in dbm_multiply.c via the iterator pattern:
while (dbm_comm_iterator_next(iter, &pack_a, &pack_b)) {
  backend_upload_packs(pack_a, pack_b, ctx);
  multiply_packs(transa, transb, alpha, pack_a, pack_b, matrix_a, matrix_b,
                 matrix_c, rows_left_max_eps, flop, ctx);
}
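Presumably a matching pair of start/stop calls brackets this loop; the sketch below infers their names from the loop above and is not verified against the actual sources:

dbm_pack_t *pack_a = NULL, *pack_b = NULL;
dbm_comm_iterator_t *iter = // assumed constructor, by analogy with _next
    dbm_comm_iterator_start(transa, transb, matrix_a, matrix_b, matrix_c);
while (dbm_comm_iterator_next(iter, &pack_a, &pack_b)) {
  // ... upload packs and multiply them, as shown above ...
}
dbm_comm_iterator_stop(iter); // assumed cleanup counterpart

Either way, the iterator hides how and when packs are transferred, which keeps the local multiplication free of any direct communication calls.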
The last stage of the multiplication is handled by backends for specific hardware, e.g. CPU and CUDA. They are passed batches of tasks for processing. Each task describes a single block multiplication. The simplest backend implementation looks like this:
for (int itask = 0; itask < ntasks; itask++) {
  const dbm_task_t task = batch[itask];
  const int lda = (transa) ? task.k : task.m;
  const int ldb = (transb) ? task.n : task.k;
  const int ldc = task.m;
  const double *data_a = &pack_a->data[task.offset_a];
  const double *data_b = &pack_b->data[task.offset_b];
  double *data_c = &shard_c->data[task.offset_c];
  dgemm(transa, transb, task.m, task.n, task.k, alpha, data_a, lda, data_b,
        ldb, 1.0, data_c, ldc);
}
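For reference, each task bundles the dimensions of one small GEMM with offsets into the pack and shard buffers. A plausible definition of dbm_task_t, inferred from the fields used above (the actual definition may differ):

typedef struct {
  int m, n, k;  // dimensions of the block product C += alpha * A * B
  int offset_a; // offset of block A within pack_a->data
  int offset_b; // offset of block B within pack_b->data
  int offset_c; // offset of block C within shard_c->data
} dbm_task_t;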
The dbm_miniapp.x binary can be used to run a simple smoke test:
$ cd cp2k/src/dbm
$ make -j
$ OMP_NUM_THREADS=2 ./dbm_miniapp.x
MPI ranks: 1
OpenMP threads: 2
reserve blocks: 0.047 seconds
matrix multiply: 0.001 s, 2.1 MFLOP/s
done :-)