The DBM is a drop-in replacement for DBCSR written in C. For the time being, only the features required by DBT are implemented.
The DBM uses coordinate lists as its internal storage format. An existing block is represented by the following data structure:
typedef struct {
  int row; // zero based
  int col;
  int offset;
  float norm; // negative value means invalid
} dbm_block_t;
The norm is cached for performance reasons. A negative value indicates that the norm is invalid and needs to be recomputed.
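As an illustration, lazy recomputation of the cached norm could look like the following sketch; the helper name is hypothetical, and treating the norm as the squared Frobenius norm of the block is an assumption:

// Hypothetical helper: returns the cached norm, recomputing it on demand.
// Assumes the norm is the squared Frobenius norm of the block's data.
static float get_block_norm(dbm_block_t *block, const double *shard_data,
                            const int block_size) {
  if (block->norm < 0.0f) { // negative value marks the cache as invalid
    double sum = 0.0;
    for (int i = 0; i < block_size; i++) {
      const double x = shard_data[block->offset + i];
      sum += x * x;
    }
    block->norm = (float)sum;
  }
  return block->norm;
}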
To allow for efficient OpenMP parallelism, the blocks are sharded via round-robin:
const int ishard = row % matrix->nshards;
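Because a block's shard is a pure function of its row, each shard can be owned by a single OpenMP thread and modified without locks. A minimal sketch of this pattern, where the shards array and process_shard are hypothetical names:

#pragma omp parallel for schedule(dynamic)
for (int ishard = 0; ishard < matrix->nshards; ishard++) {
  // Each shard is visited by exactly one thread, so no locking is needed.
  process_shard(&matrix->shards[ishard]); // hypothetical per-shard work
}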
The communication scheme in dbm_multiply_comm.c is decoupled from the local multiplication in dbm_multiply.c via the iterator pattern:
while (dbm_comm_iterator_next(iter, &pack_a, &pack_b)) {
  backend_upload_packs(pack_a, pack_b, ctx);
  multiply_packs(transa, transb, alpha, pack_a, pack_b, matrix_a, matrix_b,
                 matrix_c, rows_left_max_eps, flop, ctx);
}
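Presumably a matching pair of start/stop calls brackets this loop; the sketch below infers their names from the loop above and is not verified against the actual sources:

dbm_pack_t *pack_a = NULL, *pack_b = NULL;
dbm_comm_iterator_t *iter = // assumed constructor, by analogy with _next
    dbm_comm_iterator_start(transa, transb, matrix_a, matrix_b, matrix_c);
while (dbm_comm_iterator_next(iter, &pack_a, &pack_b)) {
  // ... upload packs and multiply them, as shown above ...
}
dbm_comm_iterator_stop(iter); // assumed cleanup counterpart

Either way, the iterator hides how and when packs are transferred, which keeps the local multiplication free of any direct communication calls.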
The last stage of the multiplication is handled by backends for specific hardware, e.g. CPU and CUDA. They are passed batches of tasks for processing. Each task describes a single block multiplication. The simplest backend implementation looks like this:
for (int itask = 0; itask < ntasks; itask++) {
  const dbm_task_t task = batch[itask];
  const int lda = (transa) ? task.k : task.m;
  const int ldb = (transb) ? task.n : task.k;
  const int ldc = task.m;
  const double *data_a = &pack_a->data[task.offset_a];
  const double *data_b = &pack_b->data[task.offset_b];
  double *data_c = &shard_c->data[task.offset_c];
  dgemm(transa, transb, task.m, task.n, task.k, alpha, data_a, lda, data_b,
        ldb, 1.0, data_c, ldc);
}
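For reference, each task bundles the dimensions of one small GEMM with offsets into the pack and shard buffers. A plausible definition of dbm_task_t, inferred from the fields used above (the actual definition may differ):

typedef struct {
  int m, n, k;  // dimensions of the block product C += alpha * A * B
  int offset_a; // offset of block A within pack_a->data
  int offset_b; // offset of block B within pack_b->data
  int offset_c; // offset of block C within shard_c->data
} dbm_task_t;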
The dbm_miniapp.x binary can be used to run a simple smoke test:
$ cd cp2k/src/dbm
$ make -j
$ OMP_NUM_THREADS=2 ./dbm_miniapp.x
MPI ranks: 1
OpenMP threads: 2
reserve blocks: 0.047 seconds
matrix multiply: 0.001 s, 2.1 MFLOP/s
done :-)