
Improve tensor operations performance #2221

Open
CAHEK7 opened this issue Jun 23, 2023 · 3 comments
CAHEK7 commented Jun 23, 2023

While investigating #2208 I found that most of the kernels for the tensor operations OP(α1 * A, α2 * B) + β * C are highly inefficient and overengineered.

Even the simplest 1D case, which is either a linear element-wise operation over three equal-sized tensors, or one where B is a scalar constant, is implemented with multiple nested loops and extra index computations. It could pass for a persistent-thread model, but it is not one: the number of workgroups and the work per workgroup are calculated in a strange way, and the GPU is heavily underloaded.

__kernel void Op1dTensorGeneric(...)
{
    int gid = get_group_id(0);

    global MIOPEN_TYPE* a_off = a + Aoffset;
    global MIOPEN_TYPE* b_off = b + Boffset;
    global MIOPEN_TYPE* c_off = c + Coffset;

    // num_wg: the number of workgroups should be launched
    // MAX_NUM_WG: the maximum number of workgroups actually launched
    for(; gid < num_wg; gid += MAX_NUM_WG)
    {
        int lid             = get_local_id(0);
        int o_n_gid_off     = gid % b_n;
        int bindex          = o_n_gid_off;
        MIOPEN_TYPE operand = b_off[bindex] * alpha1;
        while(lid < work_per_wg)
        {
            int o_n    = (bitmap & (1 << 0)) ? o_n_gid_off : lid % c_n;
            c_off[o_n] = MIOPEN_TENSOR_OP(a_off[o_n] * alpha0, operand) + beta * c_off[o_n];
            lid += get_local_size(0);
        }
    }
}

Simplifying this code would make it much clearer and more maintainable, and even 8-10 times faster.
The same problem exists for the 2D, 3D, 4D and 5D operations, though some of the 2D and 3D cases have "specialized" kernels that were invented as workarounds for the slow generic ones.
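For illustration, here is what the 1D generic case above ultimately computes, reduced to a single flat loop. This is a hedged host-side C sketch, not MIOpen code; the function name is hypothetical, and MIOPEN_TENSOR_OP is instantiated as addition purely for the example:

```c
#include <stddef.h>

/* Host-side reference for the 1D tensor op
 *   C[i] = OP(alpha0 * A[i], alpha1 * B[i]) + beta * C[i]
 * with OP taken to be addition for this illustration.
 * A GPU version would simply map i onto the global work-item id
 * (a plain grid-stride loop), with no nested index arithmetic. */
static void op1d_ref(const float *a, const float *b, float *c, size_t n,
                     float alpha0, float alpha1, float beta)
{
    for (size_t i = 0; i < n; ++i)
        c[i] = (alpha0 * a[i] + alpha1 * b[i]) + beta * c[i];
}
```

Nothing in this case requires per-workgroup modular arithmetic or an inner `while` over `work_per_wg`; one work-item per element (or a grid-stride loop when the tensor is larger than the launch) covers it.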

Also, the 2D, 3D, 4D and 5D kernels do not support a stride in the last dimension, which contradicts the external API and breaks support for transposed tensors. Performance tuning of transposed operations should be a separate task, but for correctness they must at least work.
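To make the last point concrete, here is a hedged C sketch (hypothetical helper, not MIOpen API) of the indexing that a strided innermost dimension requires; the current kernels effectively hard-code the innermost stride to 1:

```c
#include <stddef.h>

/* Offset of element (i, j) in a 2D tensor with arbitrary strides.
 * A row-major 3x4 tensor has strides (4, 1); its transposed 4x3
 * view has strides (1, 4), i.e. a non-unit innermost stride, which
 * is exactly the case the generic kernels currently mishandle. */
static size_t offset2d(size_t i, size_t j, size_t stride0, size_t stride1)
{
    return i * stride0 + j * stride1;
}
```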

I would really appreciate any help with testing and performance regression analysis; I want to discuss performance metrics, test cases and performance goals for this task.


CAHEK7 commented Sep 25, 2023

This has become more critical, since these kernels are used in the RNN layer.
Potentially they can also be used in the backward attention layer.


CAHEK7 commented Jan 11, 2024

@junliume @JehandadKhan could you assign the proper urgency and importance to this issue and plan it accordingly?

This improvement has great potential, but I'm not sure about its direct impact and immediate value. What I do know is that it is exposed through the external API, so it may be used by someone (or ignored because it is very slow), and it is used internally for RNN (with a few workarounds to cover the slowness of the generic case).

@junliume

@JehandadKhan @zjing14 let's discuss this week; I think it will involve both kernel and host work.
