
Improve tensor operations performance #2221

Open
CAHEK7 opened this issue Jun 23, 2023 · 3 comments
CAHEK7 commented Jun 23, 2023

While investigating #2208 I found that most of the kernels for the tensor operations OP(α1 * A, α2 * B) + β * C are highly inefficient and overengineered.

Even the simplest 1D case, which is either a linear element-wise operation over three equal-sized tensors, or one where B is a scalar constant, is implemented with multiple nested loops and extra index computations. It could pass for a persistent-thread model, but it is not one: the number of workgroups and the work per workgroup are calculated in a strange way, and the GPU is heavily underloaded.

__kernel void Op1dTensorGeneric(...)
{
    int gid = get_group_id(0);

    global MIOPEN_TYPE* a_off = a + Aoffset;
    global MIOPEN_TYPE* b_off = b + Boffset;
    global MIOPEN_TYPE* c_off = c + Coffset;

    // num_wg: the number of workgroups should be launched
    // MAX_NUM_WG: the maximum number of workgroups actually launched
    for(; gid < num_wg; gid += MAX_NUM_WG)
    {
        int lid             = get_local_id(0);
        int o_n_gid_off     = gid % b_n;
        int bindex          = o_n_gid_off;
        MIOPEN_TYPE operand = b_off[bindex] * alpha1;
        while(lid < work_per_wg)
        {
            int o_n    = (bitmap & (1 << 0)) ? o_n_gid_off : lid % c_n;
            c_off[o_n] = MIOPEN_TENSOR_OP(a_off[o_n] * alpha0, operand) + beta * c_off[o_n];
            lid += get_local_size(0);
        }
    }
}

Simplifying this code would make it much clearer and more maintainable, and even 8-10 times faster.
The same problem exists for the 2D, 3D, 4D and 5D operations, though some of the 2D and 3D cases have "specialized" kernels that were invented as workarounds for the slow generic ones.
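For illustration, here is what the 1D generic case above ultimately computes, reduced to a single flat loop. This is a hedged host-side C sketch, not MIOpen code; the function name is hypothetical, and MIOPEN_TENSOR_OP is instantiated as addition purely for the example:

```c
#include <stddef.h>

/* Host-side reference for the 1D tensor op
 *   C[i] = OP(alpha0 * A[i], alpha1 * B[i]) + beta * C[i]
 * with OP taken to be addition for this illustration.
 * A GPU version would simply map i onto the global work-item id
 * (a plain grid-stride loop), with no nested index arithmetic. */
static void op1d_ref(const float *a, const float *b, float *c, size_t n,
                     float alpha0, float alpha1, float beta)
{
    for (size_t i = 0; i < n; ++i)
        c[i] = (alpha0 * a[i] + alpha1 * b[i]) + beta * c[i];
}
```

Nothing in this case requires per-workgroup modular arithmetic or an inner `while` over `work_per_wg`; one work-item per element (or a grid-stride loop when the tensor is larger than the launch) covers it.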

Also, the 2D, 3D, 4D and 5D kernels do not support a stride in the last dimension, which contradicts the external API and breaks support for transposed tensors. Performance tuning of transposed operations should be a separate task, but for correctness they must at least work.
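To make the last point concrete, here is a hedged C sketch (hypothetical helper, not MIOpen API) of the indexing that a strided innermost dimension requires; the current kernels effectively hard-code the innermost stride to 1:

```c
#include <stddef.h>

/* Offset of element (i, j) in a 2D tensor with arbitrary strides.
 * A row-major 3x4 tensor has strides (4, 1); its transposed 4x3
 * view has strides (1, 4), i.e. a non-unit innermost stride, which
 * is exactly the case the generic kernels currently mishandle. */
static size_t offset2d(size_t i, size_t j, size_t stride0, size_t stride1)
{
    return i * stride0 + j * stride1;
}
```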

I would really appreciate any help with testing and performance regression analysis; I want to discuss performance metrics, test cases and performance goals for this task.


CAHEK7 commented Sep 25, 2023

This has become more critical, since these kernels are used in the RNN layer.
Potentially they can also be used in the backward attention layer.


CAHEK7 commented Jan 11, 2024

@junliume @JehandadKhan could you assign the proper urgency and importance to this issue and plan it accordingly?

This improvement has great potential, but I'm not sure about its direct impact and immediate value. What I do know is that it is exposed through the external API, so it may be used by someone (or ignored because it is very slow), and it is used internally for RNN (with a few workarounds to cover the slowness of the generic case).

@junliume

@JehandadKhan @zjing14 let's discuss this week; I think it will involve both kernel and host work.
