During the investigation of #2208 I found out that most of the kernels for tensor operations OP(α1 * A, α2 * B) + β * C are highly inefficient and overengineered.
Even the simplest 1D case, which is either an element-wise operation over three equal-length tensors or the same with B being a scalar constant, is implemented as multiple nested loops with extra index computations and so on. It could have been a simple persistent-thread model, but instead the number of workgroups and the work per workgroup are calculated in a strange way and the GPU is heavily underloaded. The current generic 1D kernel looks like this:
__kernel void Op1dTensorGeneric(...)
{
    int gid = get_group_id(0);

    global MIOPEN_TYPE* a_off = a + Aoffset;
    global MIOPEN_TYPE* b_off = b + Boffset;
    global MIOPEN_TYPE* c_off = c + Coffset;

    // num_wg: the number of workgroups that should be launched
    // MAX_NUM_WG: the maximum number of workgroups actually launched
    for(; gid < num_wg; gid += MAX_NUM_WG)
    {
        int lid = get_local_id(0);

        int o_n_gid_off = gid % b_n;
        int bindex      = o_n_gid_off;
        MIOPEN_TYPE operand = b_off[bindex] * alpha1;

        while(lid < work_per_wg)
        {
            int o_n    = (bitmap & (1 << 0)) ? o_n_gid_off : lid % c_n;
            c_off[o_n] = MIOPEN_TENSOR_OP(a_off[o_n] * alpha0, operand) + beta * c_off[o_n];
            lid += get_local_size(0);
        }
    }
}
Simplifying this code would make it much clearer and more maintainable, and even 8-10 times faster.
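For reference, here is a minimal sketch of what the simplest case (three equal-length, densely packed tensors) could look like with a plain grid-stride loop. The flat parameter list and the element count n are assumptions for illustration, not the existing kernel interface:

__kernel void Op1dTensorFlat(global MIOPEN_TYPE* a,
                             global MIOPEN_TYPE* b,
                             global MIOPEN_TYPE* c,
                             const MIOPEN_TYPE alpha0,
                             const MIOPEN_TYPE alpha1,
                             const MIOPEN_TYPE beta,
                             const long Aoffset,
                             const long Boffset,
                             const long Coffset,
                             const long n)
{
    global MIOPEN_TYPE* a_off = a + Aoffset;
    global MIOPEN_TYPE* b_off = b + Boffset;
    global MIOPEN_TYPE* c_off = c + Coffset;

    // Plain grid-stride loop: each work-item processes elements
    // gid, gid + global_size, gid + 2 * global_size, ...
    for(long i = get_global_id(0); i < n; i += get_global_size(0))
    {
        c_off[i] = MIOPEN_TENSOR_OP(a_off[i] * alpha0, b_off[i] * alpha1) + beta * c_off[i];
    }
}

With a kernel shaped like this, the host side would only need to pick a reasonable global size (for example, a multiple of the CU count times the workgroup size), and the bitmap / work_per_wg machinery would disappear for the dense case.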
The same problem exists for the 2D, 3D, 4D and 5D operations, though some of the 2D and 3D cases have "specialized" kernels which were invented as workarounds for the slow generic ones.
Also, the 2D, 3D, 4D and 5D kernels do not support a stride in the last dimension, which contradicts the external API and breaks support for transposed tensors. Performance tuning of transposed operations should be a separate task, but for correctness they must at least work.
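To illustrate the correctness part, here is a hedged sketch of the same 1D loop with a per-tensor element stride. The stride parameters and their names are hypothetical; dense tensors would simply pass a stride of 1:

__kernel void Op1dTensorStrided(global MIOPEN_TYPE* a,
                                global MIOPEN_TYPE* b,
                                global MIOPEN_TYPE* c,
                                const MIOPEN_TYPE alpha0,
                                const MIOPEN_TYPE alpha1,
                                const MIOPEN_TYPE beta,
                                const long Aoffset,
                                const long Boffset,
                                const long Coffset,
                                const long Astride,
                                const long Bstride,
                                const long Cstride,
                                const long n)
{
    global MIOPEN_TYPE* a_off = a + Aoffset;
    global MIOPEN_TYPE* b_off = b + Boffset;
    global MIOPEN_TYPE* c_off = c + Coffset;

    // Same grid-stride loop as above; the only change is that the element
    // index is scaled by the per-tensor stride, which is what transposed
    // or otherwise strided views need.
    for(long i = get_global_id(0); i < n; i += get_global_size(0))
    {
        c_off[i * Cstride] = MIOPEN_TENSOR_OP(a_off[i * Astride] * alpha0,
                                              b_off[i * Bstride] * alpha1) +
                             beta * c_off[i * Cstride];
    }
}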
I would really appreciate any help with testing and performance regression analysis - I want to discuss performance metrics, test cases and performance goals for this task.
@junliume @JehandadKhan could you assign the proper urgency and importance to this issue and perhaps plan it accordingly?
This improvement has great potential, but I'm not sure about its direct impact and immediate value. The only things I know are that it is exposed through the external API and can be used by someone (or ignored because it's very slow), and that it is used internally for RNN (which has a few workarounds to cover the slowness of the generic case).