Enable flash attention #5

Open

s-ryosky opened this issue Dec 6, 2024 · 3 comments

@s-ryosky commented Dec 6, 2024

Thank you for sharing your code!

I am trying to apply flash attention to the MHA, as in StreamPETR, but the loss becomes NaN during training.
Have you ever encountered this issue?
If so, do you know of a solution?

@AlmoonYsl (Owner) commented

Hi @s-ryosky,
This problem is caused by the fp16 computation used in FlashAttention. You can try the memory-efficient attention implemented in xformers, which supports fp32 computation in the attention.
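A minimal sketch of that swap, assuming xformers is installed; the tensor shapes and names below are illustrative and not taken from this repository:

```python
import torch
import xformers.ops as xops

# Illustrative shapes: batch, tokens, heads, head dim (not this repo's values).
B, L, H, D = 2, 900, 8, 32
q = torch.randn(B, L, H, D, device="cuda", dtype=torch.float32)
k = torch.randn(B, L, H, D, device="cuda", dtype=torch.float32)
v = torch.randn(B, L, H, D, device="cuda", dtype=torch.float32)

# xformers' memory-efficient attention accepts fp32 inputs, unlike the
# FlashAttention kernels, which require fp16/bf16.
out = xops.memory_efficient_attention(q, k, v)  # (B, L, H, D)
```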

@s-ryosky (Author) commented Dec 9, 2024

@AlmoonYsl
Thank you for your reply. I'll look into it.

However, I don't think the object-wise position embedding itself is incompatible with flash attention.
Compared to StreamPETR, there seem to be differences in how inverse sigmoid and pos2posemb3d are used for the position embedding.
Do you think this could be one of the reasons for the training instability?

@AlmoonYsl (Owner) commented

> @AlmoonYsl Thank you for your reply. I'll look into it.
>
> However, I don't think the object-wise position embedding itself is incompatible with flash attention. Compared to StreamPETR, there seem to be differences in how inverse sigmoid and pos2posemb3d are used for the position embedding. Do you think this could be one of the reasons for the training instability?

I think there may be a numerical overflow problem when calculating the position embedding and its gradient in fp16.
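A possible mitigation, sketched below under the assumption that the embedding is a PETR-style sinusoidal encoding: force the inverse sigmoid and the sinusoidal embedding to run in fp32 even when the rest of the model runs under AMP, then cast the result back to the attention dtype. The helper below is an illustrative simplification, not this repository's pos2posemb3d.

```python
import torch

def inverse_sigmoid(x, eps=1e-5):
    # Clamp before the log so fp32 values stay finite.
    x = x.clamp(min=eps, max=1 - eps)
    return torch.log(x / (1 - x))

def pos_embed_fp32(pos, num_feats=128, temperature=10000):
    """Simplified PETR-style sinusoidal embedding, forced to fp32 under AMP.

    pos: (..., 3) normalized 3D coordinates in [0, 1].
    returns: (..., 3 * num_feats) embedding in fp32 (cast back as needed).
    """
    with torch.cuda.amp.autocast(enabled=False):
        pos = pos.float()
        dim_t = torch.arange(num_feats, dtype=torch.float32, device=pos.device)
        dim_t = temperature ** (2 * (dim_t // 2) / num_feats)
        emb = pos[..., None] / dim_t                        # (..., 3, num_feats)
        emb = torch.stack((emb[..., 0::2].sin(),
                           emb[..., 1::2].cos()), dim=-1)   # (..., 3, num_feats/2, 2)
        return emb.flatten(-3)                              # (..., 3 * num_feats)
```

Whether this alone removes the NaN depends on where the overflow actually occurs; keeping the whole attention computation in fp32 via xformers, as suggested above, is the more conservative option.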
