
Mismatch between CPU and NPU in simple Conv2D #210

Open
deneriz-veridas opened this issue Aug 21, 2024 · 5 comments

Comments

@deneriz-veridas

Hi,

We have been working with the VX delegate to execute TFLite models on the NPU of the i.MX 8M Plus, which is a VeriSilicon VIPNano-SI+. In doing so, we have found mismatches between running the model on the CPU and on the NPU, even for a model with a single Conv2D layer with a 3x3 kernel and 'same' padding. This plot shows the distribution of the mismatch.

[image: distribution of the CPU-NPU mismatch]

Moreover, these mismatch errors propagate through the layers of the model. This file (conv-sequence.zip) contains the decomposition of a model with 20 Conv2D layers into 20 models, each adding one layer to the previous one, which allows measuring the mismatch after each layer. The following plot shows this propagation across the model.

[image: mismatch propagation across the 20 models]
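As a rough illustration of the propagation effect (a toy NumPy sketch, not the actual hardware behaviour; the 0.05 quantization scale, the averaging kernel, and the per-layer ±1 LSB error model are all assumptions), injecting a rounding error after each layer of a stack of 3x3 convolutions shows a similar growth of the worst-case error with depth:

```python
import numpy as np

def simulate(depth, inject_error, seed=0):
    """Toy stack of `depth` 3x3 averaging convs ('same' padding) on a
    random 16x16 image; optionally inject a +/-1 LSB error per layer."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((16, 16))
    scale = 0.05                          # hypothetical int8 quantization scale
    kernel = np.full((3, 3), 1.0 / 9.0)   # simple averaging convolution
    for _ in range(depth):
        padded = np.pad(x, 1)
        x = sum(padded[i:i + 16, j:j + 16] * kernel[i, j]
                for i in range(3) for j in range(3))
        if inject_error:                  # +/-1 LSB rounding error per layer
            x = x + scale * rng.integers(-1, 2, size=x.shape)
    return x

# Worst-case |reference - perturbed| analogue after 1, 5, 10 and 20 layers.
errors = [float(np.max(np.abs(simulate(d, True) - simulate(d, False))))
          for d in (1, 5, 10, 20)]
```

The single-layer error stays within one LSB, while the deeper stacks accumulate the per-layer injections, qualitatively matching the plot above.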

Is there a way to avoid this mismatch? Is this a known issue with this NPU?

We are using TFLite Runtime 2.9.1.1 and the forked iMX delegate under version lf-5.15.71_2.2.0.

@jetxeberria

I'm also seeing a mismatch between CPU and NPU executions. This is very annoying, help please!

@sunshinemyson
Contributor

@deneriz-veridas @jetxeberria ,

Thanks for your feedback, and for the very nice data analysis. Our NPU integer math is not bit-accurate compared to the TFLite CPU implementation: for a single layer, expect a distance of about 1 bit (1 LSB).

In our practice, this difference doesn't impact the top-1 accuracy of MobileNet-V1. We usually check the result from the application's point of view (e.g. labels or boxes), rather than comparing the absolute error between CPU and NPU.
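For example, an application-level check can compare top-1 predictions instead of raw output error. A sketch with synthetic logits (the 1-LSB noise model and the 0.02 scale are assumptions for illustration, not measurements of the NPU):

```python
import numpy as np

def top1_agreement(cpu_logits, npu_logits):
    """Fraction of samples where CPU and NPU pick the same class."""
    return float(np.mean(np.argmax(cpu_logits, axis=1)
                         == np.argmax(npu_logits, axis=1)))

# Synthetic stand-in: pretend the NPU output is the CPU output plus
# +/-1 LSB of quantization noise (hypothetical scale of 0.02).
rng = np.random.default_rng(0)
cpu = rng.standard_normal((100, 1000))      # 100 samples, 1000 classes
npu = cpu + 0.02 * rng.integers(-1, 2, size=cpu.shape)
agreement = top1_agreement(cpu, npu)
```

A small per-element error only changes the prediction when the top two logits are closer than the noise, which is why top-1 accuracy is usually insensitive to it.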

@deneriz-veridas
Author

Hi @sunshinemyson,

Thanks for having a look at this issue. I understand these errors can have minimal impact in classification applications. However, we are working with a model that generates embeddings, which we then use to compute distances between samples. In that application, the errors matter much more.

Could you elaborate on the bit-accuracy of the NPU integer math? Have you characterized when this happens? We are looking for a way to avoid or mitigate it. Thanks in advance!

@sunshinemyson
Contributor

@deneriz-veridas ,

The main problem is the difference in rounding mode between our HW implementation and the CPU. We also have a different accumulator buffer depth than the CPU, which uses 64 bits. The CPU does double rounding when converting to the 8-bit output, while we round only once.
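A toy integer example of the effect (illustrative only; this is not the actual CPU or NPU requantization pipeline): rounding a fixed-point value in two stages can land one LSB away from rounding it once.

```python
def rshift_round(x, n):
    """Arithmetic right shift by n with round-half-up, the kind of
    rounding step used in fixed-point requantization."""
    return (x + (1 << (n - 1))) >> n

acc = 19    # hypothetical accumulator value; exact result is 19/8 = 2.375

single = rshift_round(acc, 3)                   # round once
double = rshift_round(rshift_round(acc, 2), 1)  # round in two stages
```

Here 19/8 = 2.375 rounds to 2 in one step, but the two-stage path first rounds 19/4 = 4.75 up to 5, and then 5/2 = 2.5 up to 3, one LSB higher.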

Sorry for the late reply.

@hgaiser

hgaiser commented Jan 7, 2025

@sunshinemyson since this is a HW difference, is there no way around this issue?

It seems that these differences don't impact classification much, but in my case I'm doing semantic segmentation. Essentially it's two networks: one that encodes an embedding of an image, and a decoder that uses that embedding to compute the segmentation. The differences introduced by the NPU in the embedding are large enough that the decoder doesn't understand the embedding and fails to segment the object entirely.
