
Mismatch between CPU and NPU in simple Conv2D #210

Open
deneriz-veridas opened this issue Aug 21, 2024 · 5 comments

Comments

@deneriz-veridas

Hi,

We have been working with the VX delegate to execute TFLite models on the NPU of the i.MX 8M Plus, which is a VeriSilicon VIPNano-SI+. In doing so, we have found mismatches between running the model on the CPU and on the NPU, even for a model with a single Conv2D layer with a 3x3 kernel and 'same' padding. This plot shows the distribution of the mismatch.

[image: distribution of the CPU-NPU mismatch]

Moreover, these mismatch errors propagate through the layers of the model. This file (conv-sequence.zip) contains the decomposition of a model with 20 Conv2D layers into 20 models, each adding one layer to the previous one, which allows measuring the mismatch after each layer. The following plot shows this propagation across the model.

[image: mismatch propagation across the 20 models]
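As a rough illustration of the propagation effect (a toy NumPy sketch, not the actual hardware behaviour; the 0.05 quantization scale, the averaging kernel, and the per-layer ±1 LSB error model are all assumptions), injecting a rounding error after each layer of a stack of 3x3 convolutions shows a similar growth of the worst-case error with depth:

```python
import numpy as np

def simulate(depth, inject_error, seed=0):
    """Toy stack of `depth` 3x3 averaging convs ('same' padding) on a
    random 16x16 image; optionally inject a +/-1 LSB error per layer."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((16, 16))
    scale = 0.05                          # hypothetical int8 quantization scale
    kernel = np.full((3, 3), 1.0 / 9.0)   # simple averaging convolution
    for _ in range(depth):
        padded = np.pad(x, 1)
        x = sum(padded[i:i + 16, j:j + 16] * kernel[i, j]
                for i in range(3) for j in range(3))
        if inject_error:                  # +/-1 LSB rounding error per layer
            x = x + scale * rng.integers(-1, 2, size=x.shape)
    return x

# Worst-case |reference - perturbed| analogue after 1, 5, 10 and 20 layers.
errors = [float(np.max(np.abs(simulate(d, True) - simulate(d, False))))
          for d in (1, 5, 10, 20)]
```

The single-layer error stays within one LSB, while the deeper stacks accumulate the per-layer injections, qualitatively matching the plot above.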

Is there a way to avoid this mismatch? Is this a known issue with this NPU?

We are using TFLite Runtime 2.9.1.1 and the forked iMX delegate under version lf-5.15.71_2.2.0.

@jetxeberria

I'm also seeing a mismatch between CPU and NPU executions. This is very annoying, help please!

@sunshinemyson
Contributor

@deneriz-veridas @jetxeberria ,

Thanks for your feedback, and for the very nice data analysis. Our NPU integer math is not bit-accurate compared to the TFLite CPU implementation: for a single layer, expect a distance of about 1 bit (1 LSB).

In our practice, this difference doesn't impact the top-1 accuracy of MobileNet-V1. We usually check the result from the application's point of view (e.g. labels or boxes), rather than comparing the absolute error between CPU and NPU.
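For example, an application-level check can compare top-1 predictions instead of raw output error. A sketch with synthetic logits (the 1-LSB noise model and the 0.02 scale are assumptions for illustration, not measurements of the NPU):

```python
import numpy as np

def top1_agreement(cpu_logits, npu_logits):
    """Fraction of samples where CPU and NPU pick the same class."""
    return float(np.mean(np.argmax(cpu_logits, axis=1)
                         == np.argmax(npu_logits, axis=1)))

# Synthetic stand-in: pretend the NPU output is the CPU output plus
# +/-1 LSB of quantization noise (hypothetical scale of 0.02).
rng = np.random.default_rng(0)
cpu = rng.standard_normal((100, 1000))      # 100 samples, 1000 classes
npu = cpu + 0.02 * rng.integers(-1, 2, size=cpu.shape)
agreement = top1_agreement(cpu, npu)
```

A small per-element error only changes the prediction when the top two logits are closer than the noise, which is why top-1 accuracy is usually insensitive to it.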

@deneriz-veridas
Author

Hi @sunshinemyson,

Thanks for having a look at this issue. I understand these errors can have minimal impact in classification applications. However, we are working with a model that generates embeddings, which we then use to compute distances between samples. In that application, the errors matter much more.

Could you elaborate on the bit-accuracy of the NPU integer math? Have you characterized when this happens? We are looking for a way to avoid or mitigate it. Thanks in advance!

@sunshinemyson
Contributor

@deneriz-veridas ,

The main problem is the difference in rounding mode between our HW implementation and the CPU. We also have a different accumulator buffer depth than the CPU, which uses 64 bits. The CPU does double rounding when converting to the 8-bit output, while we round only once.
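A toy integer example of the effect (illustrative only; this is not the actual CPU or NPU requantization pipeline): rounding a fixed-point value in two stages can land one LSB away from rounding it once.

```python
def rshift_round(x, n):
    """Arithmetic right shift by n with round-half-up, the kind of
    rounding step used in fixed-point requantization."""
    return (x + (1 << (n - 1))) >> n

acc = 19    # hypothetical accumulator value; exact result is 19/8 = 2.375

single = rshift_round(acc, 3)                   # round once
double = rshift_round(rshift_round(acc, 2), 1)  # round in two stages
```

Here 19/8 = 2.375 rounds to 2 in one step, but the two-stage path first rounds 19/4 = 4.75 up to 5, and then 5/2 = 2.5 up to 3, one LSB higher.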

Sorry for the late reply.

@hgaiser

hgaiser commented Jan 7, 2025

@sunshinemyson since this is a HW difference, is there no way around this issue?

It seems that these differences don't impact classification much, but in my case I'm doing semantic segmentation. Essentially it's two networks: one that encodes an embedding of an image, and a decoder that uses that embedding to compute the segmentation. The differences introduced by the NPU in the embedding are large enough that the decoder doesn't understand the embedding and fails to segment the object entirely.
