
mana: poll HWC EQE after interrupt wait #814

Conversation

jimdaubert-ms (Contributor)

  • Use wait and poll cycles to allow HWC command to succeed when new EQE is found regardless of interrupt wait success.
  • Detect missed interrupt condition and report in logging and in shmem command for soc logging.
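
The wait-then-poll idea in the bullets above can be sketched as a small decision: if the interrupt wait times out but a subsequent poll still finds a new EQE, the interrupt was missed and the command can proceed anyway, with the condition logged. This is an illustrative sketch only; the names (`EqeOutcome`, `process_eq`) are hypothetical and not the actual driver API.

```rust
// Hypothetical sketch of the wait-then-poll outcome classification
// described in this PR; names are illustrative, not the real driver's.
#[derive(Debug, PartialEq)]
enum EqeOutcome {
    /// Interrupt wait succeeded and an EQE was found: the normal path.
    Interrupt,
    /// Interrupt wait timed out but polling still found a new EQE:
    /// a missed interrupt, reported in logging and via shmem for soc logs.
    MissedInterrupt,
    /// Neither the wait nor polling produced an EQE within the timeout.
    Timeout,
}

fn process_eq(wait_succeeded: bool, eqe_found: bool) -> EqeOutcome {
    match (wait_succeeded, eqe_found) {
        (true, _) => EqeOutcome::Interrupt,
        (false, true) => EqeOutcome::MissedInterrupt,
        (false, false) => EqeOutcome::Timeout,
    }
}
```

The key point is the middle arm: an EQE found after a failed wait no longer fails the HWC command, it only records the missed interrupt.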

@jimdaubert-ms jimdaubert-ms requested a review from a team as a code owner February 7, 2025 19:15
erfrimod
erfrimod previously approved these changes Feb 7, 2025
@mattkur mattkur added the backport_2411 Change should be backported to the release/2411 branch label Feb 10, 2025

mattkur commented Feb 10, 2025

Tagging for consideration to include in the 2411 release because this seems to be about detecting and providing insight into MANA device errors.

erfrimod
erfrimod previously approved these changes Feb 10, 2025
Jim Daubert added 9 commits February 10, 2025 20:28
- Use wait and poll cycles to allow HWC command to succeed when new EQE
  is found regardless of interrupt wait success.
- Detect missed interrupt condition and report in logging and in shmem
  command for soc logging.
- Vary the wait time, starting at a 20 ms minimum and doubling up to a
  500 ms maximum within the 10 sec overall timeout, so that cases of
  msix write loss (possibly encountered from non-fixed socmana
  versions) result in minimal delay
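
The doubling inter-poll wait from the commit message above (20 ms start, 500 ms cap, 10 s overall budget) can be sketched as a backoff schedule. This is a standalone illustration with hypothetical names and structure; the real driver's constants and loop shape may differ.

```rust
// Illustrative sketch of the doubling inter-poll wait: start at
// `start_ms`, double each iteration up to `max_ms`, and stop once the
// overall `timeout_ms` budget is spent (clamping the final wait).
fn backoff_schedule(start_ms: u64, max_ms: u64, timeout_ms: u64) -> Vec<u64> {
    let mut waits = Vec::new();
    let mut wait = start_ms;
    let mut elapsed = 0u64;
    while elapsed < timeout_ms {
        // Never wait past the overall timeout.
        let w = wait.min(timeout_ms - elapsed);
        waits.push(w);
        elapsed += w;
        wait = (wait * 2).min(max_ms);
    }
    waits
}
```

With the values from the commit message, the schedule runs 20, 40, 80, 160, 320, then 500 ms per poll until the 10 s budget is exhausted, so a quickly-recovered msix write loss costs only tens of milliseconds.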
@jimdaubert-ms jimdaubert-ms force-pushed the user/jadauber/hwc_poll_eqe_after_interrupt_wait branch from 9157546 to d00843c Compare February 11, 2025 05:57

mattkur commented Feb 11, 2025

I took a look at this code. While I don't know the MANA-specific elements, the code looks good to me on the surface as an incremental change (there is some refactoring here that might make sense, but that expands the scope of this change). @erfrimod, are you happy with this change?


jimdaubert-ms commented Feb 11, 2025

> I took a look at this code. While I don't know the MANA-specific elements, the code looks good to me on the surface as an incremental change (there is some refactoring here that might make sense, but that expands the scope of this change). @erfrimod, are you happy with this change?

@mattkur the initial idea was to be very minimal: just poll for an EQE after the 10-second timeout fails. But the scope has crept a bit. First it became a poll/wait loop with short waits; then, per @jstarks's request, a refactor separated the poll/wait from all the logging; then another change I made yesterday, with @jstarks's OK, always waits instead of polling first, to increase the chances of catching and logging the missed interrupt. I also added a response struct, which refactored and cleaned things up quite a bit IMO. Finally, I changed the inter-poll wait to be very short initially (20 ms) and then double each time up to 500 ms, so that if this updated mana driver encounters the unfixed socmana with the known msix write loss issue, the delay will be negligible.

So all this said I think the scope has crept enough to result in a couple of nice refactors with good results.

I'd definitely be interested in what else you see as needing refactoring, and @erfrimod may want to take those on. I agree the CQ/EQ polling code could use a review if not a rework. But since the original goal of this PR was a targeted fix so that a "lost interrupt" never results in an HWC timeout (when the HWC is otherwise completely healthy, with interrupts working for all other commands), this PR IMO meets that goal with a reasonably sized increment and reasonable refactoring.


mattkur commented Feb 11, 2025


Thanks for the additional details. Apologies: I did not mean that your changes specifically need refactoring, nor to demean the cleanup that comes with them. Rather, some of the existing code could be cleaned up as it has grown in scope (for example, the abstractions of the BAR used in the nvme code make it easier for new authors to understand the code). I was trying to say: looks good, and I'll resist the urge to scope creep.


async fn process_eqs_or_wait(&mut self) -> anyhow::Result<()> {
let eqe_wait_result = self.process_eqs_or_wait_with_retry().await;
Contributor
Style, but more a comment as I continue to build up my intuition in Rust code (feel free to tell me this is silly!): it seems that eqe_found is overloaded here. A tuple response would have been more explicit, something like the following:

async fn process_eqs_or_wait_with_retry(&mut self) -> (bool, EqeWaitResult, anyhow::Result<()>) {
...
            // Exit with no eqe found if timeout occurs.
            if eqe_wait_result.elapsed >= self.hwc_timeout_in_ms as u128 {
                break (false, eqe_wait_result, Ok(()));
            }
....
        let (wait_failed, eqe_wait_result, r) = self.process_eqs_or_wait_with_retry().await;
        ...

Contributor Author
I had returned a tuple response in an earlier iteration -- a 5- or 6-element tuple. This got unwieldy, partly because with that many elements cargo fmt prefers each element on a separate line rather than the one nice break line you show, so each break became a many-line affair. I therefore added the struct. At first I did exactly as you suggest, with eqe_found, the result struct, and a result value; in the end I thought packaging everything in a struct looked cleanest. I don't quite follow how any of this means eqe_found is overloaded, though. Whether a member of the struct or a separate tuple element, eqe_found would be equally loaded.
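
The struct-over-tuple trade-off discussed here can be illustrated with a small sketch. The field names below are hypothetical, chosen only to mirror the discussion, not taken from the actual PR.

```rust
// Hypothetical sketch: a named result struct instead of a wide tuple.
// With several fields, a tuple forces positional destructuring at every
// call site and cargo fmt spreads it across many lines; a struct keeps
// each early exit a one-liner and lets call sites name what they read.
#[derive(Debug, Default)]
struct EqeWaitResult {
    eqe_found: bool,
    missed_interrupt: bool,
    elapsed_ms: u128,
    polls: u32,
}

// Example consumer: callers reference fields by name rather than
// remembering tuple positions.
fn wait_summary(r: &EqeWaitResult) -> String {
    format!(
        "eqe_found={} missed_interrupt={} elapsed_ms={} polls={}",
        r.eqe_found, r.missed_interrupt, r.elapsed_ms, r.polls
    )
}
```

The `Default` derive also gives a clean starting value for the loop to fill in, which a bare tuple does not.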

Contributor

hah, alright. That's what I get for coming late. Thanks Jim.

@jimdaubert-ms jimdaubert-ms merged commit a29e6af into microsoft:main Feb 11, 2025
26 checks passed
Brian-Perkins (Contributor)

Should take for release/2411
