
mana: poll HWC EQE after interrupt wait #814

Conversation

jimdaubert-ms (Contributor)

  • Use wait and poll cycles to allow HWC command to succeed when new EQE is found regardless of interrupt wait success.
  • Detect missed interrupt condition and report in logging and in shmem command for soc logging.
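
The wait-then-poll idea in the bullets above can be sketched as a small decision: if the interrupt wait times out but a subsequent poll still finds a new EQE, the interrupt was missed and the command can proceed anyway, with the condition logged. This is an illustrative sketch only; the names (`EqeOutcome`, `process_eq`) are hypothetical and not the actual driver API.

```rust
// Hypothetical sketch of the wait-then-poll outcome classification
// described in this PR; names are illustrative, not the real driver's.
#[derive(Debug, PartialEq)]
enum EqeOutcome {
    /// Interrupt wait succeeded and an EQE was found: the normal path.
    Interrupt,
    /// Interrupt wait timed out but polling still found a new EQE:
    /// a missed interrupt, reported in logging and via shmem for soc logs.
    MissedInterrupt,
    /// Neither the wait nor polling produced an EQE within the timeout.
    Timeout,
}

fn process_eq(wait_succeeded: bool, eqe_found: bool) -> EqeOutcome {
    match (wait_succeeded, eqe_found) {
        (true, _) => EqeOutcome::Interrupt,
        (false, true) => EqeOutcome::MissedInterrupt,
        (false, false) => EqeOutcome::Timeout,
    }
}
```

The key point is the middle arm: an EQE found after a failed wait no longer fails the HWC command, it only records the missed interrupt.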

@jimdaubert-ms jimdaubert-ms requested a review from a team as a code owner February 7, 2025 19:15
erfrimod
erfrimod previously approved these changes Feb 7, 2025
@mattkur mattkur added the backport_2411 Change should be backported to the release/2411 branch label Feb 10, 2025

mattkur commented Feb 10, 2025

Tagging for consideration to include in the 2411 release because this seems to be about detecting and providing insight into MANA device errors.

erfrimod
erfrimod previously approved these changes Feb 10, 2025
Jim Daubert added 9 commits February 10, 2025 20:28
- Use wait and poll cycles to allow HWC command to succeed when new EQE
  is found regardless of interrupt wait success.
- Detect missed interrupt condition and report in logging and in shmem
  command for soc logging.
- Vary the wait time, starting at a 20 ms minimum and doubling up to a
  500 ms maximum within the 10 sec overall timeout, so that cases of
  msix write loss (possibly encountered from non-fixed socmana
  versions) result in minimal delay
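
The doubling inter-poll wait from the commit message above (20 ms start, 500 ms cap, 10 s overall budget) can be sketched as a backoff schedule. This is a standalone illustration with hypothetical names and structure; the real driver's constants and loop shape may differ.

```rust
// Illustrative sketch of the doubling inter-poll wait: start at
// `start_ms`, double each iteration up to `max_ms`, and stop once the
// overall `timeout_ms` budget is spent (clamping the final wait).
fn backoff_schedule(start_ms: u64, max_ms: u64, timeout_ms: u64) -> Vec<u64> {
    let mut waits = Vec::new();
    let mut wait = start_ms;
    let mut elapsed = 0u64;
    while elapsed < timeout_ms {
        // Never wait past the overall timeout.
        let w = wait.min(timeout_ms - elapsed);
        waits.push(w);
        elapsed += w;
        wait = (wait * 2).min(max_ms);
    }
    waits
}
```

With the values from the commit message, the schedule runs 20, 40, 80, 160, 320, then 500 ms per poll until the 10 s budget is exhausted, so a quickly-recovered msix write loss costs only tens of milliseconds.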
@jimdaubert-ms jimdaubert-ms force-pushed the user/jadauber/hwc_poll_eqe_after_interrupt_wait branch from 9157546 to d00843c Compare February 11, 2025 05:57

mattkur commented Feb 11, 2025

I took a look at this code. While I don't know the MANA-specific elements, the code looks good to me on the surface as an incremental change (there is some refactoring here that might make sense, but that expands the scope of this change). @erfrimod, are you happy with this change?


jimdaubert-ms commented Feb 11, 2025

> I took a look at this code. While I don't know the MANA-specific elements, the code looks good to me on the surface as an incremental change (there is some refactoring here that might make sense, but that expands the scope of this change). @erfrimod, are you happy with this change?

@mattkur the initial idea was to be very minimal: just poll for an EQE after the 10-second timeout fails. But the scope has crept a bit. First it became a poll/wait loop with short waits; then, per @jstarks's request, a refactor separated the poll/wait from all the logging; then another change I made yesterday, with @jstarks's OK, always waits instead of polling first, to increase the chances of catching and logging the missed interrupt. I also added a response struct, which refactored and cleaned things up quite a bit IMO. Finally, I changed the inter-poll wait to be very short initially (20 ms) and then double each time up to 500 ms, so that if this updated mana driver encounters the unfixed socmana with the known msix write loss issue, the delay will be negligible.

So all this said I think the scope has crept enough to result in a couple of nice refactors with good results.

I'd definitely be interested in what else you see as needing refactoring, and @erfrimod may want to take those on. I agree the CQ/EQ polling code could use a review if not a rework. But since the original goal of this PR was a targeted fix so that a "lost interrupt" never results in an HWC timeout (when the HWC is otherwise completely healthy, with interrupts working for all other commands), this PR IMO meets that goal with a reasonably sized increment and reasonable refactoring.


mattkur commented Feb 11, 2025


Thanks for the additional details. Apologies: I did not mean that your changes specifically need refactoring, nor to demean the cleanup that comes with them. Rather, some of the existing code could be cleaned up as it has grown in scope (for example, the abstractions of the BAR used in the nvme code make it easier for new authors to understand the code). I was trying to say: looks good, and I'll resist the urge to scope creep.


async fn process_eqs_or_wait(&mut self) -> anyhow::Result<()> {
let eqe_wait_result = self.process_eqs_or_wait_with_retry().await;
Contributor
Style, but more a comment as I continue to build up my intuition in Rust code (feel free to tell me this is silly!): it seems that eqe_found is overloaded here. A tuple response would have been more explicit, something like the following:

async fn process_eqs_or_wait_with_retry(&mut self) -> (bool, EqeWaitResult, anyhow::Result<()>) {
...
            // Exit with no eqe found if timeout occurs.
            if eqe_wait_result.elapsed >= self.hwc_timeout_in_ms as u128 {
                break (false, eqe_wait_result, Ok(()));
            }
....
        let (wait_failed, eqe_wait_result, r) = self.process_eqs_or_wait_with_retry().await;
        ...

Contributor Author
I had returned a tuple response in an earlier iteration -- a 5- or 6-element tuple. This got unwieldy, partly because with that many elements cargo fmt prefers each element on a separate line rather than the one nice break line you show, so each break became a many-line affair. I therefore added the struct. At first I did exactly as you suggest, with eqe_found, the result struct, and a result value; in the end I thought packaging everything in a struct looked cleanest. I don't quite follow how any of this means eqe_found is overloaded, though. Whether a member of the struct or a separate tuple element, eqe_found would be equally loaded.
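
The struct-over-tuple trade-off discussed here can be illustrated with a small sketch. The field names below are hypothetical, chosen only to mirror the discussion, not taken from the actual PR.

```rust
// Hypothetical sketch: a named result struct instead of a wide tuple.
// With several fields, a tuple forces positional destructuring at every
// call site and cargo fmt spreads it across many lines; a struct keeps
// each early exit a one-liner and lets call sites name what they read.
#[derive(Debug, Default)]
struct EqeWaitResult {
    eqe_found: bool,
    missed_interrupt: bool,
    elapsed_ms: u128,
    polls: u32,
}

// Example consumer: callers reference fields by name rather than
// remembering tuple positions.
fn wait_summary(r: &EqeWaitResult) -> String {
    format!(
        "eqe_found={} missed_interrupt={} elapsed_ms={} polls={}",
        r.eqe_found, r.missed_interrupt, r.elapsed_ms, r.polls
    )
}
```

The `Default` derive also gives a clean starting value for the loop to fill in, which a bare tuple does not.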

Contributor

hah, alright. That's what I get for coming late. Thanks Jim.

@jimdaubert-ms jimdaubert-ms merged commit a29e6af into microsoft:main Feb 11, 2025
26 checks passed
Brian-Perkins (Contributor)

Should take for release/2411
