-
Notifications
You must be signed in to change notification settings - Fork 104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
mana: poll HWC EQE after interrupt wait #814
mana: poll HWC EQE after interrupt wait #814
Conversation
jimdaubert-ms
commented
Feb 7, 2025
- Use wait and poll cycles to allow HWC command to succeed when new EQE is found regardless of interrupt wait success.
- Detect missed interrupt condition and report in logging and in shmem command for soc logging.
Tagging for consideration to include in the 2411 release because this seems to be about detecting and providing insight into MANA device errors. |
- Use wait and poll cycles to allow HWC command to succeed when new EQE is found regardless of interrupt wait success. - Detect missed interrupt condition and report in logging and in shmem command for soc logging.
…in true/false case
- Vary wait time starting at min 20 ms then doubling to max 500 ms and to 10 sec timeout, allowing cases of msix write loss (from non-fixed socmana versions possibly encountered) to result in minimal delay
9157546
to
d00843c
Compare
I took a look at this code. While I don't know the MANA-specific elements, the code looks good to me on the surface as an incremental change (there is some refactoring here that might make sense, but that expands the scope of this change). @erfrimod, are you happy with this change? |
@mattkur the initial idea to was to be very minimal, just poll for an EQE after the 10-second timeout fails, but the scope has crept a bit, first making it a poll wait loop with short waits, then per @jstarks request a refactor to separate the poll/wait from all the logging stuff, then another I made yesterday with @jstarks 's OK to always wait instead of polling first in order to increase chances of catching and logging the missed interrupt. I also added a response struct which refactored and cleaned up quite a bit IMO. Finally, I changed the inter-poll wait to be very short initially (20ms initially) then double each time up to 500 ms so that if this updated mana driver encounters the unfixed socmana with known issue of msix write loss, the delay will be negligible. So all this said I think the scope has crept enough to result in a couple of nice refactors with good results. I'd definitely be interested in what else you see as needed in refactor, and @erfrimod may want to take these on. I agree the CQ/EQ polling code could use a review if not rework. But as the original goal of this PR was to be a targeted fix so that we never have a "lost interrupt" result in HWC timeout (when otherwise the HWC is completely healthy with interrupts working for all other commands), this PR IMO meets that with as reasonably sized increment and reasonable refactor. |
Thanks for the additional details. Apologies: I did not mean that your changes specifically could use refactoring or to demean the cleanup that comes with your changes. Rather, some of the existing code could be cleaned up as it has grown in scope (for example: the abstractions of the |
} | ||
} | ||
} | ||
|
||
async fn process_eqs_or_wait(&mut self) -> anyhow::Result<()> { | ||
let eqe_wait_result = self.process_eqs_or_wait_with_retry().await; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style, but more a comment as I continue to build up my intuition in rust code (e.g. feel free to tell me this is silly!): it seems that eqe_found
is overloaded here. A tuple response would have been more explicit. Something like the following:
async fn process_eqs_or_wait_with_retry(&mut self) -> (bool, EqeWaitResult, anyhow::Result<()>) {
...
// Exit with no eqe found if timeout occurs.
if eqe_wait_result.elapsed >= self.hwc_timeout_in_ms as u128 {
break (false, eqe_wait_result, Ok(()));
}
....
let (wait_failed, eqe_wait_result, r) = self.process_eqs_or_wait_with_retry().await;
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had returned a tuple response in an earlier iterations -- 5 or 6 element tuple. This got unwieldy partly since with that many elements cargo fmt
prefers each element on a separate line, not one nice break line like you show. Each break became a many line affair. I therefore added the struct. At first, I did show exactly as you suggest with eqe_found, the result struct, and a result value. In the end I thought packaging all in a struct looked cleanest. I don't quite follow how any of this means eqe_found is overloaded. Whether member of struct or a separate tuple element eqe_found would be equally loaded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hah, alright. That's what I get for coming late. Thanks Jim.
Should take for release/2411 |