Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tests: Improve p2p tx propagation functional test #9762

Merged
merged 1 commit into from
Feb 14, 2025

Conversation

iamamyth
Copy link

@iamamyth iamamyth commented Feb 1, 2025

Reduce the likelihood of false positive failures in the p2p transaction propagation functional test by waiting up to a maximum timeout for a transaction to propagate, rather than using a fixed timeout, to reflect the random delay of Dandelion++ transaction propagation. This strategy also speeds test execution in cases where propagation occurs faster than the previously expected fixed delay.

for daemon in [daemon2, daemon3]:
# Due to Dandelion++, the network propagates transactions with a
# random delay, so poll for the transaction with a timeout
timeout = 16
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vtnerd Do you know the expected maximum propagation delay in this scenario, based on the D++ paper? That value would act as an approximate lower bound for the timeout, which could then be further padded to reflect physical realities of data transmission and processing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The delay should usually trigger the inbound fluff delay where it's using a poisson distribution, simply mimicking what Bitcoin was doing for the same situation. 95% of the values are in the 3-7.3 second range. A delay of 16 is basically nearly impossibl, so this should be acceptable.

As an explanation of why not exponential here - one of the nodes will have no outbound peers. I don't recall the paper stating how to handle the situation, so I decided to make it immediately fluff, as it would be an edge case. A fluff does randomized poisson delays, similar to what Bitcoin was doing at the time. I don't recall the d++ specifying how to handle fluff precisely either, maybe I need to revisit that paper.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vtnerd Per discussion in #9755, it might help to make this upper bound a bit more strict. As a 0% failure probability would be impossible for any test (CPUs can fail, etc), I'd aim for something around 1 in 1 million. How much could I reduce the timeout and achieve that bound?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless I've done the math horribly wrong, it looks like 13s should yield a failure probability of roughly 1 in a billion (I decided a million was too low); I've updated the test timeout accordingly.

@vtnerd
Copy link
Contributor

vtnerd commented Feb 2, 2025

I guess you were convinced the failure was primarily the sleep timeout? That was my assessment, as it seemed like an obvious issue.

@iamamyth iamamyth force-pushed the tests-p2p-tx-propagation branch from eb9cb97 to 6d147d1 Compare February 2, 2025 06:26
@iamamyth
Copy link
Author

iamamyth commented Feb 2, 2025

I think the low, fixed timeout generates quite a few false positive failures, consistent with the observed behavior of this test in CI. If any actual transaction propagation errors exist, I would expect they do not owe to recent connection management commits. I just modified the test to better differentiate propagation to 0-2 daemons, which might make it a bit more useful.

@iamamyth iamamyth force-pushed the tests-p2p-tx-propagation branch 2 times, most recently from 216523d to f918d48 Compare February 2, 2025 19:29
@iamamyth
Copy link
Author

iamamyth commented Feb 2, 2025

Test failure is a known-flaky unit test (node_server.race_condition) unrelated to this PR.

@iamamyth
Copy link
Author

iamamyth commented Feb 5, 2025

@vtnerd I made one minor change to show how many daemons see the transactions, but I think this will clear up the p2p test issue.

@iamamyth
Copy link
Author

iamamyth commented Feb 6, 2025

An example failure in a recent CI build (without this change): https://github.com/monero-project/monero/actions/runs/13163403906/job/36738631593?pr=9771.

Every failure I've seen on CI (and I've seen quite a few, at this point) is the same behavior, it's not the RPC refusing the connection, or generating a garbage response, it's simply the transaction not appearing in the pool by design; the failures are successes and the test has a wrong methodology.

@selsta
Copy link
Collaborator

selsta commented Feb 9, 2025

Reduce the likelihood of false positive failures in the p2p transaction propagation functional test by waiting up to a maximum timeout for a transaction to propagate, rather than using a fixed timeout, to reflect the random delay of Dandelion++ transaction propagation.

I'm not against this patch but what I don't understand is why this test only recently started to fail randomly, if it's just timeout probability. To me it seems there's more to it.

@iamamyth
Copy link
Author

iamamyth commented Feb 9, 2025

I'm not against this patch but what I don't understand is why this test only recently started to fail randomly, if it's just timeout probability. To me it seems there's more to it.

I don't know when the failures started (CI has so many broken tests, even now, that I imagine many people just ignore it). However, as stated, due to the test's structural flaw, if, in fact, it used to pass and now fails, that may reflect an improvement in Dandelion++ or networking code, because it should have had a moderate probability of failure in the past. Of course, it could just as easily look better now because of some new bug producing interference; a proper investigation of that possibility would require establishing a timeline to narrow the search. None of the aforementioned possibilities alter the correctness of this patch, which will make the test actually useful in the future, rather than noisy, broken, and unable to surface relevant contextual information, such as how many daemons received the transaction.

Reduce the likelihood of false positive failures in the p2p
transaction propagation functional test by waiting up to a
maximum timeout for a transaction to propagate, rather than using a
fixed timeout, to reflect the random delay of Dandelion++ transaction
propagation. This strategy also speeds test execution in cases where
propagation occurs faster than the previously expected fixed delay.
@iamamyth iamamyth force-pushed the tests-p2p-tx-propagation branch from f918d48 to 950ddbf Compare February 12, 2025 02:30
@selsta
Copy link
Collaborator

selsta commented Feb 13, 2025

@iamamyth please also open against release-v0.18.

@tobtoht tobtoht merged commit 28e2042 into monero-project:master Feb 14, 2025
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants