[CI] HealthNodeUpgradeIT testHealthNode {upgradedNodes=1} failing #118157

Closed
elasticsearchmachine opened this issue Dec 6, 2024 · 11 comments
Labels
:Data Management/Health; low-risk [An open issue or test failure that is a low risk to future releases]; Team:Data Management [Meta label for data/management team]; >test-failure [Triaged test failures from CI]

Comments

@elasticsearchmachine
Collaborator

elasticsearchmachine commented Dec 6, 2024

Build Scans:

Reproduction Line:

./gradlew ":qa:rolling-upgrade:v8.6.2#bwcTest" -Dtests.class="org.elasticsearch.upgrades.HealthNodeUpgradeIT" -Dtests.method="testHealthNode {upgradedNodes=1}" -Dtests.seed=C8A826B8E990C4F9 -Dtests.bwc=true -Dtests.locale=hu-Latn-HU -Dtests.timezone=EST -Druntime.java=23

Applicable branches:
8.x

Reproduces locally?:
N/A

Failure History:
See dashboard

Failure Message:

org.elasticsearch.client.ResponseException: method [GET], host [http://[::1]:44001], URI [_internal/_health], status line [HTTP/1.1 405 Method Not Allowed]
{"error":"Incorrect HTTP method for uri [_internal/_health] and method [GET], allowed: [POST]","status":405}

Issue Reasons:

  • [8.x] 12 consecutive failures in step 8.6.2_bwc
  • [8.x] 13 consecutive failures in step 8.5.3_bwc
  • [8.x] 25 failures in test testHealthNode {upgradedNodes=1} (2.6% fail rate in 961 executions)
  • [8.x] 12 failures in step 8.6.2_bwc (100.0% fail rate in 12 executions)
  • [8.x] 13 failures in step 8.5.3_bwc (100.0% fail rate in 13 executions)
  • [8.x] 12 failures in pipeline elasticsearch-periodic (100.0% fail rate in 12 executions)

Note:
This issue was created using new test triage automation. Please report issues or feedback to es-delivery.

@elasticsearchmachine added the labels :StorageEngine/Mapping [The storage related side of mappings]; >test-failure [Triaged test failures from CI]; Team:StorageEngine; needs:risk [Requires assignment of a risk label (low, medium, blocker)] on Dec 6, 2024
@elasticsearchmachine
Collaborator Author

Pinging @elastic/es-storage-engine (Team:StorageEngine)

@kkrik-es kkrik-es assigned kkrik-es and unassigned kkrik-es Dec 6, 2024
@kkrik-es added the labels Team:Data Management [Meta label for data/management team]; :Data Management/Health and removed the labels Team:StorageEngine; :StorageEngine/Mapping [The storage related side of mappings] on Dec 6, 2024
@elasticsearchmachine
Collaborator Author

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine
Collaborator Author

This has been muted on branch 8.x

Mute Reasons:

  • [8.x] 13 consecutive failures in step 8.5.3_bwc
  • [8.x] 11 consecutive failures in step 8.6.2_bwc
  • [8.x] 24 failures in test testHealthNode {upgradedNodes=1} (2.6% fail rate in 934 executions)
  • [8.x] 13 failures in step 8.5.3_bwc (100.0% fail rate in 13 executions)
  • [8.x] 11 failures in step 8.6.2_bwc (100.0% fail rate in 11 executions)
  • [8.x] 12 failures in pipeline elasticsearch-periodic (100.0% fail rate in 12 executions)

Build Scans:

@dakrone added the label low-risk [An open issue or test failure that is a low risk to future releases] and removed the label needs:risk [Requires assignment of a risk label (low, medium, blocker)] on Dec 17, 2024
@PeteGillinElastic
Member

This (and #118158) complains that the GET /_internal/_health endpoint doesn't exist when running the BWC tests for 8.5.3 and 8.6.2. This is because we renamed that endpoint in 8.7.0. (It was very briefly at GET /_health but settled on GET /_health_report.)

@PeteGillinElastic
Member

For the record, this entire test was removed from main, but obviously still runs in these BWC cases.

@PeteGillinElastic
Member

This looks related to #106933 although that is a different exception. At any rate, my understanding is that no change made to the test code in main will affect the BWC test (which uses the test code from the historic tag).

My feeling is that we can just suppress these tests permanently (or for as long as we run the 8.x BWC tests). The _internal endpoint was never meant to be backwards compatible so this seems fine.

@PeteGillinElastic
Member

For the record again: These failures are all on 8.x. We don't do the 8.5.3 or 8.6.2 BWC tests on main.

@PeteGillinElastic
Member

Wait, I don't think this test is doing what I had assumed it was doing. When I run the test from 8.x, it seems to be taking the test code from the head of that branch. (The test class doesn't even exist at 8.6.2 or 8.5.3.) But if I get the test to do GET / then it reports that the node is version 8.18.0-SNAPSHOT.

@PeteGillinElastic
Member

Ah, okay. When I make the test do GET /_cluster/state?filter_path=nodes_features,nodes.*.version, I get this:

{nodes_features=[{features=[], node_id=-wPGWBI-TeOcB9mwggk1sw}, {features=[], node_id=05OafmF5QyCHuTqfkrChpw}, {features=[], node_id=44gnsY7HTsK4sOaHDGEMwg}], nodes={44gnsY7HTsK4sOaHDGEMwg={version=8.18.0}, -wPGWBI-TeOcB9mwggk1sw={version=8.6.2}, 05OafmF5QyCHuTqfkrChpw={version=8.6.2}}}

So this test is running against a mixture of 8.6.2 and 8.18.0 nodes. I expect that the cluster would not report having the health.supports_health_report_api feature (because that was enabled from 8.7.0, and a cluster feature is only considered present if it is present on all nodes), and so the test will try to GET _internal/_health. That would succeed if the request got served by an 8.6.2 node but not if it got served by an 8.18.0 node. So I think whether the test passes or fails depends on which node that request goes to. I'm not even sure whether that's deterministic.
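
The reasoning above can be sketched as a small Python model. This is not Elasticsearch code: the function names (`cluster_has_feature`, `endpoint_chosen_by_test`, `node_serves`) are hypothetical, and the model assumes that nodes older than 8.7.0 serve only `_internal/_health` while nodes from 8.7.0 onwards serve only `_health_report`, as described earlier in this thread. The cluster state is the response quoted above, rewritten as Python literals.

```python
from typing import Dict, List, Tuple

# Feature name quoted in the comment above.
HEALTH_REPORT_FEATURE = "health.supports_health_report_api"

# Cluster state from the comment above, rewritten as Python literals.
CLUSTER_STATE = {
    "nodes_features": [
        {"features": [], "node_id": "-wPGWBI-TeOcB9mwggk1sw"},
        {"features": [], "node_id": "05OafmF5QyCHuTqfkrChpw"},
        {"features": [], "node_id": "44gnsY7HTsK4sOaHDGEMwg"},
    ],
    "nodes": {
        "44gnsY7HTsK4sOaHDGEMwg": {"version": "8.18.0"},
        "-wPGWBI-TeOcB9mwggk1sw": {"version": "8.6.2"},
        "05OafmF5QyCHuTqfkrChpw": {"version": "8.6.2"},
    },
}


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '8.18.0' or '8.18.0-SNAPSHOT' into a comparable tuple like (8, 18, 0)."""
    return tuple(int(part) for part in version.split("-")[0].split("."))


def cluster_has_feature(nodes_features: List[Dict], feature: str) -> bool:
    """A cluster feature counts as present only if every node reports it."""
    return bool(nodes_features) and all(
        feature in node["features"] for node in nodes_features
    )


def endpoint_chosen_by_test(nodes_features: List[Dict]) -> str:
    """Hypothetical model of the test's choice of health endpoint."""
    if cluster_has_feature(nodes_features, HEALTH_REPORT_FEATURE):
        return "_health_report"
    return "_internal/_health"


def node_serves(version: str, endpoint: str) -> bool:
    """Assumption: the endpoint was renamed in 8.7.0 with no overlap period."""
    renamed = parse_version(version) >= (8, 7, 0)
    if endpoint == "_health_report":
        return renamed
    if endpoint == "_internal/_health":
        return not renamed
    return False


endpoint = endpoint_chosen_by_test(CLUSTER_STATE["nodes_features"])
outcome_by_node = {
    node_id: node_serves(info["version"], endpoint)
    for node_id, info in CLUSTER_STATE["nodes"].items()
}

print("endpoint chosen:", endpoint)
for node_id, ok in outcome_by_node.items():
    print(node_id, CLUSTER_STATE["nodes"][node_id]["version"], "serves it" if ok else "rejects it")
```

With this state the model picks `_internal/_health`, which the two 8.6.2 nodes serve and the 8.18.0 node does not, so the result depends on which node receives the request.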

@PeteGillinElastic
Member

I'm coming back to the view that we should suppress the test for the 8.5.3 and 8.6.2 upgrades. I don't think we can reliably do that test when some of the nodes in the cluster have _internal/_health and some of them have _health_report.
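
As a rough illustration of that suppression rule, here is a minimal Python sketch. It is not the change that was eventually committed: `should_skip_health_check` and `RENAME_VERSION` are hypothetical names, and the rule simply assumes that any rolling upgrade starting from a version older than 8.7.0 (where the endpoint was renamed) produces a mixed cluster in which the two endpoints disagree.

```python
from typing import Tuple

# Assumption from this thread: the health endpoint was renamed in 8.7.0.
RENAME_VERSION = (8, 7, 0)


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '8.6.2' or '8.18.0-SNAPSHOT' into a comparable tuple like (8, 6, 2)."""
    return tuple(int(part) for part in version.split("-")[0].split("."))


def should_skip_health_check(old_cluster_version: str) -> bool:
    """Hypothetical rule: skip when the old nodes predate the endpoint rename."""
    return parse_version(old_cluster_version) < RENAME_VERSION


for old_version in ("8.5.3", "8.6.2", "8.7.0"):
    print(old_version, "skip" if should_skip_health_check(old_version) else "run")
```

Keying the rule on the old cluster version rather than on the node that happens to answer keeps the decision independent of request routing.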

PeteGillinElastic added a commit to PeteGillinElastic/elasticsearch that referenced this issue Jan 7, 2025
This excludes the `HealthNodeUpgradeIT` test for the rolling upgrade
tests which use a cluster with a mix of either 8.5.3 or 8.6.2 nodes,
which serve the health endpoint at `_internal/_health`, and 8.last
nodes, which serve it at `_health_report`. There is no sensible and
reliable way to test the endpoint in such clusters.

Closes elastic#118157
Closes elastic#118158
PeteGillinElastic added a commit that referenced this issue Jan 8, 2025
* Skip HealthNodeUpgradeIT for some rolling upgrades

This skips part of the `HealthNodeUpgradeIT` test for the rolling
upgrade tests which use a cluster with a mix of 8.5.x and 8.6.x nodes,
which serve the health endpoint at `_internal/_health`, and 8.last
nodes, which serve it at `_health_report`. There is no sensible and
reliable way to test the endpoint in such clusters.

Closes #118157
Closes #118158
@PeteGillinElastic
Member

Fixed by c124f1b.
