
main - java fips compatibility matrix openjdk17 bwcTestSnapshots general-purpose ccs-rolling-upgrade-remote-cluster #96134

Closed
kingherc opened this issue May 16, 2023 · 33 comments · Fixed by #119710
Labels
low-risk An open issue or test failure that is a low risk to future releases :Security/FIPS Running ES in FIPS 140-2 mode Team:Security Meta label for security team >test-failure Triaged test failures from CI

Comments

@kingherc
Contributor

kingherc commented May 16, 2023

CI Link

https://gradle-enterprise.elastic.co/s/br3zbuqz6alce

Repro line

N/A

Does it reproduce?

Didn't try

Applicable branches

main, 8.7, possibly others

Failure history

No response

Failure excerpt



:qa:ccs-rolling-upgrade-remote-cluster:v7.17.11#oldClusterTest FAILED

=== Log output of node `node{:qa:ccs-rolling-upgrade-remote-cluster:v7.17.11-local-0}` ===

»    ↓ errors and warnings from /dev/shm/elastic+elasticsearch+main+periodic+java-fips-matrix/qa/ccs-rolling-upgrade-remote-cluster/build/testclusters/v7.17.11-local-0/logs/es.out ↓
» [2023-05-15T21:20:22,822][ERROR][o.e.b.Elasticsearch      ] [v7.17.11-local-0] fatal exception while booting Elasticsearch org.elasticsearch.ElasticsearchException: failed to bind service
»      at [email protected]/org.elasticsearch.node.Node.<init>(Node.java:1169)
»      at [email protected]/org.elasticsearch.node.Node.<init>(Node.java:329)
»      at [email protected]/org.elasticsearch.bootstrap.Elasticsearch$2.<init>(Elasticsearch.java:216)
»      at [email protected]/org.elasticsearch.bootstrap.Elasticsearch.initPhase3(Elasticsearch.java:216)
»      at [email protected]/org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:67)
»  Caused by: org.elasticsearch.gateway.CorruptStateException: Format version is not supported. Upgrading to [8.9.0] is only supported from version [7.17.0].
»      at [email protected]/org.elasticsearch.env.NodeEnvironment.checkForIndexCompatibility(NodeEnvironment.java:515)
»      at [email protected]/org.elasticsearch.env.NodeEnvironment.upgradeLegacyNodeFolders(NodeEnvironment.java:414)
»      at [email protected]/org.elasticsearch.env.NodeEnvironment.<init>(NodeEnvironment.java:307)
»      at [email protected]/org.elasticsearch.node.Node.<init>(Node.java:485)
»      ... 4 more
»
»  ERROR: Elasticsearch did not exit normally - check the logs at /dev/shm/elastic+elasticsearch+main+periodic+java-fips-matrix/qa/ccs-rolling-upgrade-remote-cluster/build/testclusters/v7.17.11-local-0/logs/v7.17.11-local.log
»
»  ERROR: Elasticsearch exited unexpectedly
»   ↓ last 40 non error or warning messages from /dev/shm/elastic+elasticsearch+main+periodic+java-fips-matrix/qa/ccs-rolling-upgrade-remote-cluster/build/testclusters/v7.17.11-local-0/logs/es.out ↓
» [2023-05-15T21:20:18,670][INFO ][o.e.p.PluginsService     ] [v7.17.11-local-0] loaded module [mapper-version]
» [2023-05-15T21:20:18,670][INFO ][o.e.p.PluginsService     ] [v7.17.11-local-0] loaded module [mapper-extras]
» [2023-05-15T21:20:18,670][INFO ][o.e.p.PluginsService     ] [v7.17.11-local-0] loaded module [apm]
» [2023-05-15T21:20:18,671][INFO ][o.e.p.PluginsService     ] [v7.17.11-local-0] loaded module [x-pack-aggregate-metric]


@kingherc kingherc added :Search/Search Search-related issues that do not fall into other categories >test-failure Triaged test failures from CI Team:Search Meta label for search team needs:triage Requires assignment of a team area label labels May 16, 2023
@elasticsearchmachine elasticsearchmachine removed the needs:triage Requires assignment of a team area label label May 16, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search (Team:Search)

@kingherc
Contributor Author

kingherc commented May 16, 2023

Another one at https://gradle-enterprise.elastic.co/s/x4zw72rykmhv2/console-log?task=:qa:ccs-rolling-upgrade-remote-cluster:v7.17.11%23oldClusterTest in 8.7



»    ↓ errors and warnings from /dev/shm/elastic+elasticsearch+8.7+periodic+java-fips-matrix/qa/ccs-rolling-upgrade-remote-cluster/build/testclusters/v7.17.11-local-0/logs/es.out ↓
» [2023-05-16T02:30:23,971][ERROR][o.e.b.Elasticsearch      ] [v7.17.11-local-0] fatal exception while booting Elasticsearch org.elasticsearch.ElasticsearchException: failed to bind service
»      at [email protected]/org.elasticsearch.node.Node.<init>(Node.java:1148)
»      at [email protected]/org.elasticsearch.node.Node.<init>(Node.java:324)
»      at [email protected]/org.elasticsearch.bootstrap.Elasticsearch$2.<init>(Elasticsearch.java:216)
»      at [email protected]/org.elasticsearch.bootstrap.Elasticsearch.initPhase3(Elasticsearch.java:216)
»      at [email protected]/org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:67)
»  Caused by: org.elasticsearch.gateway.CorruptStateException: Format version is not supported. Upgrading to [8.7.2] is only supported from version [7.17.0].
»      at [email protected]/org.elasticsearch.env.NodeEnvironment.checkForIndexCompatibility(NodeEnvironment.java:515)
»      at [email protected]/org.elasticsearch.env.NodeEnvironment.upgradeLegacyNodeFolders(NodeEnvironment.java:414)
»      at [email protected]/org.elasticsearch.env.NodeEnvironment.<init>(NodeEnvironment.java:307)
»      at [email protected]/org.elasticsearch.node.Node.<init>(Node.java:480)
»      ... 4 more


@n1v0lg
Contributor

n1v0lg commented May 17, 2023

Seeing another failure today: https://gradle-enterprise.elastic.co/s/jmn5iecrre7to

I think the important bit for these is Caused by: org.elasticsearch.gateway.CorruptStateException: Format version is not supported. Upgrading to [8.8.0] is only supported from version [7.17.0]

@mark-vieira
Contributor

mark-vieira commented May 18, 2023

Strange that this is only happening in the FIPS jobs. I can't imagine why that would affect backward compatibility.

@astefan
Contributor

astefan commented May 25, 2023

Another one. FIPS again. https://gradle-enterprise.elastic.co/s/iemgifnzjjdi4

@mark-vieira mark-vieira added the :Core/Infra/Core Core issues without another label label May 25, 2023
@elasticsearchmachine elasticsearchmachine added the Team:Core/Infra Meta label for core/infra team label May 25, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@stu-elastic stu-elastic self-assigned this May 25, 2023
@craigtaverner
Contributor

craigtaverner commented May 26, 2023

I see this again today at https://gradle-enterprise.elastic.co/s/tfb2o66gvmxyk/console-log?anchor=4546&page=5

[2023-05-26T09:22:41,547][ERROR][o.e.b.Elasticsearch      ] [v7.17.11-local-0] fatal exception while booting Elasticsearch org.elasticsearch.ElasticsearchException: failed to bind service
»  Caused by: org.elasticsearch.gateway.CorruptStateException: Format version is not supported. Upgrading to [8.9.0] is only supported from version [7.17.0].

Later in the logs there are perhaps unrelated or irrelevant errors:

[2023-05-26T09:22:33,871][ERROR][o.e.x.c.s.SSLService     ] [v7.17.11-local-1] unsupported ciphers
    [[TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384, TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384]] were requested but cannot be used in this JVM,
    however there are supported ciphers that will be used [[TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384, TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256, TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256, TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA384, TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256, TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256, TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA, TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA, TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA, TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA, TLS_RSA_WITH_AES_256_GCM_SHA384, TLS_RSA_WITH_AES_128_GCM_SHA256, TLS_RSA_WITH_AES_256_CBC_SHA256, TLS_RSA_WITH_AES_128_CBC_SHA256, TLS_RSA_WITH_AES_256_CBC_SHA, TLS_RSA_WITH_AES_128_CBC_SHA]]. If you are trying to use ciphers with a key length greater than 128 bits on an Oracle JVM, you will need to install the unlimited strength JCE policy files.

@edsavage
Contributor

edsavage commented Jun 5, 2023

@ywangd
Member

ywangd commented Jun 7, 2023

@stu-elastic
Contributor

The failures are always "Format version is not supported":

»  Caused by: org.elasticsearch.gateway.CorruptStateException: Format version is not supported. Upgrading to [8.7.2] is only supported from version [7.17.0].
»  Caused by: org.elasticsearch.gateway.CorruptStateException: Format version is not supported. Upgrading to [8.8.0] is only supported from version [7.17.0].
»  Caused by: org.elasticsearch.gateway.CorruptStateException: Format version is not supported. Upgrading to [8.8.1] is only supported from version [7.17.0].
»  Caused by: org.elasticsearch.gateway.CorruptStateException: Format version is not supported. Upgrading to [8.9.0] is only supported from version [7.17.0].
»  Caused by: org.elasticsearch.gateway.CorruptStateException: Format version is not supported. Upgrading to [8.8.0] is only supported from version [7.17.0].

@stu-elastic
Contributor

The test passes for me locally: https://gradle-enterprise.elastic.co/s/d3evq5lhnvqtw

@stu-elastic
Copy link
Contributor

The actual check in NodeEnvironment.checkForIndexCompatibility is a check for metadata. I wonder if that's having issues on the FIPS machines.

    static void checkForIndexCompatibility(Logger logger, DataPath... dataPaths) throws IOException {
        final Path[] paths = Arrays.stream(dataPaths).map(np -> np.path).toArray(Path[]::new);
        NodeMetadata metadata = PersistedClusterStateService.nodeMetadata(paths);

        // We are upgrading the cluster, but we didn't find any previous metadata. Corrupted state or incompatible version.
        if (metadata == null) {
            throw new CorruptStateException(
                "Format version is not supported. Upgrading to ["
                    + Version.CURRENT
                    + "] is only supported from version ["
                    + Version.CURRENT.minimumCompatibilityVersion()
                    + "]."
            );
        }
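
The excerpt above throws the "Format version is not supported" error whenever no node metadata is found, whatever the actual versions are. The following is a minimal, self-contained sketch of that control flow, not the Elasticsearch implementation. `findNodeMetadata` is a hypothetical stand-in for `PersistedClusterStateService.nodeMetadata`, the assumption that metadata lives in a non-empty `_state` directory is a simplification, and `IllegalStateException` stands in for `CorruptStateException`.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Simplified model of the quoted check; all names here are illustrative.
class MetadataCheckSketch {

    // Hypothetical stand-in for PersistedClusterStateService.nodeMetadata(paths).
    // Assumption: "metadata exists" means some data path has a non-empty _state directory.
    static String findNodeMetadata(Path... dataPaths) throws IOException {
        for (Path dataPath : dataPaths) {
            Path stateDir = dataPath.resolve("_state");
            if (Files.isDirectory(stateDir)) {
                try (Stream<Path> entries = Files.list(stateDir)) {
                    if (entries.findAny().isPresent()) {
                        return "metadata found in " + stateDir;
                    }
                }
            }
        }
        return null; // nothing was ever persisted, e.g. only a node.lock exists
    }

    // Mirrors the shape of the quoted method: null metadata produces the version message.
    static void checkForIndexCompatibility(String current, String minCompatible, Path... dataPaths)
        throws IOException {
        if (findNodeMetadata(dataPaths) == null) {
            throw new IllegalStateException(
                "Format version is not supported. Upgrading to [" + current
                    + "] is only supported from version [" + minCompatible + "]."
            );
        }
    }

    public static void main(String[] args) throws IOException {
        // A data directory that holds a lock file but no persisted state.
        Path dataDir = Files.createTempDirectory("data");
        Files.createFile(dataDir.resolve("node.lock"));
        try {
            checkForIndexCompatibility("8.9.0", "7.17.0", dataDir);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Under these assumptions, a directory that was locked but never written to produces the same message as a genuinely incompatible upgrade, which is why the message alone does not identify the cause.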

@stu-elastic
Contributor

stu-elastic commented Jun 8, 2023

Seems to have started failing on May 3rd for the FIPS job: https://gradle-enterprise.elastic.co/s/whs4mppb44lz2 (May 3rd), https://gradle-enterprise.elastic.co/s/qiqtib7ht77dw (May 2nd)

@stu-elastic
Contributor

The only three recent successes for the main job are:
https://gradle-enterprise.elastic.co/s/svnovjom4juue - May 19 2023 15:51:39 CDT
https://gradle-enterprise.elastic.co/s/6u5llg576bue2 - May 22 2023 03:52:05 CDT
https://gradle-enterprise.elastic.co/s/2xao3b55zbgra - Jun 1 2023 15:52:02 CDT

@stu-elastic
Contributor

Seems to have started around this commit 7ae8408.

I'm wondering if there's an issue with the FIPS setup. Tagging security to have y'all take a look.

@stu-elastic stu-elastic added :Security/FIPS Running ES in FIPS 140-2 mode and removed :Core/Infra/Core Core issues without another label :Search/Search Search-related issues that do not fall into other categories Team:Core/Infra Meta label for core/infra team Team:Search Meta label for search team labels Jun 8, 2023
@elasticsearchmachine elasticsearchmachine added the Team:Security Meta label for security team label Jun 8, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-security (Team:Security)

@tvernum
Contributor

tvernum commented Jun 9, 2023

Has anyone been able to reproduce this locally? (I can't)

@stu-elastic
Contributor

I haven't been able to trigger the FIPS version locally; my runs have all been non-FIPS.

@bpintea
Contributor

bpintea commented Jun 9, 2023

FWIW, another failure: https://gradle-enterprise.elastic.co/s/25zqbmucs2ylk

@stu-elastic
Contributor

I'm removing my assignment until security can determine it's an index version issue rather than a fips issue.

FWIW, this task has constantly been failing, there have only been three successes recently. @mark-vieira is there a good way to mute this just for fips for now?

@stu-elastic stu-elastic removed their assignment Jun 12, 2023
@mark-vieira
Contributor

FWIW, this task has constantly been failing, there have only been three successes recently. @mark-vieira is there a good way to mute this just for fips for now?

You can add an assumeFalse() to the test. For example.
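
The sketch below is not the example linked in the comment above. It is a stdlib-only illustration of what an assumption-based mute does: the test is skipped rather than failed when a condition holds. The local `assumeFalse` and `TestSkipped` are hypothetical stand-ins for a test framework's assumption mechanism, and reading the `tests.fips.enabled` system property (the flag used in the Gradle invocation in this thread) as the FIPS signal is an assumption about how a test might detect FIPS mode.

```java
// Stand-alone illustration of muting a test under FIPS via an assumption.
class FipsAssumptionSketch {

    // Hypothetical stand-in for a framework's "assumption violated" signal.
    static class TestSkipped extends RuntimeException {
        TestSkipped(String message) {
            super(message);
        }
    }

    // Stand-in for assumeFalse(message, condition): skip when the condition is true.
    static void assumeFalse(String message, boolean condition) {
        if (condition) {
            throw new TestSkipped(message);
        }
    }

    // Assumption: FIPS mode is signalled through the tests.fips.enabled system property.
    static boolean fipsEnabled() {
        return Boolean.getBoolean("tests.fips.enabled");
    }

    // A test body guarded by the assumption; returns a status string for demonstration.
    static String runTest() {
        try {
            assumeFalse("muted in FIPS mode, see issue #96134", fipsEnabled());
        } catch (TestSkipped e) {
            return "skipped: " + e.getMessage();
        }
        return "ran";
    }

    public static void main(String[] args) {
        System.out.println(runTest());
    }
}
```

In a real test class the assumption would come from the test framework, which reports the test as skipped instead of returning a string.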

@stu-elastic
Contributor

Silenced in #96776

@tvernum
Contributor

tvernum commented Jun 13, 2023

I can reproduce on a dedicated worker.

./gradlew :qa:ccs-rolling-upgrade-remote-cluster:check  -Druntime.java=17 -Dtests.fips.enabled=true
* What went wrong:
Execution failed for task ':qa:ccs-rolling-upgrade-remote-cluster:v7.17.11#oldClusterTest'.
> process was found dead while waiting for ports files, node{:qa:ccs-rolling-upgrade-remote-cluster:v7.17.11-local-0}
[2023-06-13T03:07:38,498][ERROR][o.e.b.Elasticsearch      ] [v7.17.11-local-0] fatal exception while booting Elasticsearch org.elasticsearch.ElasticsearchException: failed to bind service
        at [email protected]/org.elasticsearch.node.Node.<init>(Node.java:1190)
        at [email protected]/org.elasticsearch.node.Node.<init>(Node.java:334)
        at [email protected]/org.elasticsearch.bootstrap.Elasticsearch$2.<init>(Elasticsearch.java:231)
        at [email protected]/org.elasticsearch.bootstrap.Elasticsearch.initPhase3(Elasticsearch.java:231)
        at [email protected]/org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:71)
Caused by: org.elasticsearch.gateway.CorruptStateException: Format version is not supported. Upgrading to [8.9.0] is only supported from version [7.17.0].
        at [email protected]/org.elasticsearch.env.NodeEnvironment.checkForIndexCompatibility(NodeEnvironment.java:515)

It looks like the data directory is empty on the node:

$ ls -sR /dev/shm/elasticsearch/qa/ccs-rolling-upgrade-remote-cluster/build/testclusters/v7.17.11-local-0/data
/dev/shm/elasticsearch/qa/ccs-rolling-upgrade-remote-cluster/build/testclusters/v7.17.11-local-0/data:
total 0
0 node.lock  0 nodes

/dev/shm/elasticsearch/qa/ccs-rolling-upgrade-remote-cluster/build/testclusters/v7.17.11-local-0/data/nodes:
total 0
0 0

/dev/shm/elasticsearch/qa/ccs-rolling-upgrade-remote-cluster/build/testclusters/v7.17.11-local-0/data/nodes/0:
total 0
0 node.lock

When I run the test without FIPS and kill it at the same point, there is an indices/ directory and various data files in _state/.

So, I think the question is why no data gets written when FIPS is enabled.
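
The manual `ls -sR` inspection above could be automated along these lines. This is a rough sketch, assuming that a data directory counts as "uninitialised" when the only regular files under it are named `node.lock`; the class and method names are illustrative and are not part of Elasticsearch or its test framework.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Illustrative helper for checking whether a node data directory holds any real state.
class DataDirInspector {

    // Returns every regular file under root that is not a node.lock file.
    static List<Path> nonLockFiles(Path root) throws IOException {
        try (Stream<Path> walk = Files.walk(root)) {
            return walk
                .filter(Files::isRegularFile)
                .filter(p -> !"node.lock".equals(p.getFileName().toString()))
                .sorted()
                .collect(Collectors.toList());
        }
    }

    // True when the directory tree contains lock files (or nothing) but no other data.
    static boolean hasOnlyLockFiles(Path root) throws IOException {
        return nonLockFiles(root).isEmpty();
    }

    public static void main(String[] args) throws IOException {
        // Recreate the layout from the listing above: data/node.lock and data/nodes/0/node.lock.
        Path data = Files.createTempDirectory("data");
        Files.createFile(data.resolve("node.lock"));
        Path legacyNode = Files.createDirectories(data.resolve("nodes").resolve("0"));
        Files.createFile(legacyNode.resolve("node.lock"));
        System.out.println("only lock files present: " + hasOnlyLockFiles(data));
    }
}
```

Running such a check at the moment the old node is stopped would distinguish a truly empty directory from one that has lock files but no state.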

@tvernum
Contributor

tvernum commented Jun 13, 2023

It looks like I killed the non-FIPS test too late, and my conclusion was wrong.

With more debugging, it now seems that the FIPS test is creating a node.lock etc. too early. There should be nothing in the data directory at this point, but there is.
I wonder if the clusters are interfering with each other somehow (on FIPS only).

@slobodanadamovic
Contributor

Not sure if this explains the cause of this issue, but I've noticed a pattern in these failures. Whenever the execution fails, it seems that the :qa:ccs-rolling-upgrade-remote-cluster:v8.9.0 task was started before :qa:ccs-rolling-upgrade-remote-cluster:v7.17.11:

image

In case of successful executions, the versions are in ascending order:

image

@mark-vieira
Contributor

The order shouldn't matter. These clusters use fresh working directories so they won't interfere with each other.

@tvernum
Contributor

tvernum commented Jun 22, 2023

I think (wild theory incoming ...) this test is just broken, and the brokenness shows up differently with FIPS. I'm not sure why it just started happening.

Essentially, when forming the clusters for this test, we start to upgrade the node versions before we make sure a cluster has formed.

On non-FIPS that seems to kill the cluster while it still has an empty data dir, and the upgraded node is fine (it works like a clean boot).
On FIPS it kills it while it has a node lock but no actual state. That means that the upgraded node doesn't start because it has a data-dir that isn't safe (the failure message isn't ideal, but the state is genuinely bad).

I tried to add localCluster.get().waitForAllConditions() before upgrading the node version, but then it fails with:

Caused by: java.lang.IllegalStateException: node version [7.17.11] may not join a cluster comprising only nodes of version [8.9.0] or greater
»       at org.elasticsearch.cluster.coordination.NodeJoinExecutor.ensureVersionBarrier(NodeJoinExecutor.java:414) ~[?:?]
»       at org.elasticsearch.cluster.coordination.Coordinator.validateJoinRequest(Coordinator.java:689) ~[elasticsearch-7.17.11-SNAPSHOT.jar:7.17.11-SNAPSHOT]
»       at org.elasticsearch.cluster.coordination.Coordinator$2.onResponse(Coordinator.java:631) ~[elasticsearch-7.17.11-SNAPSHOT.jar:7.17.11-SNAPSHOT]
»       at org.elasticsearch.cluster.coordination.Coordinator$2.onResponse(Coordinator.java:626) ~[elasticsearch-7.17.11-SNAPSHOT.jar:7.17.11-SNAPSHOT]

(even on non-FIPS).

I haven't tried to work out why we're ending up in a state where we're adding a 7.x node to an 8.x cluster. This test confuses me a bit and I'm not sure exactly what it's trying to do.
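
The ordering concern in this comment (upgrading a node before the cluster has formed) can be illustrated with a generic wait-then-act helper. This is only a sketch of the idea and not a fix: as noted above, adding `waitForAllConditions()` to the real test led to a different failure. The `waitFor` helper and the simulated readiness condition are hypothetical and do not use the Elasticsearch test-cluster API.

```java
import java.util.function.BooleanSupplier;

// Generic "wait until ready, then act" pattern; names are illustrative.
class WaitBeforeUpgradeSketch {

    // Polls the condition until it holds or the timeout elapses.
    // Returns true if the condition was met, false on timeout or interruption.
    static boolean waitFor(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // Simulated cluster that reports "formed" on the third poll.
        int[] polls = {0};
        boolean formed = waitFor(() -> ++polls[0] >= 3, 1_000, 10);
        if (formed) {
            System.out.println("cluster formed; the node version can be upgraded now");
        } else {
            System.out.println("timed out; upgrading now would act on a half-initialised node");
        }
    }
}
```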

@quux00
Contributor

quux00 commented Jul 12, 2023

This is still failing - on 8.8 today: https://gradle-enterprise.elastic.co/s/4zwyhji33bvl4

@kingherc
Contributor Author

Another one on 8.8 at https://gradle-enterprise.elastic.co/s/qdti5ijqsjxaq

@gwbrown gwbrown added the low-risk An open issue or test failure that is a low risk to future releases label Oct 13, 2023
@elasticsearchmachine elasticsearchmachine closed this as not planned Nov 5, 2024
@elasticsearchmachine
Collaborator

This issue has been closed because it has been open for too long with no activity.

Any muted tests that were associated with this issue have been unmuted.

If the tests begin failing again, a new issue will be opened, and they may be muted again.

@slobodanadamovic
Contributor

Any muted tests that were associated with this issue have been unmuted.

Reopening because the test is still excluded from running in FIPS mode.

jakelandis added a commit that referenced this issue Jan 8, 2025
For undetermined reasons this test is flaky when run in FIPS mode.
There is suspicion that the failure is due to some odd test-only behavior.
This PR re-enables the test now that
a) the FIPS library jar(s) have been updated
b) the BWC from main is 8->9 (not 7->8).
This un-mute will not be backported to 8.x, and the test will remain muted in the 8.x branch.
🤞 that whatever caused the instability is addressed by the newer versions.
If not, we can re-mute this test when running in FIPS mode.
The value of this test is not specific to FIPS and if these problems continue,
then it is unlikely worth the effort to continue the investigation into why it is
flaky when FIPS is enabled.

closes #96134