[8.x] [kbn-test] retry 5xx in saml callback (#208977) #211023

Merged: 2 commits, Feb 13, 2025

kibanamachine
Contributor

Backport

This will backport the following commits from `main` to `8.x`:
- [kbn-test] retry 5xx in saml callback (#208977)

### Questions ?

Please refer to the [Backport tool documentation](https://github.com/sqren/backport)


## Summary

When Scout tests run in parallel, SAML authentication is also called in
parallel. Because the `.security-profile-8` index does not exist by
default, we periodically get a 503 response:

```
 proc [kibana] [2025-01-29T11:13:10.420+01:00][ERROR][plugins.security.user-profile]
Failed to activate user profile: {"error":{"root_cause":[{"type":"unavailable_shards_exception","reason":
"at least one search shard for the index [.security-profile-8] is unavailable"}],
"type":"unavailable_shards_exception","reason":"at least one search shard
for the index [.security-profile-8] is unavailable"},"status":503}. {"service":{"node":
{"roles":["background_tasks","ui"]}}}
```

The solution is to retry the SAML callback, on the assumption that the
index will have been created by the next attempt and the request will then succeed.
We agreed with Kibana-Security to retry only **5xx** errors, because for
**4xx** errors we most likely have to restart the authentication from the beginning.
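
As an illustration of the rule above (retry on 5xx, fail immediately on anything else), here is a minimal sketch. It is not the code from this PR: `isRetryableStatus`, `callWithRetryOn5xx`, `sendCallback`, `maxAttempts`, and `delayMs` are hypothetical names, and the fixed delay is an assumption (the log output further down suggests the real delay varies).

```typescript
// Hypothetical sketch; names and defaults are assumptions, not the @kbn/test API.
type CallbackResult = { status: number };

// Only server-side failures (500-599) are treated as transient.
function isRetryableStatus(status: number): boolean {
  return status >= 500 && status <= 599;
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Calls sendCallback until it returns the expected 302 redirect.
// A 5xx response is retried up to maxAttempts; any other status fails at once,
// because a 4xx most likely means authentication must be restarted.
async function callWithRetryOn5xx(
  sendCallback: () => Promise<CallbackResult>,
  maxAttempts = 3,
  delayMs = 1000
): Promise<CallbackResult> {
  for (let attempt = 1; ; attempt++) {
    const result = await sendCallback();
    if (result.status === 302) {
      return result;
    }
    if (!isRetryableStatus(result.status) || attempt >= maxAttempts) {
      throw new Error(`SAML callback failed: expected 302, got ${result.status}`);
    }
    await sleep(delayMs);
  }
}
```

With this shape, a callback that returns 500 once and then 302 succeeds on the second attempt, while a callback that returns 401 throws after a single call.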

For reviewers: the failure is not 100% reproducible, so I added unit tests to
verify that the retry logic applies only to 5xx responses. Please let me
know if I missed something.

The retry was verified locally; you may see log output like this:

```
 proc [kibana] [2025-01-30T18:40:41.348+01:00][ERROR][plugins.security.user-profile] Failed to activate user profile:
{"error":{"root_cause":[{"type":"unavailable_shards_exception","reason":"at least one search shard for the index
[.security-profile-8] is unavailable"}],"type":"unavailable_shards_exception","reason":"at least one search shard
for the index [.security-profile-8] is unavailable"},"status":503}. {"service":{"node":{"roles":["background_tasks","ui"]}}}
 proc [kibana] [2025-01-30T18:40:41.349+01:00][ERROR][plugins.security.authentication] Login attempt with "saml"
provider failed due to unexpected error: {"error":{"root_cause":[{"type":"unavailable_shards_exception","reason":
"at least one search shard for the index [.security-profile-8] is unavailable"}],"type":"unavailable_shards_exception",
"reason":"at least one search shard for the index [.security-profile-8] is unavailable"},"status":503}
{"service":{"node":{"roles":["background_tasks","ui"]}}}
 proc [kibana] [2025-01-30T18:40:41.349+01:00][ERROR][http] 500 Server Error {"http":{"response":{"status_code":500},"request":{"method":"post","path":"/api/security/saml/callback"}},"error":
{"message":"unavailable_shards_exception\n\tRoot causes:\n\t\tunavailable_shards_exception: at least one
search shard for the index [.security-profile-8] is
    ERROR [scout] SAML callback failed: expected 302, got 500
    Waiting 939 ms before the next attempt
 proc [playwright]
 info [o.e.c.r.a.AllocationService] [scout] current.health="GREEN" message="Cluster health status changed
from [YELLOW] to [GREEN] (reason: [shards started [[.security-profile-8][0]]])."
previous.health="YELLOW" reason="shards started [[.security-profile-8][0]]"
```

To reproduce:
```
node scripts/scout.js run-tests --stateful --config x-pack/platform/plugins/private/discover_enhanced/ui_tests/parallel.playwright.config.ts
```

---------

Co-authored-by: kibanamachine <[email protected]>
(cherry picked from commit 2b5bbf8)
@kibanamachine kibanamachine merged commit 9347de1 into elastic:8.x Feb 13, 2025
8 checks passed
@elasticmachine
Contributor

💚 Build Succeeded

Metrics [docs]

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run `node scripts/build_api_docs --plugin [yourplugin] --stats comments` for more detailed information.

| id | before | after | diff |
| --- | --- | --- | --- |
| @kbn/scout | 84 | 85 | +1 |

Unknown metric groups

API count

| id | before | after | diff |
| --- | --- | --- | --- |
| @kbn/scout | 112 | 113 | +1 |

cc @dmlemeshko
