I just spent a couple of hours debugging an outage on my personal cluster after a cluster upgrade, and the fix was relatively simple, so I wanted to raise an issue in case others hit the same problem.
I upgraded my personal cluster from Consul 1.20.0 -> 1.20.2, and Nomad 1.9.4 -> 1.9.5. Part of that upgrade involves rebooting the servers, and the process has worked fine for a few years. This time, after doing so, all Consul service mesh / Envoy containers started failing, and I was struggling to figure out why. The logs all looked like this:
[2025-02-01 15:30:17.148][1][info][admin] [source/server/admin/admin.cc:65] admin address: 127.0.0.2:19001
[2025-02-01 15:30:17.149][1][info][config] [source/server/configuration_impl.cc:168] loading tracing configuration
[2025-02-01 15:30:17.149][1][info][config] [source/server/configuration_impl.cc:124] loading 0 static secret(s)
[2025-02-01 15:30:17.149][1][info][config] [source/server/configuration_impl.cc:130] loading 1 cluster(s)
[2025-02-01 15:30:17.211][1][info][config] [source/server/configuration_impl.cc:138] loading 0 listener(s)
[2025-02-01 15:30:17.211][1][info][config] [source/server/configuration_impl.cc:154] loading stats configuration
[2025-02-01 15:30:17.211][1][info][runtime] [source/common/runtime/runtime_impl.cc:625] RTDS has finished initialization
[2025-02-01 15:30:17.211][1][info][upstream] [source/common/upstream/cluster_manager_impl.cc:245] cm init: initializing cds
[2025-02-01 15:30:17.212][1][warning][main] [source/server/server.cc:936] There is no configured limit to the number of allowed active downstream connections. Configure a limit in `envoy.resource_monitors.downstream_connections` resource monitor.
[2025-02-01 15:30:17.212][1][info][main] [source/server/server.cc:978] starting main dispatch loop
[2025-02-01 15:30:17.221][1][info][upstream] [source/common/upstream/cds_api_helper.cc:32] cds: add 2 cluster(s), remove 0 cluster(s)
[2025-02-01 15:30:17.366][1][info][upstream] [source/common/upstream/cds_api_helper.cc:71] cds: added/updated 2 cluster(s), skipped 0 unmodified cluster(s)
[2025-02-01 15:30:17.366][1][info][upstream] [source/common/upstream/cluster_manager_impl.cc:223] cm init: initializing secondary clusters
[2025-02-01 15:30:17.368][1][info][upstream] [source/common/upstream/cluster_manager_impl.cc:249] cm init: all clusters initialized
[2025-02-01 15:30:17.368][1][info][main] [source/server/server.cc:958] all clusters initialized. initializing init manager
[2025-02-01 15:30:17.375][1][info][upstream] [source/common/listener_manager/lds_api.cc:106] lds: add/update listener 'public_listener:0.0.0.0:31944'
[2025-02-01 15:30:17.376][1][info][upstream] [source/common/listener_manager/lds_api.cc:106] lds: add/update listener 'shared-redis:127.0.0.1:6379'
[2025-02-01 15:30:17.376][1][info][config] [source/common/listener_manager/listener_manager_impl.cc:930] all dependencies initialized. starting workers
[2025-02-01 15:31:22.570][1][warning][config] [./source/extensions/config_subscription/grpc/grpc_stream.h:176] DeltaAggregatedResources gRPC config stream to local_agent closed: 13,
[2025-02-01 15:31:22.608][1][warning][main] [source/server/server.cc:907] caught ENVOY_SIGTERM
[2025-02-01 15:31:22.608][1][info][main] [source/server/server.cc:1046] shutting down server instance
[2025-02-01 15:31:22.608][1][info][main] [source/server/server.cc:986] main dispatch loop exited
[2025-02-01 15:31:22.613][1][info][main] [source/server/server.cc:1038] exiting
So startup looked OK: "starting workers", then a pause of roughly a minute, and then "gRPC config stream to local_agent closed: 13". This persisted across multiple reboots, a rollback, and all attempts on all jobs over the space of an hour or so.
After some searching I found that some folks had fixed (seemingly) unrelated issues by setting `acl.enable_token_persistence` to `false`, and that also fixed my issue - after setting it and restarting, the sidecar workloads immediately started working again.
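For reference, the change was just flipping that one setting in the client agent's `acl` stanza - the rest of the stanza below is illustrative rather than my exact config:

```hcl
# Consul client agent ACL configuration (sketch - only the
# enable_token_persistence line is the actual change; the other
# values are placeholders for whatever the agent already uses).
acl {
  enabled                  = true
  default_policy           = "deny"
  enable_token_persistence = false
}
```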
My guess is that somehow a token got broken/corrupted/wiped out after a server reboot (which is odd in itself). I'm quite certain that changing this flag is what fixed the problem (but there's always a small chance that the Nth restart fixed the underlying problem).
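If anyone wants to check for the same thing, this is roughly how I'd look at the persisted token state (paths are assumptions - `/opt/consul` is just a common `data_dir`, adjust for your setup):

```shell
# With token persistence enabled, the client agent writes its ACL tokens
# to acl-tokens.json under the data_dir; inspect it after a reboot to see
# whether the persisted token is still there and looks sane.
sudo cat /opt/consul/acl-tokens.json

# Ask the servers whether the token the local agent resolves for this
# request is still valid.
consul acl token read -self
```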
This is the issue that helped me - interestingly, it includes the addition of a preflight check that I think was supposed to fix (or at least detect) something related: hashicorp/nomad#20516
I'm also not sure whether this issue should live on the Nomad repo or this one - I changed a Consul config option to fix things, so I put it here.
My suggestion is that, if you think this might be an issue, it could potentially be added to https://support.hashicorp.com/hc/en-us/articles/5295078989075-Resolving-Common-Errors-in-Envoy-Proxy-A-Troubleshooting-Guide - there's already an example covering "error 14", but that's slightly different from my problem.
I have no idea how I would reproduce this, or what info might be helpful, so let me know if I can give you anything to help. Some basic info:
- Consul: 1.20.0 -> 1.20.2
- Nomad: 1.9.4 -> 1.9.5
- `verify_incoming` set to `false`