I just spent a couple of hours debugging an outage on my personal cluster after a cluster upgrade, and the fix was relatively simple, so I wanted to raise an issue in case others hit the same problem.
I upgraded my personal cluster from Consul 1.20.0 -> 1.20.2, and Nomad 1.9.4 -> 1.9.5. Part of that upgrade involves rebooting the servers, and the process has worked fine for a few years. This time, after doing so, all Consul service mesh / Envoy containers started failing, and I was struggling to figure out why. The logs all looked like this:
[2025-02-01 15:30:17.148][1][info][admin] [source/server/admin/admin.cc:65] admin address: 127.0.0.2:19001
[2025-02-01 15:30:17.149][1][info][config] [source/server/configuration_impl.cc:168] loading tracing configuration
[2025-02-01 15:30:17.149][1][info][config] [source/server/configuration_impl.cc:124] loading 0 static secret(s)
[2025-02-01 15:30:17.149][1][info][config] [source/server/configuration_impl.cc:130] loading 1 cluster(s)
[2025-02-01 15:30:17.211][1][info][config] [source/server/configuration_impl.cc:138] loading 0 listener(s)
[2025-02-01 15:30:17.211][1][info][config] [source/server/configuration_impl.cc:154] loading stats configuration
[2025-02-01 15:30:17.211][1][info][runtime] [source/common/runtime/runtime_impl.cc:625] RTDS has finished initialization
[2025-02-01 15:30:17.211][1][info][upstream] [source/common/upstream/cluster_manager_impl.cc:245] cm init: initializing cds
[2025-02-01 15:30:17.212][1][warning][main] [source/server/server.cc:936] There is no configured limit to the number of allowed active downstream connections. Configure a limit in `envoy.resource_monitors.downstream_connections` resource monitor.
[2025-02-01 15:30:17.212][1][info][main] [source/server/server.cc:978] starting main dispatch loop
[2025-02-01 15:30:17.221][1][info][upstream] [source/common/upstream/cds_api_helper.cc:32] cds: add 2 cluster(s), remove 0 cluster(s)
[2025-02-01 15:30:17.366][1][info][upstream] [source/common/upstream/cds_api_helper.cc:71] cds: added/updated 2 cluster(s), skipped 0 unmodified cluster(s)
[2025-02-01 15:30:17.366][1][info][upstream] [source/common/upstream/cluster_manager_impl.cc:223] cm init: initializing secondary clusters
[2025-02-01 15:30:17.368][1][info][upstream] [source/common/upstream/cluster_manager_impl.cc:249] cm init: all clusters initialized
[2025-02-01 15:30:17.368][1][info][main] [source/server/server.cc:958] all clusters initialized. initializing init manager
[2025-02-01 15:30:17.375][1][info][upstream] [source/common/listener_manager/lds_api.cc:106] lds: add/update listener 'public_listener:0.0.0.0:31944'
[2025-02-01 15:30:17.376][1][info][upstream] [source/common/listener_manager/lds_api.cc:106] lds: add/update listener 'shared-redis:127.0.0.1:6379'
[2025-02-01 15:30:17.376][1][info][config] [source/common/listener_manager/listener_manager_impl.cc:930] all dependencies initialized. starting workers
[2025-02-01 15:31:22.570][1][warning][config] [./source/extensions/config_subscription/grpc/grpc_stream.h:176] DeltaAggregatedResources gRPC config stream to local_agent closed: 13,
[2025-02-01 15:31:22.608][1][warning][main] [source/server/server.cc:907] caught ENVOY_SIGTERM
[2025-02-01 15:31:22.608][1][info][main] [source/server/server.cc:1046] shutting down server instance
[2025-02-01 15:31:22.608][1][info][main] [source/server/server.cc:986] main dispatch loop exited
[2025-02-01 15:31:22.613][1][info][main] [source/server/server.cc:1038] exiting
So startup looked OK: "starting workers", then a pause of roughly a minute, and then "gRPC config stream to local_agent closed: 13". This persisted across multiple reboots, a rollback, and all attempts on all jobs over the space of an hour or so.
After some searching I found that some folks had fixed (seemingly) unrelated issues by setting `acl.enable_token_persistence` to `false`, and that also fixed my issue - after setting it and restarting, the sidecar workloads immediately started working again.
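For reference, the change was just flipping that one setting in the client agent's `acl` stanza - the rest of the stanza below is illustrative rather than my exact config:

```hcl
# Consul client agent ACL configuration (sketch - only the
# enable_token_persistence line is the actual change; the other
# values are placeholders for whatever the agent already uses).
acl {
  enabled                  = true
  default_policy           = "deny"
  enable_token_persistence = false
}
```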
My guess is that somehow a token got broken/corrupted/wiped out after a server reboot (which is odd in itself). I'm quite certain that changing this flag is what fixed the problem (but there's always a small chance that the Nth restart fixed the underlying problem).
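If anyone wants to check for the same thing, this is roughly how I'd look at the persisted token state (paths are assumptions - `/opt/consul` is just a common `data_dir`, adjust for your setup):

```shell
# With token persistence enabled, the client agent writes its ACL tokens
# to acl-tokens.json under the data_dir; inspect it after a reboot to see
# whether the persisted token is still there and looks sane.
sudo cat /opt/consul/acl-tokens.json

# Ask the servers whether the token the local agent resolves for this
# request is still valid.
consul acl token read -self
```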
This is the issue that helped me - interestingly, it includes the addition of a preflight check that I think was supposed to fix (or at least detect) something related: hashicorp/nomad#20516
I'm also not sure whether this issue should live on the Nomad repo or this one - I changed a Consul config option to fix things, so I put it here.
My suggestion is that, if you think this might be an issue, it could potentially be added to https://support.hashicorp.com/hc/en-us/articles/5295078989075-Resolving-Common-Errors-in-Envoy-Proxy-A-Troubleshooting-Guide - there's already an example covering "error 14", but that's slightly different from my problem.
I have no idea how I would reproduce this, or what info might be helpful, so let me know if I can give you anything to help. Some basic info:
- Consul: 1.20.0 -> 1.20.2
- Nomad: 1.9.4 -> 1.9.5
- `verify_incoming` set to `false`