Consul service mesh containers failing in Nomad due to token persistence after upgrade #22115

Open
lopcode opened this issue Feb 1, 2025 · 1 comment

lopcode commented Feb 1, 2025

Overview of the Issue

Hi there,

I just spent a couple of hours debugging an outage on my personal cluster after a cluster upgrade, and the fix was relatively simple, so I wanted to raise an issue in case others are hitting the same problem.

I upgraded my personal cluster from Consul 1.20.0 -> 1.20.2 and Nomad 1.9.4 -> 1.9.5. Part of that upgrade involves rebooting the servers, and the process has worked fine for a few years. This time, after the reboot, all Consul service mesh / Envoy containers started failing, and I was struggling to figure out why. The logs all looked like this:

[2025-02-01 15:30:17.148][1][info][admin] [source/server/admin/admin.cc:65] admin address: 127.0.0.2:19001
[2025-02-01 15:30:17.149][1][info][config] [source/server/configuration_impl.cc:168] loading tracing configuration
[2025-02-01 15:30:17.149][1][info][config] [source/server/configuration_impl.cc:124] loading 0 static secret(s)
[2025-02-01 15:30:17.149][1][info][config] [source/server/configuration_impl.cc:130] loading 1 cluster(s)
[2025-02-01 15:30:17.211][1][info][config] [source/server/configuration_impl.cc:138] loading 0 listener(s)
[2025-02-01 15:30:17.211][1][info][config] [source/server/configuration_impl.cc:154] loading stats configuration
[2025-02-01 15:30:17.211][1][info][runtime] [source/common/runtime/runtime_impl.cc:625] RTDS has finished initialization
[2025-02-01 15:30:17.211][1][info][upstream] [source/common/upstream/cluster_manager_impl.cc:245] cm init: initializing cds
[2025-02-01 15:30:17.212][1][warning][main] [source/server/server.cc:936] There is no configured limit to the number of allowed active downstream connections. Configure a limit in `envoy.resource_monitors.downstream_connections` resource monitor.
[2025-02-01 15:30:17.212][1][info][main] [source/server/server.cc:978] starting main dispatch loop
[2025-02-01 15:30:17.221][1][info][upstream] [source/common/upstream/cds_api_helper.cc:32] cds: add 2 cluster(s), remove 0 cluster(s)
[2025-02-01 15:30:17.366][1][info][upstream] [source/common/upstream/cds_api_helper.cc:71] cds: added/updated 2 cluster(s), skipped 0 unmodified cluster(s)
[2025-02-01 15:30:17.366][1][info][upstream] [source/common/upstream/cluster_manager_impl.cc:223] cm init: initializing secondary clusters
[2025-02-01 15:30:17.368][1][info][upstream] [source/common/upstream/cluster_manager_impl.cc:249] cm init: all clusters initialized
[2025-02-01 15:30:17.368][1][info][main] [source/server/server.cc:958] all clusters initialized. initializing init manager
[2025-02-01 15:30:17.375][1][info][upstream] [source/common/listener_manager/lds_api.cc:106] lds: add/update listener 'public_listener:0.0.0.0:31944'
[2025-02-01 15:30:17.376][1][info][upstream] [source/common/listener_manager/lds_api.cc:106] lds: add/update listener 'shared-redis:127.0.0.1:6379'
[2025-02-01 15:30:17.376][1][info][config] [source/common/listener_manager/listener_manager_impl.cc:930] all dependencies initialized. starting workers
[2025-02-01 15:31:22.570][1][warning][config] [./source/extensions/config_subscription/grpc/grpc_stream.h:176] DeltaAggregatedResources gRPC config stream to local_agent closed: 13, 
[2025-02-01 15:31:22.608][1][warning][main] [source/server/server.cc:907] caught ENVOY_SIGTERM
[2025-02-01 15:31:22.608][1][info][main] [source/server/server.cc:1046] shutting down server instance
[2025-02-01 15:31:22.608][1][info][main] [source/server/server.cc:986] main dispatch loop exited
[2025-02-01 15:31:22.613][1][info][main] [source/server/server.cc:1038] exiting

So startup looked OK ("starting workers"), then, about a minute later, "DeltaAggregatedResources gRPC config stream to local_agent closed: 13". This persisted across multiple reboots, a rollback, and all attempts on all jobs over the space of an hour or so.

After some searching I found that some folks had success with (seemingly) unrelated issues by setting acl.enable_token_persistence to false, and that also fixed my issue - after setting it and restarting, the sidecar workloads immediately started working again.
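
For reference, the change boils down to this in the Consul agent configuration - a minimal sketch, where the enabled line just reflects that ACLs are on in my setup and isn't part of the fix itself:

acl {
  enabled                  = true
  # the relevant change: stop the agent persisting/reloading tokens across restarts
  enable_token_persistence = false
}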

My guess is that somehow a persisted token got broken, corrupted, or wiped out by a server reboot (which is odd in itself). I'm quite certain that changing this flag is what fixed the problem (though there's always a small chance that the Nth restart fixed the underlying issue).
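
If anyone else hits this and wants to check whether a stale persisted token is involved before flipping the flag, my understanding is that the agent stores persisted tokens in acl-tokens.json under its data_dir. Something like the following should show what the agent will reload on restart (the data_dir path and token value are placeholders from my setup, not exact values to copy):

# inspect the tokens the agent will reload on restart (data_dir path is an example)
sudo cat /opt/consul/data/acl-tokens.json

# check whether a token from that file still resolves to something valid
CONSUL_HTTP_TOKEN=<token-from-file> consul acl token read -self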

My suggestion is that, if you think this might be a wider issue, it could potentially be added to https://support.hashicorp.com/hc/en-us/articles/5295078989075-Resolving-Common-Errors-in-Envoy-Proxy-A-Troubleshooting-Guide - there's already an example there for "error 14", but that's slightly different from my problem.

I have no idea how I would reproduce this, or what info might be helpful, so let me know if I can give you anything to help. Some basic info:

  • Consul versions: 1.20.0 -> 1.20.2
  • Nomad versions: 1.9.4 -> 1.9.5
  • Using TLS (except with gRPC verify_incoming set to false)
  • Using workload identity / ACL

lopcode commented Feb 1, 2025

This is the issue that helped me - interestingly, it includes the addition of a preflight check that I think was supposed to fix (or at least detect) something related: hashicorp/nomad#20516

I'm also not sure whether this issue should live in the Nomad repo or this one - I changed a Consul config option to fix things, so I put it here.

lopcode changed the title from "Nomad service mesh containers failing due to token persistence after upgrade" to "Consul service mesh containers failing in Nomad due to token persistence after upgrade" on Feb 1, 2025