Client connection errors occur during rolling restart on Kubernetes deployment for client-side pooled application #902

Open
ian-axelrod-sl opened this issue Jan 31, 2025 · 1 comment

Comments

ian-axelrod-sl commented Jan 31, 2025

Describe the bug
Performing a rolling restart of Kubernetes-deployed pgcat results in a portion of client connections failing with the error message:

psycopg2.OperationalError: SSL SYSCALL error: EOF detected

I am able to perform the same rolling restart on PgBouncer without errors, using the same settings where applicable.

To Reproduce
Steps to reproduce the behavior:

  1. Deploy pgcat on a Kubernetes cluster.
  2. Configure client-side connection pooling in the application with a connection recycle (i.e., lifetime) period that is shorter than pgcat's graceful termination period. The application should be a Python application using SQLAlchemy, with pre-ping left disabled (a sketch of such a configuration follows this list).
  3. After the application has connected, perform a rolling deploy.
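
A minimal sketch of the client-side pooling configuration I have in mind (the connection URL, pool size, and recycle value here are illustrative assumptions, not my production values):

    from sqlalchemy import create_engine, text

    # Hypothetical connection URL pointing at the pgcat service in the cluster.
    engine = create_engine(
        "postgresql+psycopg2://app:secret@pgcat.default.svc.cluster.local:6432/appdb",
        pool_size=5,
        pool_recycle=120,     # recycle connections after 120 s, shorter than pgcat's grace period
        pool_pre_ping=False,  # default: no health check on checkout
    )

    # Checking out a connection that pgcat has already closed on its side
    # can fail with "SSL SYSCALL error: EOF detected".
    with engine.connect() as conn:
        conn.execute(text("SELECT 1"))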

Expected behavior
Rolling restart should not produce any connection errors on the client's side.

Additional context
I believe this issue stems from the lack of an optimistic connection health check in the codebase I work with. SQLAlchemy (a Python ORM) can issue a health check query when a connection is checked out of its pool, but this setting is not enabled by default and my codebase does not use it.
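
For reference, the SQLAlchemy setting I mean is pool_pre_ping; a sketch of enabling it (the connection URL is again an assumption):

    from sqlalchemy import create_engine

    # With pre-ping enabled, SQLAlchemy issues a lightweight liveness check on each
    # checkout and transparently replaces connections the pooler has already closed.
    engine = create_engine(
        "postgresql+psycopg2://app:secret@pgcat.default.svc.cluster.local:6432/appdb",
        pool_recycle=120,
        pool_pre_ping=True,
    )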

Assuming this is the issue, I am not sure whether I would consider this a bug or a feature request. Ideally there would be an option to close client connections only when the clients themselves initiate the disconnect. I believe this is what PgBouncer does. In pgcat, on a shutdown request, I see messages like the following:

Client [IP] disconnected, session duration: 0d 00:00:43.36

I have the connection lifetime in SQLAlchemy set to 120 seconds, so the fact that clients disconnect earlier than this makes me think pgcat is initiating the disconnect. With PgBouncer, I see that it takes the full two minutes for clients to drain. I would expect to see something like this:

2025-01-31T02:58:33.780711Z  INFO ThreadId(13) pgcat: Client [ip] disconnected, session duration: 0d 00:02:00.641    
2025-01-31T02:58:34.091768Z  INFO ThreadId(12) pgcat: Client [ip]  disconnected, session duration: 0d 00:02:01.106    
2025-01-31T02:58:34.098774Z  INFO ThreadId(12) pgcat: Client [ip]  disconnected, session duration: 0d 00:02:02.368    
2025-01-31T02:58:34.133667Z  INFO ThreadId(13) pgcat: Client [ip]  disconnected, session duration: 0d 00:02:01.257

ian-axelrod-sl commented Jan 31, 2025

Essentially, I was expecting graceful shutdown to implement, by default, PgBouncer's SHUTDOWN WAIT_FOR_CLIENTS behavior:

Stop accepting new connections and shutdown the process once all existing clients have disconnected.

Based on their zero-downtime example right below that quote:

  1. Run SHUTDOWN WAIT_FOR_CLIENTS (or send SIGTERM) to process A.
  2. Cause all clients to reconnect. Possibly by waiting some time until the client-side pooler causes reconnects due to its server_idle_timeout (or similar config). Or, if no client-side pooler is used, possibly by restarting the clients. Once all clients have reconnected, process A will exit automatically, because no clients are connected to it anymore.
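
For concreteness, a sketch of issuing that command against PgBouncer's admin console (host, port, and credentials are assumptions on my part; sending SIGTERM to the process should have the same effect):

    import psycopg2

    # PgBouncer exposes its admin console as the virtual "pgbouncer" database.
    conn = psycopg2.connect(
        host="127.0.0.1", port=6432, dbname="pgbouncer", user="pgbouncer"
    )
    conn.autocommit = True  # admin commands cannot run inside a transaction
    with conn.cursor() as cur:
        # Stop accepting new connections; exit once existing clients disconnect.
        cur.execute("SHUTDOWN WAIT_FOR_CLIENTS")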

It seems to me that PgBouncer simply waits for clients to disconnect on their own terms. As far as I can tell, it does not inform clients of the shutdown. This may be less efficient, but it seems like a much safer default.

From what I understand of pgcat's code, which is very little since I do not know Rust, it looks like you are explicitly sending a termination message to idle clients. This would break clients that do not handle that message and expect to check out a connection that is in good health.

Correct me if I am completely misunderstanding how the code works.
