Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mitigate thread pool role thread restarts (+UnbufferedReadLine fix) #143

Merged
merged 4 commits into from
Jul 6, 2024

Conversation

RedFlames
Copy link
Collaborator

@RedFlames RedFlames commented Jul 5, 2024

Attempting to fix three places where thread roles can die unexpectedly and cause them to restart.

Edit: This now also has a ReadLine buffering related fix, see further down.


Within 2024 these are the uniq -c times these occurred:

     10  Error in thread pool thread x: System.Collections.Generic.KeyNotFoundException: The given key '...' was not present in the dictionary.
    241  Error in thread pool thread x: System.InvalidOperationException: Not currently flushing queue 'UDP Queue'!
    293  Error in thread pool thread x: System.SystemException: Couldn't poll the EPoll FD: 4

"The given key '...' was not present in the dictionary."

At one point in the TCPEPollPoller there used to be a

ConIds.TryGetValue(id, out ConPlusTCPUDPConnection? conId)

but got turned into

yield return ConIds[id];

which can fail somehow.
Does StartPolling need to grab PollerLock.R() at some point? I have no clue how this works :)

Now I turned it back into a TryGetValue and log a warning when that fails, but proceed anyways.

"Couldn't poll the EPoll FD: 4"

So here we had, uh...

    // Poll the EPoll FD
    int ret;
    do {
        ret = epoll_wait(EpollFD, evts, 1, -1);
    } while (!token.IsCancellationRequested && ret == EINTR);

    if (ret < 0)
        throw new SystemException($"Couldn't poll the EPoll FD: {Marshal.GetLastWin32Error()}");

where the do-while can be exited by either the token cancellation request or any ret value besides EINTR. I'm unsure why EINTR is allowed to go back into the poll wait if it wasn't from cancellation.

When a cancellation does happen, I don't actually know if it does or doesn't cause an EINTR, I think it actually shouldn't, it just triggers an event on the CancelFD. And that event just gets ignored because it's int.MaxValue'd

So I guess the throw just shouldn't happen for an EINTR because it can only "break out" in conjunction with a cancellation - but the events should still be processed and then the outer while loop exits from cancellation?

Edit: I'm saying nonsense things. What even is going on.

The definition of EINTR here in TCPEPollPoller is = 4.

When successful, epoll_wait() returns the number of file descriptors ready for the requested I/O, [...]
When an error occurs, epoll_wait() returns -1 and errno is set appropriately.

So what we've been doing is... if the return value is 4, which indicates 4 successful FDs ready... we ignore them and poll again.

What this most definitely meant to check for is... && ret < 0 && Marshal.GetLastWin32Error() == EINTR, though I'm still unsure for the reason. 🙃

"Not currently flushing queue 'UDP Queue'!"

First off, there's also these errors for TCP Queue but they don't kill the thread role:

(06/28/2024 15:05:43) [WRN] [tcpsend] Error flushing connection Celeste.Mod.CelesteNet.Server.ConPlusTCPUDPConnection(TCP/UDP xxx:52568 [con-tcpudp-xxx]) TCP data: System.InvalidOperationException: Not currently flushing queue 'TCP Queue'!
   at Celeste.Mod.CelesteNet.CelesteNetSendQueue.SignalFlushed() in /home/celestenet/CelesteNet-git/CelesteNet.Shared/ConnectionTypes/CelesteNetConnection.cs:line 252
   at Celeste.Mod.CelesteNet.Server.ConPlusTCPUDPConnection.FlushTCPSendBuffer() in /home/celestenet/CelesteNet-git/CelesteNet.Server/ConPlus/TCPUDPConnection.cs:line 425
   at Celeste.Mod.CelesteNet.Server.ConPlusTCPUDPConnection.FlushTCPSendQueue() in /home/celestenet/CelesteNet-git/CelesteNet.Server/ConPlus/TCPUDPConnection.cs:line 388
   at Celeste.Mod.CelesteNet.Server.TCPUDPSenderRole.Worker.StartWorker(CancellationToken token) in /home/celestenet/CelesteNet-git/CelesteNet.Server/ConPlus/TCPUDPSenderRole.cs:line 88

For UDP queue these look like this in the log:

(06/28/2024 05:26:27) [WRN] [udpsend] Error flushing connection Celeste.Mod.CelesteNet.Server.ConPlusTCPUDPConnection(TCP/UDP xxx:58487 [con-tcpudp-xxx]) UDP queue: System.InvalidOperationException: Not currently flushing queue 'UDP Queue'!
   at Celeste.Mod.CelesteNet.CelesteNetSendQueue.SignalFlushed() in /home/celestenet/CelesteNet-git/CelesteNet.Shared/ConnectionTypes/CelesteNetConnection.cs:line 252
   at Celeste.Mod.CelesteNet.Server.TCPUDPSenderRole.Worker.FlushUDPSendQueue(ConPlusTCPUDPConnection con, CancellationToken token) in /home/celestenet/CelesteNet-git/CelesteNet.Server/ConPlus/TCPUDPSenderRole.cs:line 164
   at Celeste.Mod.CelesteNet.Server.TCPUDPSenderRole.Worker.StartWorker(CancellationToken token) in /home/celestenet/CelesteNet-git/CelesteNet.Server/ConPlus/TCPUDPSenderRole.cs:line 88

(06/28/2024 05:26:27) [INF] [main] Disconnecting: Celeste.Mod.CelesteNet.Server.ConPlusTCPUDPConnection(TCP/UDP xxx:58487 [con-tcpudp-xxx])
(06/28/2024 05:26:27) [INF] [playersession] Shutdown #26005 Celeste.Mod.CelesteNet.Server.ConPlusTCPUDPConnection(TCP/UDP xxx:58487 [con-tcpudp-xxx])

(06/28/2024 05:26:27) [CRI] [netplus] Error in thread pool thread 6: System.InvalidOperationException: Not currently flushing queue 'UDP Queue'!
   at Celeste.Mod.CelesteNet.CelesteNetSendQueue.SignalFlushed() in /home/celestenet/CelesteNet-git/CelesteNet.Shared/ConnectionTypes/CelesteNetConnection.cs:line 252
   at Celeste.Mod.CelesteNet.Server.TCPUDPSenderRole.Worker.StartWorker(CancellationToken token) in /home/celestenet/CelesteNet-git/CelesteNet.Server/ConPlus/TCPUDPSenderRole.cs:line 88
   at Celeste.Mod.CelesteNet.Server.NetPlusThread.ThreadLoop() in /home/celestenet/CelesteNet-git/CelesteNet.Server/NetPlus/ThreadPool.cs:line 196
(06/28/2024 05:26:27) [DBG] [netplus] Restarting thread pool thread 6
(06/28/2024 05:26:27) [DBG] [netplus] Thread pool thread 6 starting role worker for role Celeste.Mod.CelesteNet.Server.TCPUDPSenderRole

Both of these happen in here:

                            try {
                                switch (act) {
                                    case SendAction.FlushTCPQueue: {
                                        con.FlushTCPSendQueue();
                                    } break;
                                    case SendAction.FlushTCPBuffer: {
                                        con.FlushTCPSendBuffer();
                                    } break;
                                    case SendAction.FlushUDPQueue: {
                                        FlushUDPSendQueue(con, token);
                                    } break;
                                }
                            } catch (Exception e) {
                                switch (act) {
                                    case SendAction.FlushTCPQueue:
                                    case SendAction.FlushTCPBuffer: {
                                        if (e is SocketException se && se.IsDisconnect()) {
                                            Logger.Log(LogLevel.INF, "tcpsend", $"Remote of connection {con} closed the connection");
                                            con.DisposeSafe();
                                            continue;
                                        }

                                        Logger.Log(LogLevel.WRN, "tcpsend", $"Error flushing connection {con} TCP data: {e}");
                                        con.DisposeSafe();
                                    } break;
                                    case SendAction.FlushUDPQueue: {
                                        Logger.Log(LogLevel.WRN, "udpsend", $"Error flushing connection {con} UDP queue: {e}");
                                        try {
                                            con.UDPQueue.SignalFlushed();
                                        } catch (InvalidOperationException) {
                                            // this just means the exception that SignalFlushed can throw most likely got us here in the first place
                                        }
                                        con.DecreaseUDPScore(reason: "Error flushing queue");
                                    } break;
                                }
                            }

for the FlushTCP... SendActions it handles things just fine and disposes the connection with the flush error and moves on.

For UDP it looks like what can happen is that UDPQueue.SignalFlushed() inside of FlushUDPSendQueue can throw its "Not currently flushing" Exception, but then the Exception handling here tries to UDPQueue.SignalFlushed() again, which then throws the same exception again, killing the role worker.
The code above is the fix I thought should likely work, to just catch the InvalidOperationException when attempting to SignalFlushed in the catch block, if that makes sense. Then everything proceeds as normal, the UDP score decreases, the worker moves on.

Yet another ReadLine fix

This is related to #125

We only ever removed the usage of StreamReader from the client-side teapot handshake, but the server has still been buffering up to 1024 bytes of the network stream and potentially discarding them. I suspect we're just lucky that the clients stop writing anything after initiating the handshake and just wait for the full response?

The ReadLine implementation that was in the client is now in CelesteNetUtils.UnbufferedReadLine and the return new string(CollectionsMarshal.AsSpan(lineChars)); has become return Encoding.UTF8.GetString(CollectionsMarshal.AsSpan(lineChars));.
This fixes this totally important bug as well:
image

image

@RedFlames RedFlames changed the title Mitigate thread pool role thread restarts Mitigate thread pool role thread restarts (+UnbufferedReadLine fix) Jul 6, 2024
@RedFlames RedFlames merged commit 2ad5b60 into 0x0ade:main Jul 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant