Mitigate thread pool role thread restarts (+UnbufferedReadLine fix) #143
Attempting to fix three places where thread roles can die unexpectedly and cause them to restart.
Edit: This now also has a ReadLine buffering related fix, see further down.
Within 2024 these are the `uniq -c` times these occurred:

"The given key '...' was not present in the dictionary."
At one point in the TCPEPollPoller there used to be a
but got turned into
which can fail somehow.
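For context, this message is the one `Dictionary<TKey, TValue>`'s indexer produces through a `KeyNotFoundException` when the key is missing. A minimal sketch contrasting a throwing lookup with a `TryGetValue` lookup that logs a warning; the map contents and the names `LookupOrThrow` and `LookupOrWarn` are made up for illustration and are not taken from TCPEPollPoller:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a poller's FD-to-connection map.
var connections = new Dictionary<int, string> { [7] = "conn-7" };
var warnings = new List<string>();

// Indexer lookup: throws KeyNotFoundException for an unknown key.
string LookupOrThrow(int fd) => connections[fd];

// TryGetValue lookup: records a warning and returns null instead of throwing.
string? LookupOrWarn(int fd) {
    if (connections.TryGetValue(fd, out string? connection))
        return connection;
    warnings.Add($"No connection registered for FD {fd}, skipping event");
    return null;
}

try {
    LookupOrThrow(42);
} catch (KeyNotFoundException e) {
    Console.WriteLine($"indexer threw: {e.Message}");
}
Console.WriteLine(LookupOrWarn(42) ?? "<missing>");
Console.WriteLine(LookupOrWarn(7));
```

With the second variant a missing key costs one skipped event and a log line rather than the whole thread role.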
Does `StartPolling` need to grab `PollerLock.R()` at some point? I have no clue how this works :)

Now I turned it back into a `TryGetValue` and log a warning when that fails, but proceed anyway.

"Couldn't poll the EPoll FD: 4"
So here we had, uh...
where the `do-while` can be exited by either the token cancellation request or any `ret` value besides `EINTR`. I'm unsure why `EINTR` is allowed to go back into the poll wait if it wasn't from cancellation.

When a cancellation does happen, I don't actually know if it does or doesn't cause an `EINTR`. I think it actually shouldn't, it just triggers an event on the `CancelFD`. And that event just gets ignored because it's `int.MaxValue`'d.

So I guess the throw just shouldn't happen for an `EINTR` because it can only "break out" in conjunction with a cancellation - but the events should still be processed and then the outer while loop exits from cancellation?

Edit: I'm saying nonsense things. What even is going on.
The definition of `EINTR` here in TCPEPollPoller is `= 4`.

So what we've been doing is... if the return value is 4, which indicates 4 successful FDs ready... we ignore them and poll again.
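A minimal sketch of that collision, assuming the loop condition compared the return value directly against `EINTR`. The real loop is not reproduced here; `ShouldRetryOld` and `ShouldRetryFixed` are hypothetical names, and the error code is passed in as a plain parameter instead of being read from the runtime:

```csharp
using System;

const int EINTR = 4; // value as defined in TCPEPollPoller, per the description

// Assumed old behaviour: the return value itself is compared with EINTR,
// so "4 FDs ready" is indistinguishable from "interrupted".
bool ShouldRetryOld(int ret, bool cancelled) => !cancelled && ret == EINTR;

// Retry only when the call actually failed (ret < 0) and the error code is EINTR.
bool ShouldRetryFixed(int ret, int errno, bool cancelled) =>
    !cancelled && ret < 0 && errno == EINTR;

// Four ready FDs, no error: the old check throws the events away and polls again.
Console.WriteLine(ShouldRetryOld(4, cancelled: false));
Console.WriteLine(ShouldRetryFixed(4, errno: 0, cancelled: false));

// A genuinely interrupted call: negative return value with EINTR as the error code.
Console.WriteLine(ShouldRetryFixed(-1, errno: EINTR, cancelled: false));
```

Keeping the return value and the error code as separate inputs is what removes the ambiguity between a count of ready FDs and an error number.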
What this most definitely meant to check for is... `&& ret < 0 && Marshal.GetLastWin32Error() == EINTR`, though I'm still unsure of the reason. 🙃

"Not currently flushing queue 'UDP Queue'!"
First off, there's also these errors for TCP Queue but they don't kill the thread role:
For UDP queue these look like this in the log:
Both of these happen in here:
for the `FlushTCP...` SendActions it handles things just fine and disposes the connection with the flush error and moves on.

For UDP it looks like what can happen is that `UDPQueue.SignalFlushed()` inside of `FlushUDPSendQueue` can throw its "Not currently flushing" exception, but then the exception handling here tries to `UDPQueue.SignalFlushed()` again, which then throws the same exception again, killing the role worker.

The code above is the fix I thought should likely work, to just catch the `InvalidOperationException` when attempting to `SignalFlushed` in the catch block, if that makes sense. Then everything proceeds as normal, the UDP score decreases, the worker moves on.

Yet another ReadLine fix
This is related to #125
We only ever removed the usage of StreamReader from the client-side teapot handshake, but the server has still been buffering up to 1024 bytes of the network stream and potentially discarding them. I suspect we're just lucky that the clients stop writing anything after initiating the handshake and just wait for the full response?
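The buffering concern can be reproduced without a network connection. The sketch below reads the same bytes once through a `StreamReader` with a 1024-byte buffer and once byte by byte, then compares how far each approach advanced the underlying stream. `ReadLineUnbuffered` is a hypothetical helper written for this illustration and is not the project's `CelesteNetUtils.UnbufferedReadLine`; the payload is made up:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Runtime.InteropServices;
using System.Text;

// Hypothetical helper: reads one line byte by byte so nothing past the newline is consumed.
string ReadLineUnbuffered(Stream stream) {
    var lineBytes = new List<byte>();
    int b;
    while ((b = stream.ReadByte()) != -1 && b != '\n')
        lineBytes.Add((byte) b);
    // Drop a trailing carriage return so "\r\n" line endings are handled too.
    if (lineBytes.Count > 0 && lineBytes[^1] == (byte) '\r')
        lineBytes.RemoveAt(lineBytes.Count - 1);
    return Encoding.UTF8.GetString(CollectionsMarshal.AsSpan(lineBytes));
}

// One text line followed by data that a later reader of the stream still needs.
byte[] payload = Encoding.UTF8.GetBytes("HELLO\r\nBINARY-DATA");

// Buffered: the reader pulls as much as it can into its own buffer.
var bufferedStream = new MemoryStream(payload);
var reader = new StreamReader(bufferedStream, Encoding.UTF8, false, 1024, leaveOpen: true);
string? bufferedLine = reader.ReadLine();
long bufferedPosition = bufferedStream.Position;

// Unbuffered: the stream stops right after the line terminator.
var unbufferedStream = new MemoryStream(payload);
string unbufferedLine = ReadLineUnbuffered(unbufferedStream);
long unbufferedPosition = unbufferedStream.Position;

Console.WriteLine($"buffered:   line={bufferedLine}, stream position={bufferedPosition}");
Console.WriteLine($"unbuffered: line={unbufferedLine}, stream position={unbufferedPosition}");
```

Both variants return the same line, but after the buffered read the bytes following the newline are sitting in the reader's buffer and are no longer available to anything reading the stream directly.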
The ReadLine implementation that was in the client is now in `CelesteNetUtils.UnbufferedReadLine` and the `return new string(CollectionsMarshal.AsSpan(lineChars));` has become `return Encoding.UTF8.GetString(CollectionsMarshal.AsSpan(lineChars));`.

This fixes this totally important bug as well: