asyncio: Added async i/o APIs. #10605

Open · wants to merge 3 commits into base: main

Conversation

icculus
Collaborator

@icculus icculus commented Aug 28, 2024

This is still in-progress, but I've been heads-down on this for a while, so I'm putting it in a draft PR.

This implements an async i/o API, and a "generic" backend that implements it with a pool of SDL threads that can block on synchronous SDL_IOStreams. This is likely "good enough" for most things, and it's usable on everything but single-threaded Emscripten builds.

The downsides to the generic backend:

  • While i/o happens in a background thread, you can't do two accesses to the same open file in parallel, because the SDL_IOStream needs to seek for each one, so the file position state would race. There's a mutex protecting it per opened file. Separate files (even two opens of the same file) can work in parallel, though.
  • It can't cancel in-flight i/o; if a read/write is happening, it can't be stopped before it completes. In practice, this isn't likely a big deal.
  • There are limited threads for i/o (currently (SDL_GetCPUCount * 2) + 1 at most), and it uses the same thread pool that is used for notifying apps that i/o is complete, which is to say that if the app uses that callback for heavy processing, the thread won't be available for more i/o until the callback returns. In practice, you'd have to have lots of reads/writes that also have very heavy callbacks before you'd see bottlenecks, I think, but system-specific APIs can just get all the i/o cooking separately regardless of callback weight. (EDIT: this isn't true now. The generic backend uses up to 8 threads for i/o, but they are only limited by disk bandwidth, and no longer have to worry about the app hogging them in a callback.)
  • It can't ever work for single-threaded Emscripten, because it needs threads. (EDIT: it works in single-threaded Emscripten now...it's just synchronous, because Emscripten's filesystem API is synchronous and almost always just a memcpy from MEMFS anyhow).

This would benefit from a Windows backend, that uses win32 IO Completion Ports, to solve these limitations. Linux can use io_uring if that's generally available at this point. Emscripten could use actual async file APIs and run stuff between mainloop iterations if single-threaded (EDIT: see above), or hand off from there to the usual thread pool if not. I'm sure Mac/iOS/Android have a thing, too, but the ABI wouldn't change, so we can deal with those later.

Fixes #1374.

@icculus icculus added this to the 3.0 ABI milestone Aug 28, 2024
@namandixit

Any specific reason why a callback-based system was chosen?

An alternative would be SDL_ReadIOAsync, etc. returning a handle that the application could poll to ask whether the operation has finished. This is the interface I have in my engine, and I feel it is more natural for loading screens, etc. (store handles for all in-flight operations in an array, iterate and poll every frame, keep showing the loading screen unless all return DONE).

With a callback-based system, this gets complex (maintain a multithread-safe hash table for all operations, the callback sets the value to DONE with the asset ID as the key, the application iterates over the hash table); especially if the platform layer and game layer are reasonably decoupled.
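
For illustration, the polling style is roughly this (AsyncHandle, AsyncStatus and Async_Poll are purely illustrative names, not SDL API):

#include <stdbool.h>

typedef enum { ASYNC_PENDING, ASYNC_DONE, ASYNC_FAILED } AsyncStatus;
typedef struct AsyncHandle AsyncHandle;
AsyncStatus Async_Poll(AsyncHandle *handle);   /* provided by the platform layer */

/* called once per frame: keep showing the loading screen until this returns true */
static bool AllLoadsFinished(AsyncHandle **inflight, int count)
{
    for (int i = 0; i < count; i++) {
        if (Async_Poll(inflight[i]) != ASYNC_DONE) {
            return false;
        }
    }
    return true;
}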

@icculus
Collaborator Author

icculus commented Aug 28, 2024

This is one of those "you can't please everyone" things. I started with a poll interface and was asked to change it to callbacks.

You could just have an array of bools and have the callback set each one to true as pieces finish loading, and each frame check for true instead of DONE...or something like that.
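
Something like this, roughly (the completion callback signature here is made up, since that part of the API is still in flux):

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_ASSETS 32
static atomic_bool asset_loaded[NUM_ASSETS];   /* written by i/o threads, read by the main loop */

/* hypothetical completion callback: userdata carries the asset index */
static void OnLoadComplete(void *userdata, void *data, size_t datalen)
{
    (void) data; (void) datalen;
    atomic_store(&asset_loaded[(intptr_t) userdata], true);
}

/* checked once per frame, instead of polling individual handles */
static bool EverythingLoaded(void)
{
    for (int i = 0; i < NUM_ASSETS; i++) {
        if (!atomic_load(&asset_loaded[i])) {
            return false;
        }
    }
    return true;
}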

@slouken
Collaborator

slouken commented Aug 28, 2024

This generally looks good. Is there going to be WriteFileAsync()? What's the interaction with storage?

@nfries88

nfries88 commented Aug 28, 2024

Looks pretty good. I do have some input:

  • There are limited threads for i/o (currently (SDL_GetCPUCount * 2) +1 at most)

While this is probably fine heuristically, I would probably add a hard maximum of 64. There are a lot of people gaming on old workstations that would exceed that cap and be unlikely to benefit from more threads in the typical case of I/O on a single drive.

and it uses the same thread pool that is used for notifying apps that i/o is complete... in practice, you'd have to have lots of read/writes that also have very heavy callbacks before you'd see bottlenecks, I think

this is fine. a note in the documentation that callbacks should try to be brief and forward heavy processing to other threads may be warranted, because many assets do need reasonably heavy processing.

This would benefit from ... IO Completion Ports ... io_uring ... Emscripten could use actual async file APIs

I don't disagree, but it might be more straightforward to start with an implementation that can portably share logic. A relatively thin shim over pread/pwrite on Unix-likes and ReadFile/WriteFile on Win32 can enable a single implementation with better characteristics than the generic implementation.
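
A minimal sketch of the Unix-like half of such a shim, just to show why positional reads sidestep the per-file seek lock; this is plain POSIX pread(), not proposed SDL API:

#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* read at an explicit offset without touching the shared file position, so two
   threads can read the same fd in parallel; the Win32 equivalent would pass the
   offset in an OVERLAPPED structure to ReadFile */
static ssize_t positioned_read(int fd, void *buf, size_t len, off_t offset)
{
    ssize_t total = 0;
    while ((size_t) total < len) {
        const ssize_t rc = pread(fd, (char *) buf + total, len - total, offset + total);
        if (rc < 0) {
            return rc;   /* error; caller can check errno */
        } else if (rc == 0) {
            break;       /* EOF */
        }
        total += rc;
    }
    return total;
}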

Emscripten would benefit most from its own implementation, IIRC all the libc synchronous I/O actually gets proxied to the main runtime thread in order to simplify handling the promises from the underlying JS interfaces.

I'm sure Mac/iOS/Android have a thing, too, but the ABI wouldn't change, so we can deal with those later.

Android doesn't have a thing for this.

Mac and iOS have dispatch_io, which just doesn't mesh well with this interface (in particular, there is no way to supply a read buffer), and at least when it was new it wasn't any better for regular files than a simpler threadpool anyway.

@slouken
Collaborator

slouken commented Aug 28, 2024

Crazy thought... do you actually want that many I/O threads? Maybe we should start with one and add a hint that lets applications tune that?

@nfries88

Crazy thought... do you actually want that many I/O threads? Maybe we should start with one and add a hint that lets applications tune that?

having a fixed number of not-enough threads makes the system unusable for many real projects, and having a few too many idle threads is always cheaper than creating them on demand. This is what posix aio implementations generally got wrong and why nobody really uses them for heavy I/O (except for FreeBSD's implementation, I guess).

tunability as a range with a heuristic selection within that range might be appropriate. All of the factors that would determine the optimal number of threads for an I/O-busy program are not really statically knowable, but a user could profile their particular I/O pattern across a good variety of machines and come up with a range that might be better than just relying on the heuristic alone.

@icculus
Collaborator Author

icculus commented Aug 28, 2024

This generally looks good. Is there going to be WriteFileAsync()? What's the interaction with storage?

LoadFileAsync is useful because it has to open the file, figure out how large the file is, allocate memory for it, and read into it. WriteFileAsync would only need to open and write an existing buffer, so it's only saving one step. But we could definitely add it!

Storage just needs an equivalent of LoadFileAsync. In the built-in implementation, it will literally just call SDL_LoadFileAsync. Current thinking was not to implement positional reads/writes at the storage level, but I'm still trying to understand how data access through an SDL_Storage interface works when you have a file like Spider-Man 2's New York map, which is presumably dozens of gigabytes.

Crazy thought... do you actually want that many I/O threads? Maybe we should start with one and add a hint that lets applications tune that?

Right now it spins up threads on demand if there aren't any idle ones, up to a maximum count, and threads that are idle for more than 30 seconds will start terminating until only one is left (we always keep one around so there isn't a failure case when no threads can start). Until you use async i/o, it won't initialize anything, so there's no resource cost at all until you touch it. The idea is that you likely have a flood of startup loading and then there's not a lot of value in keeping idle threads around.

There's a FIXME to add a hint, and we definitely should tune the defaults in any case.

@icculus
Collaborator Author

icculus commented Aug 29, 2024

Okay, single-threaded Emscripten support is in, so this is Good Enough to ship and do improvements with system-specific APIs later...but do we really want this callback API?

It forces people to deal with locking and to worry that they'll clog the thread pool if they don't move quickly enough.

Maybe it would be better if we give them something like:

SDL_GetAsyncResults(&details)

Internally we keep a queue of completed tasks, and each call to this returns the next completed one with all the details. You don't poll specific tasks, you just get the next one in the completed list. Call doesn't block, returns false if nothing is ready.

This dramatically simplifies SDL's code and the mental load the app needs to manage. The app can still toss work to a background thread to process the loaded data if it wants...but if all it wanted was to not block on i/o, it can still benefit even in a single-threaded app.
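
Roughly like this (sketch only: the details struct and exact signature are hypothetical, since none of this exists yet):

#include <stdbool.h>
#include <stddef.h>

/* hypothetical: whatever SDL would report for a finished task */
typedef struct SDL_AsyncResultDetails {
    void *userdata;    /* what the app passed in when it started the task */
    bool success;
    void *buffer;      /* for reads: where the data landed */
    size_t bytes;      /* how much actually transferred */
} SDL_AsyncResultDetails;

extern bool SDL_GetAsyncResults(SDL_AsyncResultDetails *details);  /* proposed; non-blocking */
extern void HandleLoadedAsset(void *userdata, void *buffer, size_t bytes, bool success);  /* app-defined */

static void DrainCompletedIO(void)
{
    SDL_AsyncResultDetails details;
    /* returns false as soon as the completed-task queue is empty */
    while (SDL_GetAsyncResults(&details)) {
        HandleLoadedAsset(details.userdata, details.buffer, details.bytes, details.success);
    }
}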

@slouken
Collaborator

slouken commented Aug 29, 2024

Sure, I buy it, let’s do that instead.

@slouken
Collaborator

slouken commented Aug 29, 2024

The one thing that would be nice is a way to wait for multiple I/O operations to complete. WaitForCompletion(array of operations)

@slouken
Collaborator

slouken commented Aug 29, 2024

You probably want a separate handle for each request. For example if a library does async loading it wants to get its own data separate from the app.

@nfries88

nfries88 commented Aug 30, 2024

Internally we keep a queue of completed tasks, and each call to this returns the next completed one with all the details. You don't poll specific tasks, you just get the next one in the completed list. Call doesn't block, returns false if nothing is ready.

Allowing for separate completion queues (probably specified when creating an async file object, with NULL referring to some default completion queue) would probably be the most useful form of this approach, since different threads would have different interests. That also conveniently translates very directly to how IOCP and io_uring both deal with completions. It should also be convenient for tying I/O ops directly to different asset pipelines.

@namandixit

namandixit commented Aug 30, 2024

I think allowing separate queues of I/O operations and then enumerating within each queue with SDL_GetAsyncResults(queue, &details) will be the best option. If someone doesn't want to store all handles, they can just use a single queue. If someone wants to deal with each I/O operation separately, they can have one queue per operation. And it allows future features and optimisations (queue priority, separate queue for local vs remote resources, different queues for different LOD levels, decompression on particular queues, etc.).

@icculus icculus force-pushed the sdl3-asyncio branch 2 times, most recently from d04bd1b to 5e312e7 on September 1, 2024 at 22:24
@icculus
Collaborator Author

icculus commented Sep 1, 2024

Okay, this now has queryable tasks and "task groups" and the background thread callbacks are gone.

I'm super happy with how this came out. Here's the category documentation from the new version, which gives the basic idea:

CategoryIOAsync

SDL offers a way to perform i/o asynchronously. This allows an app to read or write files without waiting for data to actually transfer; the functions that request i/o never block while the request is fulfilled.

Instead, the data moves in the background and the app can check for results at their leisure.

This is more complicated than just reading and writing files synchronously, but it can allow for more efficiency and avoid framerate drops while the hard drive catches up, etc.

The general usage pattern for async i/o is:

  • Open a file with SDL_IOAsyncFromFile.
  • Start i/o tasks to that file with SDL_ReadIOAsync or SDL_WriteIOAsync.
  • Later on, use SDL_GetIOAsyncTaskResult to see if the task is finished.
  • When your tasks are done, close the file with SDL_CloseIOAsync.

If you want to track several tasks at once, regardless of the order in which they complete:

  • Create a task group with SDL_CreateIOAsyncGroup.
  • Specify this group when calling SDL_ReadIOAsync or SDL_WriteIOAsync.
  • Use SDL_GetIOAsyncGroupResult to see if anything has finished yet.

This all works, without blocking, in a single thread, but one can also use a task group in a background thread and sleep until new results have arrived:

  • Call SDL_WaitIOAsyncGroup from one or more threads to efficiently block until new tasks complete.
  • When shutting down, call SDL_SignalIOAsyncGroup to unblock sleeping threads despite there being no new tasks completed.

And, of course, to match the synchronous SDL_LoadFile, we offer SDL_LoadFileAsync as a convenience function. This will handle allocating a buffer, slurping in the file data, and null-terminating it; you still get a task handle to check later or add to a task group.
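
A rough sketch of that pattern in code. The function names are the ones from the documentation above, but the parameter lists and the result type are guesses, and everything here is still subject to renaming:

/* sketch only: names from the docs above, signatures guessed */
SDL_IOAsync *file = SDL_IOAsyncFromFile("level1.dat", "r");
SDL_IOAsyncGroup *group = SDL_CreateIOAsyncGroup();

const size_t CHUNK_SIZE = 64 * 1024;
void *buf = SDL_malloc(CHUNK_SIZE);
SDL_ReadIOAsync(file, buf, 0, CHUNK_SIZE, group, buf);   /* returns immediately; userdata here is just the buffer */

/* ...later, once per frame... */
SDL_IOAsyncResult result;   /* hypothetical result struct */
while (SDL_GetIOAsyncGroupResult(group, &result)) {
    /* result identifies which task finished and whether it succeeded */
}

/* once all tasks are done */
SDL_CloseIOAsync(file);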

@slouken
Collaborator

slouken commented Sep 1, 2024

I'm super happy with how this came out. Here's the category documentation from the new version, which gives the basic idea:

Nice!

@icculus icculus marked this pull request as ready for review September 1, 2024 23:09
@jaedan

jaedan commented Sep 2, 2024

Hi - I was sent this direction because I've been peripherally involved in SDL_Gpu (doing reviews focused on performance and answering questions about PCI and DMA). I happen to build storage APIs professionally though, so looking at this pull request is far more up my alley. I had a really long review I was working on, but the recent changes you've made addressed about 80% of what I was going to say so this will now be much shorter (but still probably long).

I think for now I'll stick to just public API feedback, and maybe I'll post in a few separate posts.

First, I do not think it is possible to implement the API calls that do not require a task group efficiently on Linux. On Linux, your asynchronous I/O choices are either Linux AIO (not to be confused with POSIX AIO) or io_uring. Both of these APIs require you to first create some per-thread context (e.g. a task group) in order to submit I/O. Creating these on demand per-file is expensive and may run into OS-imposed limits on the number of io_urings or aio_contexts available. I strongly suggest that you only expose the APIs that require a task group - it's the only way to get true asynchronous I/O. I've personally made a similar mistake with APIs released 10+ years ago and it was really painful to find my way out of the corner I painted myself into.

Second, opening and closing the file should be asynchronous operations too (and they should be submitted to the task group). These can actually secretly block a long time in some cases. Of course, on many operating systems there isn't an asynchronous open/close option at all, and the implementations there will be blocking. But at least on Linux with io_uring you can asynchronously open and close files, so the API should allow for the best behavior rather than forcing the worst.

More review to follow.

@icculus
Collaborator Author

icculus commented Sep 2, 2024

(Sorry, I just squashed this down to a single commit right before you posted that; it's all the same stuff, just in one commit.)

I'm skeptical that async opens are worth the API complexity, but I can totally see a close operation blocking while cached writes make their way to disk. I'm nervous about adding the complexity, but let me think on this a little tonight.

As for tasks vs task groups: presumably with io_uring (and Win32 i/o completion ports), we'll have a single internal thread, mostly asleep, that exists just to move data from the OS (a single context/completion port) into our own task groups. If that's not a terrible idea, then it doesn't matter whether the app wants to query a single task or a task group; it's just a question of whether they look at a struct directly or shift the head of a linked list of structs before looking at it.

@icculus
Collaborator Author

icculus commented Sep 2, 2024

(Also, more review is totally welcome at any length you're willing to write, because I'm 100% winging it over here like I know what I'm doing.)

@jaedan

jaedan commented Sep 2, 2024

The shift to support "task groups" is an excellent and necessary change. However, I do still have a few pieces of feedback on the task group design.

  1. One major source of overhead in storage APIs is the system calls themselves. The only way to work around this is to batch operations and submit them in a single system call. Some operating systems now have facilities for this (e.g. io_uring, and I think there's something on Windows called ioRing now); see the liburing sketch after this list. But in your API it isn't really clear whether this sort of batching would be allowed. I guess you could make an SDL_ReadIOAsync call simply queue onto the group and then only really submit the operation to the OS in SDL_GetIOAsyncGroupResult, but I think, as the functions are named now, it isn't clear enough to the user that they need to call SDL_GetIOAsyncGroupResult once to even start the operations. I think you need a separate SDL_QueueReadIOAsync and then an SDL_SubmitIO type of setup.

  2. Task groups should NOT be thread safe. The whole purpose of io_uring, or Linux aio, or Windows ioRing (and to some extent iocp), or even NVMe multiple submission/completion queues or Linux blk-mq is that you don't need locks. Instead, instruct users to create one task group per thread and use it for every operation on that thread for its entire duration. This needs to work essentially the same way as SDL_Gpu's command buffers because the underlying storage hardware happens to work exactly the same way as the underlying GPU hardware in terms of command submission and completion.

  3. The name task group indicates to me something lighter weight than what I think this really needs to represent. A task group needs to represent a Linux io_uring, or a Windows iocp, etc. These are not cheap things to create and destroy. In fact, we only want to do it once per thread and at thread start up time, then re-use it. Most of the modern OS APIs call it a "ring" (because it's literally a circular submission ring of commands) or a "queue". I've historically called it an "io_channel".
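
To make point 1 concrete, this is what batched submission looks like with raw liburing: several reads get queued as SQEs and a single io_uring_submit() (one system call) starts them all. The SDL-side Queue.../Submit split above is hypothetical, but it could map directly onto this:

#include <liburing.h>

static int submit_batched_reads(struct io_uring *ring, int fd,
                                void *bufs[], unsigned nbufs, unsigned chunk)
{
    for (unsigned i = 0; i < nbufs; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        if (!sqe) {
            break;   /* submission ring is full; submit what we have */
        }
        io_uring_prep_read(sqe, fd, bufs[i], chunk, (unsigned long long) i * chunk);
        io_uring_sqe_set_data(sqe, bufs[i]);   /* lets us find the buffer again at completion time */
    }
    return io_uring_submit(ring);   /* one syscall for the whole batch */
}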

(More coming)

@jaedan

jaedan commented Sep 2, 2024

As for tasks vs task groups: presumably with io_uring (and Win32 i/o completion ports), we'll have a single internal thread that lives just to deal with moving data from the OS in a single context/completion port to our own task groups that mostly sleeps. If that's not a terrible idea, then it doesn't matter if the app wants to query a single task or a task group, it's just a question of whether they look at a struct directly or shift the head of a linked list of structs before looking at it.

Using a thread is, unequivocally, a terrible idea (sorry, that sounds really harsh and I promise I don't mean it to be, but I've been at this a long time and everyone wants to use a thread). The hardware (and I've helped design NVMe from close to the beginning) wants you to create a separate "queue" per thread. The hardware accepts command submissions in parallel (and they're all asynchronous too!). And the operating systems have spent the last decade re-designing themselves to do parallel, asynchronous I/O internally to match the hardware.

You are supposed to create a per-thread "context" (io_uring, aio_context, ioRing, iocp, whatever) at thread start up time, and use that exclusively for the I/O operations you perform on that thread. This will allow for parallel, lockless I/O submission and provide the lowest latency.

@nfries88

nfries88 commented Sep 2, 2024

I'm skeptical that async opens are worth the API complexity

I have noticed file open latency before. I've never seen it be significant for a single file, and the many-file usecase is probably covered adequately enough by SDL_LoadFileAsync and perhaps similar composite operations. I'd agree it's probably not worth the complexity.

but I can totally see a close operation blocking while cached writes make their way to disk. I'm nervous about adding the complexity, but let me think on this a little tonight.

An async sync-and-close would be useful for savefiles. System APIs typically will not wait for cached writes to hit the disk before closing a file, having this behavior be explicit as a composite operation is useful.

As for tasks vs task groups: presumably with io_uring (and Win32 i/o completion ports), we'll have a single internal thread that lives just to deal with moving data from the OS in a single context/completion port to our own task groups that mostly sleeps. If that's not a terrible idea, then it doesn't matter if the app wants to query a single task or a task group, it's just a question of whether they look at a struct directly or shift the head of a linked list of structs before looking at it.

Windows will block inside an Overlapped ReadFile when the underlying disk driver's queue is full. It's not that hard to hit this with random reads because Windows disk drivers have shallow queues compared to e.g. Linux. This is unfortunate because other pending reads to the same device might still succeed without blocking by hitting the filesystem cache, and when combined with other uses of Overlapped I/O that should never block (e.g. sockets), this is obviously even worse. It will also block on Overlapped WriteFile for writes that extend a file, which is a pretty normal thing to happen. Because of this behavior the typical procedure is to still use a threadpool to issue overlapped I/O. Probably a lesser concern, but Wine also doesn't attempt to provide asynchronous I/O to regular files at all.

As far as io_uring goes, I definitely agree with jaedan that this would be a suboptimal use of it, but also think that the typical use of a per-thread ring is both needlessly resource-expensive (they are not lightweight objects) and a poor fit for games' logic. A ring wrapped in a mutex for each task group is probably the least bad approach here, although I definitely lack jaedan's level of experience with io_uring. Extending the API with batching operations would probably bring that kind of implementation closer to optimal (also an improvement for other implementations) but that can be a later enhancement IMO.

@slouken
Collaborator

slouken commented Sep 2, 2024

We could create a heavy weight async I/O context when a thread first uses the async API. This could be stored in TLS and be automatically cleaned up, so the users get the benefit of that without having to explicitly track things themselves.
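
As a sketch of the idea (using C11 _Thread_local for illustration; a real version would presumably use SDL's TLS support so the context is destroyed automatically when the thread exits, and the IOContext type here is hypothetical):

typedef struct IOContext IOContext;        /* hypothetical wrapper around an io_uring / IoRing / iocp */
extern IOContext *CreateIOContext(void);   /* hypothetical; expensive, so do it once per thread */

static _Thread_local IOContext *tls_io_context;

static IOContext *GetThreadIOContext(void)
{
    if (!tls_io_context) {
        /* the first async i/o call on a given thread pays the setup cost */
        tls_io_context = CreateIOContext();
    }
    return tls_io_context;
}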

It might be worth adding a io_uring and I/O ring implementation to validate that this API works efficiently with those interfaces.

The only thing that affects the ABI is the question of how asynchronous I/O interacts with the storage API, and it's becoming clear that it doesn't. This seems like it's intended as a low level high performance interface to disk files. If that's true, then let's bump this out of the ABI milestone.

@jaedan

jaedan commented Sep 2, 2024

Because of this behavior the typical procedure is to still use a threadpool to issue overlapped I/O.

On Linux for many many years (via Linux aio) and on Windows for the last 3, this is no longer the standard procedure. We shouldn't make the API inefficient on modern operating systems by design, but rather accept the inefficiency only for operating systems with poor asynchronous IO support.

Let's imagine we have a system with a background thread that does the IO processing. This thread may spin while there's IO outstanding for efficiency, but it certainly can't just spin all the time. IO in games is sporadic, so most commonly this thread will be asleep when we next start a burst of IO. So we queue up the operation into the shared queue or ring and then signal the background thread to wake. On most operating systems, thread scheduling latency is only at about 1ms granularity, and even if it reacts instantly, just the mechanics of restoring thread state to wake it take a significant chunk of time. So all of this extra time is added to our IO request. But then the underlying SSD will complete the request in 20-100 microseconds most likely. Why would we 10x our latency because of software design? It makes no sense.

but also think that the typical use of a per-thread ring is both needlessly resource-expensive (they are not lightweight objects) and a poor fit for games' logic

I really can't disagree more on both points. These objects are not cheap to create because they usually do memory allocations and system calls, but actual resource use is very minimal. Usually it's 64 bytes * number of queue entries. And you make one of these per thread that wants to perform IO. If you have true asynchronous IO, then you only need a very limited number of threads in your game that are doing asset loading. We're talking kilobytes of total resource use here.

And regarding a poor fit for game logic, a ring buffer for command submission is a perfect fit for game logic. On each tick you enqueue all your IO operations. At end of tick you submit and check for completions. It cannot be more ideal. It also happens to be exactly how the GPU API works too, and it would be how an optimal networking API would function as well.

@nfries88

nfries88 commented Sep 6, 2024

  • We need to figure out how to make this work with SDL_Storage, which probably means exposing both SDL_AsyncIOInterface and SDL_AsyncIOQueueInterface, telling storage people to implement both if they aren't working with raw files, and adding code to make sure you can only add a task on a compatible queue. I'm not sure what that would look like yet. I assume most implementers won't bother.

probably simpler to just expose a function to add an arbitrary completion to a queue so storage implementations can do their own thing and not worry about having incompatible queue types.

@icculus icculus force-pushed the sdl3-asyncio branch 2 times, most recently from c14d810 to 3744c5b on September 17, 2024 at 16:42
@icculus icculus force-pushed the sdl3-asyncio branch 4 times, most recently from 5e89af8 to eabc72f on September 30, 2024 at 04:40
@icculus
Collaborator Author

icculus commented Oct 1, 2024

Linux io_uring support is in. This is something I was planning to knock out real fast this morning, so of course it took all day, but it shows the existing API maps pretty well to it.

(@nfries88 and @jaedan, feel free to take a look if you're interested.)

@icculus
Collaborator Author

icculus commented Oct 1, 2024

Also:

Force a sync during close: maybe. Since it's not synchronous, it's not unreasonable, but we'd have to add an fsync method to SDL_IOStream so the generic backend can use it.

SDL_IOStream syncs files to disk in SDL_FlushIO() now (with no implicit flush if you just SDL_CloseIO() it), so this is now doable.

I'm still inclined to just sync all files during the async close work, since it doesn't have to block, and otherwise the app is going to have to make sure the write completes before a flush is queued, so they don't run in the wrong order.

@nfries88

nfries88 commented Oct 2, 2024

io_uring's completion side is not thread-safe either, and racier than the submission side because the CQE isn't "removed" until you call io_uring_cqe_seen(), eg multiple threads may wind up getting the same cqe from io_uring_wait_cqe.

So everything from io_uring_wait_cqe/io_uring_peek_cqe until io_uring_cqe_seen needs to be inside a mutex if we want to support getting completions from the same queue from multiple threads.

note that io_uring_wait_cqe is basically equivalent to io_uring_enter2 followed by io_uring_peek_cqe, so you could separate the calls and allow multiple waiters without blocking peekers.
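
Concretely, the completion side needs something like this (sketch only; the important parts are holding one lock from the peek until io_uring_cqe_seen() and copying the CQE out before unlocking):

#include <liburing.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

static bool pop_completion(struct io_uring *ring, pthread_mutex_t *lock,
                           uint64_t *user_data, int *res)
{
    struct io_uring_cqe *cqe = NULL;
    bool got_one = false;

    pthread_mutex_lock(lock);
    if (io_uring_peek_cqe(ring, &cqe) == 0 && cqe != NULL) {
        *user_data = cqe->user_data;   /* copy out while still holding the lock */
        *res = cqe->res;
        io_uring_cqe_seen(ring, cqe);  /* after this, no other thread can get the same CQE */
        got_one = true;
    }
    pthread_mutex_unlock(lock);
    return got_one;
}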

I'll probably do a fuller analysis later but that's the one thing I noticed doing a quick skim.

ps: now that you have an implementation that can clearly benefit from it, maybe operation batching is the next thing to look at?

@icculus
Collaborator Author

icculus commented Oct 3, 2024

So everything from io_uring_wait_cqe/io_uring_peek_cqe until io_uring_cqe_seen needs to be inside a mutex if we want to support getting completions from the same queue from multiple threads.

Ugh, I assumed seen tells the kernel that it can reuse that slot to fill in a new completed request, not that it will keep giving out the same cqe to everyone waiting until it's marked as seen, but you're right, that's totally how it works. I'll fix that up.

ps: now that you have an implementation that can clearly benefit from it, maybe operation batching is the next thing to look at?

If there's a reasonable way to do this in the API without it getting super-ugly, sure. But at that point we're mostly talking about reducing the number of syscalls, right? Doing so is good, but we're not trying to solve the C10k problem, so it's not high-priority for me.

@icculus icculus modified the milestones: 3.0 ABI, 3.2.0 Oct 3, 2024
@icculus
Collaborator Author

icculus commented Oct 3, 2024

Bumping this to 3.2.0, since the ABI lock is about to happen (we can add new stuff after ABI lock).

@nfries88

nfries88 commented Oct 4, 2024

Ugh, I assumed seen tells the kernel that it can reuse that slot to fill in a new completed request, not that it will keep giving out the same cqe to everyone waiting until it's marked as seen, but you're right, that's totally how it works. I'll fix that up.

yeah, basically the entire liburing API is thread-unsafe except for the actual io_uring_enter syscalls. I'm not even 100% sure that it's safe to submit from one thread and reap from another, but it seems to work fine in practice.

But at that point we're mostly talking about reducing the number of syscalls, right? Doing so is good, but we're not trying to solve the C10k problem, so it's not high-priority for me.

that's the main advantage. But yeah the syscall overhead probably isn't significant enough that it would turn up when profiling.

@icculus
Collaborator Author

icculus commented Oct 4, 2024

I think the io_uring thread issues are settled in 7d4fdab. I put a separate mutex on the submission and completion queues, and the completion side copies the data out and unlocks as fast as possible...so serious contention probably won't be an issue in any reasonable case, and the wait() call just uses io_uring's API to block until there are new results, but then it grabs a mutex and uses the peek codepath to make sure that only one thread gets a valid event.

If the kernel wakes up multiple waiters when there's only one new event, the existing API allows for this, so I think this part is solid now.

@icculus
Collaborator Author

icculus commented Oct 4, 2024

I'm wondering if we should just change...

extern SDL_DECLSPEC SDL_AsyncIOTask * SDLCALL SDL_CloseAsyncIO(SDL_AsyncIO *asyncio, SDL_AsyncIOQueue *queue, void *userdata);

...to have a bool flush option, since it's trivial for the backend to flush before the close, but an app would have to make sure all writes complete before issuing a flush command, lest the flush happen before the write.

We could also (or instead) add this to SDL_WriteAsyncIO.

Other options are always implicitly flush on close (easy for everyone, but less efficient for use-cases that could skip it), or add an explicit flush operation that the app is responsible for queueing correctly, or add a way to specify some operations need to serialize behind the scenes (this would be IOSQE_IO_LINK in io_uring, but this is sort of a pain and relies on future backends supporting it).
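
For the first option, the change would look something like this (the parameter's exact position is just a guess):

extern SDL_DECLSPEC SDL_AsyncIOTask * SDLCALL SDL_CloseAsyncIO(SDL_AsyncIO *asyncio, bool flush, SDL_AsyncIOQueue *queue, void *userdata);

/* app side: queue a close that fsyncs pending writes first, then poll the queue for its completion like any other task */
SDL_CloseAsyncIO(asyncio, true, queue, userdata);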

@icculus icculus force-pushed the sdl3-asyncio branch 3 times, most recently from d8adf13 to 269502a on October 4, 2024 at 08:47
@slouken
Collaborator

slouken commented Oct 4, 2024

I'm wondering if we should just change...

extern SDL_DECLSPEC SDL_AsyncIOTask * SDLCALL SDL_CloseAsyncIO(SDL_AsyncIO *asyncio, SDL_AsyncIOQueue *queue, void *userdata);

...to have a bool flush option, since it's trivial for the backend to flush before the close, but an app would have to make sure all writes complete before issuing a flush command, lest the flush happen before the write.

This seems like a good idea.

@nfries88

nfries88 commented Oct 5, 2024

I'm wondering if we should just change...

extern SDL_DECLSPEC SDL_AsyncIOTask * SDLCALL SDL_CloseAsyncIO(SDL_AsyncIO *asyncio, SDL_AsyncIOQueue *queue, void *userdata);

...to have a bool flush option, since it's trivial for the backend to flush before the close, but an app would have to make sure all writes complete before issuing a flush command, lest the flush happen before the write.

that is perfectly adequate.

We could also (or instead) add this to SDL_WriteAsyncIO.

this would be a little more questionable. I suppose it's possible some user may want to sync a rarely written file to disk but also keep it open for the purposes of reading, but this is not a viable strategy for keeping the disk coherent if doing parallel writes to different portions of the file, because fsync() will sync incomplete writes and also because a later write may complete before an earlier write. Getting this right basically requires the implementation to serialize writes whenever a synchronizing write is queued, which is why databases etc tend to basically implement their own disk cache in-process and do direct I/O (which is inherently synchronized with the disk) instead.

Other options are always implicitly flush on close (easy for everyone, but less efficient for use-cases that could skip it)

probably harmless, but maybe someone would object.

or add an explicit flush operation that the app is responsible for queueing correctly

hairy for the same reasons as synchronizing writes.

add a way to specify some operations need to serialize behind the scenes (this would be IOSQE_IO_LINK in io_uring, but this is sort of a pain and relies on future backends supporting it).

I've never seen an async framework provide chained operations without some sort of hairiness either.

@icculus
Collaborator Author

icculus commented Oct 6, 2024

Okay, optional flush/fsyncdata on close is in, 52c58b9.

Unrelated, a pending TODO item is to remove SDL_AsyncIOTask from the public headers, and just return bool instead of a task handle, and this makes sense with the current API.

But code is already implemented in both backends to attempt to cancel in-flight tasks...it's not currently used because the public API doesn't offer it. If we want to add it, we'll need the task handle in the public API. If we don't, we can lose the task handles and delete some code from the backends.

But if we don't offer it now, we can't add it later without adding replacement APIs that return a handle.

So which way do we want to go on this?

Canceling a task isn't guaranteed; if it hasn't started, both our threadpool and io_uring can cancel it, but otherwise they'll just have to wait for results once the work has gotten underway.

For files, this usually isn't a big deal, because presumably everything will finish in finite time. But if we ever open this up for something that can block indefinitely (like socket i/o), cancellation might become super-important. I'm 100% okay with this interface only ever being used on files, though.

@nfries88

nfries88 commented Oct 6, 2024

For files, this usually isn't a big deal, because presumably everything will finish in finite time. But if we ever open this up for something that can block indefinitely (like socket i/o), cancellation might become super-important. I'm 100% okay with this interface only ever being used on files, though.

my super-opinionated and possibly unpopular take is that async I/O cancellability still doesn't really make sense for sockets except for automatically at program shutdown and maybe when closing the socket. If the connection is broken, the request in progress should error out.

It's also worth pointing out that while async I/O for sockets is the gold standard in scalability, if you're writing netcode for a FPS or something you definitely want kqueue/epoll/etc and non-blocking I/O over a threadpool, so the implementation should special-case sockets if it wants to be broadly useful for games. There's a few benchmarks that strongly suggest this is true for io_uring too (epoll methods being dramatically faster than io_uring for sockets at modest scales), but I do know some work has gone into improving this over the past couple years. A simple single threaded reactor model is just generally superior to the model we need for file I/O, especially in terms of latency, at small connection counts.

(Note that while it seems to have been largely ignored, Windows 10 got a more epoll-like socket readiness notification interface near the end of its cycle too.)

@slouken
Collaborator

slouken commented Oct 6, 2024

Okay, optional flush/fsyncdata on close is in, 52c58b9.

Unrelated, a pending TODO item is to remove SDL_AsyncIOTask from the public headers, and just return bool instead of a task handle, and this makes sense with the current API.

But code is already implemented in both backends to attempt to cancel in-flight tasks...it's not currently used because the public API doesn't offer it. If we want to add it, we'll need the task handle in the public API. If we don't, we can lose the task handles and delete some code from the backends.

But if we don't offer it now, we can't add it later without adding replacement APIs that return a handle.

So which way do we want to go on this?

Canceling a task isn't guaranteed; if it hasn't started, both our threadpool and io_uring can cancel it, but otherwise they'll just have to wait for results once the work has gotten underway.

For files, this usually isn't a big deal, because presumably everything will finish in finite time. But if we ever open this up for something that can block indefinitely (like socket i/o), cancellation might become super-important. I'm 100% okay with this interface only ever being used on files, though.

I have mixed feelings on this. On the one hand it would be really nice to simplify the API. On the other hand I can totally see someone trying to cancel all pending I/O on shutdown or out of memory or something.

@slouken
Collaborator

slouken commented Oct 6, 2024

Also, is it time for an IORing implementation? :D
