Allow preads from non-owner processes #9250

nickva · 2024-12-26T22:32:46Z

At the C API level pread [1] can be called from multiple threads. In Erlang, however, all the calls must be made from a single owner process.

In a busy concurrent environment that means all operations for a file descriptor must be serialized, and could pile up in the controller's mailbox having to wait for potentially slower operations to complete.

To fix it, add the ability to make raw concurrent pread calls from Erlang as well. File descriptor lifetime is still tied to the owner process, so it gets closed when owner dies. Other processes are only allowed to make pread calls, all others still require going through the owner.

Other file layers, like compression and delayed IO, still keep the previous behavior. They have their own get_fd_data/1 functions per layer which check controller ownership. Concurrent preads are not allowed in those layers.

In unix_prim_file.c the seek+read fallback would have required exposing a flag in Erlang in order to keep the old behavior since another process could see the temporarily changed position. However, before adding a new flag looked where pread might not be supported, and it seems most Unix-like OSes since Sun0S 5(Solaris 2.5) and AT&T Sys V Rel4 (so all modern BSD) seem to have it. So perhaps, it's safe to remove the fallback altogether and simplify the code? As a precaution kept a configure check with an early failure and a clear message about it.

This necessitates updating preloaded beam files so an OTP team member would have to take over the PR, if the idea is acceptable to start with. I didn't commit the beam files so any CI tests might not run either?

Issue: #9239

[1] https://www.man7.org/linux/man-pages/man2/pread.2.html

At the C API level pread [1] can be called from multiple threads. In Erlang, however, all the calls must be made from a single owner process. In a busy concurrent environment that means all operations for a file descriptor must be serialized, and could pile up in the controller's mailbox having to wait for potentially slower operations to complete. To fix it, add the ability to make raw concurrent pread calls from Erlang as well. File descriptor lifetime is still tied to the owner process, so it gets closed when owner dies. Other processes are only allowed to make pread calls, all others still require going through the owner. Other file layers, like compression and delayed IO, still keep the previous behavior. They have their own `get_fd_data/1` functions per layer which check controller ownership. Concurrent preads are not allowed in those layers. In unix_prim_file.c the seek+read fallback would have required exposing a flag in Erlang in order to keep the old behavior since another process could see the temporarily changed position. However, before adding a new flag looked where pread might not be supported, and it seems most Unix-like OSes since Sun0S 5(Solaris 2.5) and AT&T Sys V Rel4 (so all modern BSD) seem to have it. So perhaps, it's safe to remove the fallback altogether and simplify the code? As a precaution kept a configure check with an early failure and a clear message about it. This necessitates updating preloaded beam files so an OTP team member would have to take over the PR, if the idea is acceptable to start with. I didn't commit the beam files so any CI tests might not run either? Issue: erlang#9239 [1] https://www.man7.org/linux/man-pages/man2/pread.2.html

github-actions · 2024-12-26T22:33:37Z

CT Test Results

3 files 141 suites 50m 11s ⏱️
1 598 tests 1 549 ✅ 49 💤 0 ❌
2 307 runs 2 238 ✅ 69 💤 0 ❌

Results for commit f0fb6e7.

♻️ This comment has been updated with latest results.

To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass.

See the TESTING and DEVELOPMENT HowTo guides for details about how to run test locally.

Artifacts

// Erlang/OTP Github Action Bot

nickva · 2024-12-26T22:41:43Z

Not sure about this approach and would like some feedback from the OTP team, so I didn't add any tests or documentation yet.

I suspect file_handle_wrapper state machine logic might need updated?
Could this approach work for pwrites as well? With the read buffer reset on pwrite calls, it might be tricky, but I saw there was a read buffer try_lock call, maybe something can be done there too...

So far just tried a simple shell check:

(concurrent-preads) % ./bin/cerl
Erlang/OTP 28 [DEVELOPMENT] [erts-15.2] [source-f4caf91470] [64-bit] [smp:12:12] [ds:12:12:10] [async-threads:1]

1> {ok, Fd} = file:open("README.md", [raw, read, binary]).

{ok,{file_descriptor,prim_file,
                     #{handle => #Ref<0.2159425938.1955201055.257383>,
                       owner => <0.87.0>,
                       r_buffer => #Ref<0.2159425938.1955201036.257561>,
                       r_ahead_size => 0}}}

2> catch file:pread(Fd,2,3).
{ok,<<"[Er">>}

3> spawn(fun() -> erlang:display(file:pread(Fd,2,3)) end).
{ok,<<"[Er">>}
<0.91.0>

4> file:close(Fd).
ok

5> catch file:pread(Fd,2,3).
{error,einval}

6> spawn(fun() -> erlang:display(file:pread(Fd,2,3)) end).
{error,einval}
<0.95.0>

jhogberg · 2024-12-27T11:17:46Z

Thanks for the PR!

I suspect file_handle_wrapper state machine logic might need updated?

Definitely. It's not safe if there can be more than one operation in flight. If there is a concurrent operation of any kind (even "safe" ones like fstat or pread), closing must be delayed until the operation returns lest we risk races causing the wrong file descriptor to be used.

Unfortunately I have to close this PR since we can't accept it; the file module interface is too much of a mess to carve out more exceptions like this. However, I've got a major refactor planned in the backlog that will address the underlying problem by allowing all thread-safe calls to be made concurrently. The idea is to introduce a new interface altogether (working name fs) where the old stuff like compressed files, "file owners," remote files, list mode, and so on aren't a thing, and platform-specific things will be broken out into platform-specific modules rather than trying to cram everything under the same hood. I'll lobby for bumping the priority of this work :-)

nickva · 2024-12-30T22:16:33Z

Thanks for taking a look, John!

Yeah it makes sense about closing it. I didn't want to start messing with the state machine, it looked like it needed expanding with a new set of transitions, and I obviously didn't know what I was doing :-)

However, I've got a major refactor planned in the backlog that will address the underlying problem by allowing all thread-safe calls to be made concurrently.

I like the idea of a new interface. It seems we should be able to take advantage of inherent thread safety of some of the posix api calls. For us (Apache CouchDB) it's not just theoretical improvement, we see frequent process mailbox back-ups for the one single Fd owner. If this bumped the internal priority of the issue even a tiny bit, we're very appreciative!

nickva mentioned this pull request Dec 26, 2024

Raw file pread from multiple processes #9239

Open

jhogberg closed this Dec 27, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow preads from non-owner processes #9250

Allow preads from non-owner processes #9250

nickva commented Dec 26, 2024

github-actions bot commented Dec 26, 2024 •

edited

Loading

nickva commented Dec 26, 2024

jhogberg commented Dec 27, 2024

nickva commented Dec 30, 2024

Allow preads from non-owner processes #9250

Allow preads from non-owner processes #9250

Conversation

nickva commented Dec 26, 2024

github-actions bot commented Dec 26, 2024 • edited Loading

CT Test Results

Artifacts

nickva commented Dec 26, 2024

jhogberg commented Dec 27, 2024

nickva commented Dec 30, 2024

github-actions bot commented Dec 26, 2024 •

edited

Loading