
refactor: Change recursive_mutex to mutex in DatabaseRotatingImp #5276

Open
wants to merge 8 commits into develop
Conversation


@ximinez ximinez commented Feb 4, 2025

High Level Overview of Change

Follow-up to #4989, which stated "Ideally, the code should be rewritten so it doesn't hold the mutex during the callback and the mutex should be changed back to a regular mutex."

This rewrites the code so that the lock is not held during the callback. Instead, it locks twice: once before the callback and once after. This is safe due to the structure of the code, and that assumption is verified after the second lock is taken. This allows mutex_ to be changed back to a regular mutex.
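As a rough sketch of that shape (hypothetical names and simplified state, not the actual diff), the callback runs between two separate lock scopes:

#include <cassert>
#include <functional>
#include <memory>
#include <mutex>
#include <string>

struct Backend { std::string name; };

class RotatingSketch
{
    std::mutex mutex_;  // plain mutex again, not recursive_mutex
    std::shared_ptr<Backend> writableBackend_ = std::make_shared<Backend>();
    std::shared_ptr<Backend> archiveBackend_;

public:
    void rotateWithLock(
        std::function<std::shared_ptr<Backend>(std::string const&)> const& f)
    {
        std::string name;
        {
            std::lock_guard lock(mutex_);  // first lock: snapshot state
            name = writableBackend_->name;
        }

        auto newBackend = f(name);  // callback runs with no lock held

        std::lock_guard lock(mutex_);  // second lock: commit the rotation
        // The structure of the code guarantees nothing rotated in
        // between; verify that assumption before swapping.
        assert(writableBackend_->name == name);
        archiveBackend_ = std::move(writableBackend_);
        writableBackend_ = std::move(newBackend);
    }
};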

Context of Change

From #4989:

The rotateWithLock function holds a lock while it calls a callback function that's passed in by the caller. This is a problematic design that needs to be used very carefully. In this case, at least one caller passed in a callback that eventually relocks the mutex on the same thread, causing UB (a deadlock was observed). The caller was from SHAMapStoreImpl, and it called clearCaches. This clearCaches can potentially call fetchNodeObject, which tries to relock the mutex.

This patch resolves the issue by changing the mutex type to a recursive_mutex. Ideally, the code should be rewritten so it doesn't hold the mutex during the callback and the mutex should be changed back to a regular mutex.
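A reduced reproduction of that deadlock (hypothetical stand-ins for the clearCaches/fetchNodeObject call chain):

#include <mutex>

std::mutex m;  // non-recursive: relocking on the same thread is UB

void fetchNodeObject()  // stand-in for the re-entrant callee
{
    std::lock_guard lock(m);  // second lock on the same thread: deadlock
}

void rotateWithLock()
{
    std::lock_guard lock(m);  // first lock, held across the callback
    fetchNodeObject();        // callback path eventually relocks the mutex
}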

Type of Change

  • Refactor (non-breaking change that only restructures code)

Test Plan

Testing can be the same as that for #4989, plus ensure that there are no regressions.

- Follow-up to #4989, which stated "Ideally, the code should be
  rewritten so it doesn't hold the mutex during the callback and the
  mutex should be changed back to a regular mutex."

codecov bot commented Feb 4, 2025

Codecov Report

Attention: Patch coverage is 76.92308% with 9 lines in your changes missing coverage. Please review.

Project coverage is 78.2%. Comparing base (0968cdf) to head (3b6984c).

Files with missing lines                             Patch %   Lines
src/xrpld/app/misc/SHAMapStoreImp.cpp                  66.7%   5 Missing ⚠️
src/xrpld/nodestore/detail/DatabaseRotatingImp.cpp     83.3%   4 Missing ⚠️
Additional details and impacted files


@@           Coverage Diff           @@
##           develop   #5276   +/-   ##
=======================================
  Coverage     78.1%   78.2%           
=======================================
  Files          790     790           
  Lines        67623   67643   +20     
  Branches      8163    8166    +3     
=======================================
+ Hits         52846   52864   +18     
- Misses       14777   14779    +2     
Files with missing lines                             Coverage Δ
src/xrpld/nodestore/DatabaseRotating.h               100.0% <ø> (ø)
src/xrpld/nodestore/detail/DatabaseRotatingImp.h     66.7% <ø> (ø)
src/xrpld/nodestore/detail/DatabaseRotatingImp.cpp   69.7% <83.3%> (+9.0%) ⬆️
src/xrpld/app/misc/SHAMapStoreImp.cpp                75.5% <66.7%> (-1.0%) ⬇️

... and 2 files with indirect coverage changes


* Use a second mutex to protect the backends from modification
* Remove a bunch of warning comments
@ximinez ximinez requested a review from Bronek February 6, 2025 00:01
Comment on lines 83 to 85
// backendMutex_ is only needed when the *Backend_ members are modified.
// Reads are protected by the general mutex_.
std::mutex backendMutex_;
Collaborator

As this sounds like a typical single-write and one-or-more-read scenario, is it possible to use a single shared_mutex here instead of these two mutexes?

Collaborator Author

It's possible, but there are risks. The biggest one is that I'd have to take a shared_lock at the start of rotateWithLock, and upgrade it to a unique_lock after the callback. If there is somehow ever a second caller to that function, or even a different caller that upgrades the lock, there is a potential deadlock.
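To illustrate the hazard (not code from the PR): acquiring the unique lock while still holding the shared lock self-deadlocks, and the safe relinquish-then-relock sequence opens a window that needs separate coordination:

#include <shared_mutex>

std::shared_mutex m;

void brokenUpgrade()
{
    std::shared_lock read(m);   // shared ownership during the callback
    std::unique_lock write(m);  // blocks until all readers release,
                                // including this thread's own shared
                                // lock: self-deadlock
}

void relinquishThenRelock()
{
    std::shared_lock read(m);
    read.unlock();              // give up shared ownership first...
    std::unique_lock write(m);  // ...but another thread may get in
                                // between, so state read under the
                                // shared lock must be revalidated here
}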

Collaborator Author

@bthomee @vvysokikh1 Ok, it took waaaaaaay longer than it should have because I kept trying clever things that didn't work or turned out to be unsupported, but I rewrote the locking and changed to a shared mutex, and I think I've got a pretty foolproof solution here. And a unit test to exercise it.

But don't take my word for it. The point of code reviews is to spot the stuff I didn't consider.

Collaborator

@vvysokikh1 vvysokikh1 left a comment

I think your solution does not completely solve the issue. It's still technically possible to deadlock: calling rotateWithLock from inside the callback will deadlock on your new mutex.

If it's good enough for now, please add comments to rotateWithLock() warning any user against calling rotateWithLock() directly or indirectly from the callback.

* upstream/develop:
  Updates Conan dependencies (5256)
- Rewrite the locking in DatabaseRotatingImp::rotateWithLock to use
  a shared_lock, and write a unit test to show (as much as possible)
  that it won't deadlock.
* upstream/develop:
  fix: Do not allow creating Permissioned Domains if credentials are not enabled (5275)
  fix: issues in `simulate` RPC (5265)
Comment on lines +55 to +70
std::unique_lock writeLock(mutex_);
if (!rotating)
{
    // Once this flag is set, we're committed to doing the work and
    // returning true.
    rotating = true;
}
else
{
    // This should only be reachable through unit tests.
    XRPL_ASSERT(
        unitTest_,
        "ripple::NodeStore::DatabaseRotatingImp::rotateWithLock "
        "unit testing");
    return false;
}
Collaborator

Why do we need to lock the mutex here? I would assume we can make rotating an atomic bool and use compare_exchange to switch this flag safely.
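A sketch of that suggestion, assuming rotating becomes a std::atomic<bool>:

#include <atomic>

std::atomic<bool> rotating{false};

bool tryBeginRotation()
{
    bool expected = false;
    // Atomically flip rotating from false to true; if another thread (or
    // a reentrant call from the callback) already set it, back off.
    if (!rotating.compare_exchange_strong(expected, true))
        return false;
    // ... perform the rotation, then clear the flag ...
    rotating.store(false);
    return true;
}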

Comment on lines +72 to +82
auto const writableBackend = [&] {
    std::shared_lock readLock(mutex_);
    XRPL_ASSERT(
        rotating,
        "ripple::NodeStore::DatabaseRotatingImp::rotateWithLock rotating "
        "flag set");

    return writableBackend_;
}();

auto newBackend = f(writableBackend->getName());
Collaborator

I don't think this lambda and read lock are actually required with the current implementation. We only take the write lock before (which might be switched to an atomic) and after. Assuming the previous synchronization block sets the rotating flag, no other 'write' thread should be able to proceed and capture the writeLock while we are here.


clearCaches(validatedSeq);

return std::move(newBackend);
Collaborator

Since you have changed the return type of rotateWithLock(), in the future the callback could be executed but false still returned. In that case newBackend has already been moved from, yet the caller then tries to clean it up with setDeletePath().

Non-issue right now, but maybe discard the move unless you have strong perf concerns?
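Illustrative example of the concern (hypothetical names): a captured local returned with std::move is left moved-from, so it can no longer be handed to a cleanup routine afterwards:

#include <memory>
#include <string>

void example()
{
    auto newBackend = std::make_shared<std::string>("backend");

    auto callback = [&]() -> std::shared_ptr<std::string> {
        return std::move(newBackend);  // leaves newBackend null here
        // return newBackend;          // copy: newBackend stays usable
    };

    auto result = callback();
    // If the rotation is abandoned after this point, newBackend is empty
    // and cannot be passed to cleanup such as setDeletePath().
}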

Comment on lines +64 to +69
// This should only be reachable through unit tests.
XRPL_ASSERT(
    unitTest_,
    "ripple::NodeStore::DatabaseRotatingImp::rotateWithLock "
    "unit testing");
return false;
Collaborator

I think this comment doesn't hold. This branch can be reached not only by unit tests, but also by an accidental concurrent call to rotateWithLock, or by an indirect call to rotateWithLock from the callback.

// "Shared mutexes do not support direct transition from shared to unique
// ownership mode: the shared lock has to be relinquished with
// unlock_shared() before exclusive ownership may be obtained with lock()."
mutable std::shared_timed_mutex mutex_;
Collaborator

What is the reason for choosing a timed mutex here? I believe shared_mutex would be enough.
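For comparison, std::shared_mutex (C++17) already supports every operation used in this code; shared_timed_mutex only adds the timed try_lock_for/try_lock_until variants:

#include <shared_mutex>

std::shared_mutex mtx;  // sufficient unless timed acquisition is needed

void reader()
{
    std::shared_lock lock(mtx);  // many concurrent readers
}

void writer()
{
    std::unique_lock lock(mtx);  // one exclusive writer
}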
