
refactor: Change recursive_mutex to mutex in DatabaseRotatingImp #5276

Open
wants to merge 8 commits into develop
Conversation


@ximinez ximinez commented Feb 4, 2025

High Level Overview of Change

Follow-up to #4989, which stated "Ideally, the code should be rewritten so it doesn't hold the mutex during the callback and the mutex should be changed back to a regular mutex."

This rewrites the code so that the lock is not held during the callback. Instead, it locks twice: once before the callback and once after. This is safe due to the structure of the code, and that assumption is verified after the second lock is taken. This allows mutex_ to be changed back to a regular mutex.
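As a rough sketch of that shape (hypothetical names and simplified state, not the actual diff), the callback runs between two separate lock scopes:

#include <cassert>
#include <functional>
#include <memory>
#include <mutex>
#include <string>

struct Backend { std::string name; };

class RotatingSketch
{
    std::mutex mutex_;  // plain mutex again, not recursive_mutex
    std::shared_ptr<Backend> writableBackend_ = std::make_shared<Backend>();
    std::shared_ptr<Backend> archiveBackend_;

public:
    void rotateWithLock(
        std::function<std::shared_ptr<Backend>(std::string const&)> const& f)
    {
        std::string name;
        {
            std::lock_guard lock(mutex_);  // first lock: snapshot state
            name = writableBackend_->name;
        }

        auto newBackend = f(name);  // callback runs with no lock held

        std::lock_guard lock(mutex_);  // second lock: commit the rotation
        // The structure of the code guarantees nothing rotated in
        // between; verify that assumption before swapping.
        assert(writableBackend_->name == name);
        archiveBackend_ = std::move(writableBackend_);
        writableBackend_ = std::move(newBackend);
    }
};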

Context of Change

From #4989:

The rotateWithLock function holds a lock while it calls a callback function that's passed in by the caller. This is a problematic design that needs to be used very carefully. In this case, at least one caller passed in a callback that eventually relocks the mutex on the same thread, causing UB (a deadlock was observed). The caller was from SHAMapStoreImpl, and it called clearCaches. This clearCaches can potentially call fetchNodeObject, which tries to relock the mutex.

This patch resolves the issue by changing the mutex type to a recursive_mutex. Ideally, the code should be rewritten so it doesn't hold the mutex during the callback and the mutex should be changed back to a regular mutex.
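A reduced reproduction of that deadlock (hypothetical stand-ins for the clearCaches/fetchNodeObject call chain):

#include <mutex>

std::mutex m;  // non-recursive: relocking on the same thread is UB

void fetchNodeObject()  // stand-in for the re-entrant callee
{
    std::lock_guard lock(m);  // second lock on the same thread: deadlock
}

void rotateWithLock()
{
    std::lock_guard lock(m);  // first lock, held across the callback
    fetchNodeObject();        // callback path eventually relocks the mutex
}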

Type of Change

  • Refactor (non-breaking change that only restructures code)

Test Plan

Testing can be the same as that for #4989, plus ensure that there are no regressions.

- Follow-up to #4989, which stated "Ideally, the code should be
  rewritten so it doesn't hold the mutex during the callback and the
  mutex should be changed back to a regular mutex."

codecov bot commented Feb 4, 2025

Codecov Report

Attention: Patch coverage is 76.92308% with 9 lines in your changes missing coverage. Please review.

Project coverage is 78.2%. Comparing base (0968cdf) to head (3b6984c).

Files with missing lines                             Patch %   Lines
src/xrpld/app/misc/SHAMapStoreImp.cpp                  66.7%   5 Missing ⚠️
src/xrpld/nodestore/detail/DatabaseRotatingImp.cpp     83.3%   4 Missing ⚠️
Additional details and impacted files


@@           Coverage Diff           @@
##           develop   #5276   +/-   ##
=======================================
  Coverage     78.1%   78.2%           
=======================================
  Files          790     790           
  Lines        67623   67643   +20     
  Branches      8163    8166    +3     
=======================================
+ Hits         52846   52864   +18     
- Misses       14777   14779    +2     
Files with missing lines                             Coverage Δ
src/xrpld/nodestore/DatabaseRotating.h               100.0% <ø> (ø)
src/xrpld/nodestore/detail/DatabaseRotatingImp.h     66.7% <ø> (ø)
src/xrpld/nodestore/detail/DatabaseRotatingImp.cpp   69.7% <83.3%> (+9.0%) ⬆️
src/xrpld/app/misc/SHAMapStoreImp.cpp                75.5% <66.7%> (-1.0%) ⬇️

... and 2 files with indirect coverage changes


* Use a second mutex to protect the backends from modification
* Remove a bunch of warning comments
@ximinez ximinez requested a review from Bronek February 6, 2025 00:01
Comment on lines 83 to 85
// backendMutex_ is only needed when the *Backend_ members are modified.
// Reads are protected by the general mutex_.
std::mutex backendMutex_;
Collaborator

As this sounds like a typical single-write and one-or-more-read scenario, is it possible to use a single shared_mutex here instead of these two mutexes?

Collaborator Author

It's possible, but there are risks. The biggest one is that I'd have to take a shared_lock at the start of rotateWithLock, and upgrade it to a unique_lock after the callback. If there is somehow ever a second caller to that function, or even a different caller that upgrades the lock, there is a potential deadlock.
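To illustrate the hazard (not code from the PR): acquiring the unique lock while still holding the shared lock self-deadlocks, and the safe relinquish-then-relock sequence opens a window that needs separate coordination:

#include <shared_mutex>

std::shared_mutex m;

void brokenUpgrade()
{
    std::shared_lock read(m);   // shared ownership during the callback
    std::unique_lock write(m);  // blocks until all readers release,
                                // including this thread's own shared
                                // lock: self-deadlock
}

void relinquishThenRelock()
{
    std::shared_lock read(m);
    read.unlock();              // give up shared ownership first...
    std::unique_lock write(m);  // ...but another thread may get in
                                // between, so state read under the
                                // shared lock must be revalidated here
}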

Collaborator Author

@bthomee @vvysokikh1 Ok, it took waaaaaaay longer than it should have because I kept trying clever things that didn't work or turned out to be unsupported, but I rewrote the locking and changed to a shared mutex, and I think I've got a pretty foolproof solution here. And a unit test to exercise it.

But don't take my word for it. The point of code reviews is to spot the stuff I didn't consider.

Collaborator

@vvysokikh1 vvysokikh1 left a comment

I think your solution does not completely solve the issue. It's still technically possible to deadlock: calling rotateWithLock from inside the callback will deadlock on your new mutex.

If it's good enough for now, please add comments to rotateWithLock() warning any user against calling rotateWithLock() directly or indirectly from the callback.

* upstream/develop:
  Updates Conan dependencies (5256)
- Rewrite the locking in DatabaseRotatingImp::rotateWithLock to use
  a shared_lock, and write a unit test to show (as much as possible)
  that it won't deadlock.
* upstream/develop:
  fix: Do not allow creating Permissioned Domains if credentials are not enabled (5275)
  fix: issues in `simulate` RPC (5265)
Comment on lines +55 to +70
std::unique_lock writeLock(mutex_);
if (!rotating)
{
    // Once this flag is set, we're committed to doing the work and
    // returning true.
    rotating = true;
}
else
{
    // This should only be reachable through unit tests.
    XRPL_ASSERT(
        unitTest_,
        "ripple::NodeStore::DatabaseRotatingImp::rotateWithLock "
        "unit testing");
    return false;
}
Collaborator

Why do we need to lock the mutex here? I would assume we can make rotating an atomic bool and use compare_exchange to switch this flag safely.
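A sketch of that suggestion, assuming rotating becomes a std::atomic<bool>:

#include <atomic>

std::atomic<bool> rotating{false};

bool tryBeginRotation()
{
    bool expected = false;
    // Atomically flip rotating from false to true; if another thread (or
    // a reentrant call from the callback) already set it, back off.
    if (!rotating.compare_exchange_strong(expected, true))
        return false;
    // ... perform the rotation, then clear the flag ...
    rotating.store(false);
    return true;
}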

Comment on lines +72 to +82
auto const writableBackend = [&] {
    std::shared_lock readLock(mutex_);
    XRPL_ASSERT(
        rotating,
        "ripple::NodeStore::DatabaseRotatingImp::rotateWithLock rotating "
        "flag set");

    return writableBackend_;
}();

auto newBackend = f(writableBackend->getName());
Collaborator

I don't think this lambda and read lock are actually required with the current implementation. We only take the write lock before (which might be switched to an atomic) and after. Assuming the previous synchronization block sets the rotating flag, no other 'write' thread should be able to proceed and capture the writeLock while we are here.


clearCaches(validatedSeq);

return std::move(newBackend);
Collaborator

Since you have changed the return type of rotateWithLock(), in the future the callback could be executed but false still returned. In that case newBackend has already been moved from, yet the caller then tries to clean it up with setDeletePath().

Non-issue right now, but maybe discard the move unless you have strong perf concerns?
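Illustrative example of the concern (hypothetical names): a captured local returned with std::move is left moved-from, so it can no longer be handed to a cleanup routine afterwards:

#include <memory>
#include <string>

void example()
{
    auto newBackend = std::make_shared<std::string>("backend");

    auto callback = [&]() -> std::shared_ptr<std::string> {
        return std::move(newBackend);  // leaves newBackend null here
        // return newBackend;          // copy: newBackend stays usable
    };

    auto result = callback();
    // If the rotation is abandoned after this point, newBackend is empty
    // and cannot be passed to cleanup such as setDeletePath().
}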

Comment on lines +64 to +69
// This should only be reachable through unit tests.
XRPL_ASSERT(
    unitTest_,
    "ripple::NodeStore::DatabaseRotatingImp::rotateWithLock "
    "unit testing");
return false;
Collaborator

I think this comment doesn't hold. This branch can be reached not only by unit tests, but also by an accidental concurrent call to rotateWithLock, or by an indirect call to rotateWithLock from the callback.

// "Shared mutexes do not support direct transition from shared to unique
// ownership mode: the shared lock has to be relinquished with
// unlock_shared() before exclusive ownership may be obtained with lock()."
mutable std::shared_timed_mutex mutex_;
Collaborator

What is the reason for choosing a timed mutex here? I believe shared_mutex would be enough.
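For comparison, std::shared_mutex (C++17) already supports every operation used in this code; shared_timed_mutex only adds the timed try_lock_for/try_lock_until variants:

#include <shared_mutex>

std::shared_mutex mtx;  // sufficient unless timed acquisition is needed

void reader()
{
    std::shared_lock lock(mtx);  // many concurrent readers
}

void writer()
{
    std::unique_lock lock(mtx);  // one exclusive writer
}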
