Careful SMP locking - Fix very occasional hangs
Louis Zulli reported that Stockfish suffers from very occasional hangs on his 20-core machine.

Careful SMP debugging revealed that this was caused by a "ghost split point slave": a thread
was marked as a split point slave but wasn't actually working on it.

The only logical explanation for this was double booking, where, due to an SMP race, the same
thread is booked for two different split points simultaneously.

Due to the very intermittent nature of the problem, we can't say exactly how this happens.

The current handling of thread-specific variables is risky, though: volatile variables are in
some cases changed without the spinlock being held, and in that case the standard gives no
guarantees about how the updated values are propagated to other threads.
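
What provides the guarantee is the acquire/release pairing of the lock itself, not the volatile
qualifier. As a rough illustration (a sketch only, not Stockfish's actual Spinlock class), a
spinlock built on std::atomic_flag publishes every write made inside the critical section to the
next thread that acquires the lock:

    #include <atomic>

    // Illustrative only: acquire/release ordering, not 'volatile', is what
    // makes updates made under the lock visible to the next locker.
    class SpinlockSketch {
        std::atomic_flag flag = ATOMIC_FLAG_INIT;

    public:
        void acquire() {
            // Spin until the flag is grabbed; acquire ordering makes everything
            // published by the previous release() visible to this thread.
            while (flag.test_and_set(std::memory_order_acquire)) {}
        }
        void release() {
            // Release ordering publishes all writes done while holding the
            // lock to whichever thread acquires it next.
            flag.clear(std::memory_order_release);
        }
    };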

We resolve the situation by enforcing very strict locking rules (a simplified sketch of the
booking rule follows this list):
- Key thread variables (splitPointsSize, activeSplitPoint, searching) may only be changed
while the thread-specific spinlock is held.
- Structural changes to splitPoints[] are only allowed while the thread-specific spinlock is held.
- Thread booking decisions (per split point) may only be made while the thread-specific spinlock is held.
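
Below is the simplified sketch referenced above, illustrating the booking rule: a slave is booked
only while that slave's own spinlock is held, so two masters racing for the same idle thread can
never both succeed. The Thread struct is cut down to the fields named above, and book_slave() is a
hypothetical helper, not the real Stockfish code:

    #include <atomic>
    #include <vector>

    struct SplitPoint;                         // stand-in for the real SplitPoint

    struct Spinlock {                          // same idea as the sketch above
        std::atomic_flag f = ATOMIC_FLAG_INIT;
        void acquire() { while (f.test_and_set(std::memory_order_acquire)) {} }
        void release() { f.clear(std::memory_order_release); }
    };

    struct Thread {
        Spinlock    spinlock;                  // guards the fields below
        bool        searching        = false;
        SplitPoint* activeSplitPoint = nullptr;
        int         splitPointsSize  = 0;
    };

    // A booking decision is made only while the candidate slave's spinlock is
    // held, so two masters can never book the same thread at the same time.
    Thread* book_slave(std::vector<Thread*>& threads, SplitPoint* sp) {
        for (Thread* slave : threads) {
            slave->spinlock.acquire();
            if (!slave->searching) {           // idle: book it for this split point
                slave->activeSplitPoint = sp;  // key variables changed under the lock
                slave->searching        = true;
                slave->spinlock.release();
                return slave;
            }
            slave->spinlock.release();
        }
        return nullptr;                        // no idle slave available
    }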

With these changes, no hangs occurred during two days of torture testing on Zulli's machine.

We probably pay a slight performance penalty in SMP mode due to the extra locking.

STC (7 threads):
ELO: -1.00 +-2.2 (95%) LOS: 18.4%
Total: 30000 W: 4538 L: 4624 D: 20838

However, stability is worth more than 1-2 ELO points in this case.

No functional change

Resolves official-stockfish#422
zamar committed Sep 10, 2015
1 parent 3e2591d commit 613dc66
Showing 2 changed files with 8 additions and 2 deletions.
src/search.cpp (2 additions, 0 deletions)
@@ -1622,6 +1622,7 @@ void Thread::idle_loop() {
  else
      assert(false);

+ spinlock.acquire();
  assert(searching);

  searching = false;
@@ -1633,6 +1634,7 @@ void Thread::idle_loop() {
  // After releasing the lock we can't access any SplitPoint related data
  // in a safe way because it could have been released under our feet by
  // the sp master.
+ spinlock.release();
  sp->spinlock.release();

  // Try to late join to another split point if none of its slaves has
src/thread.cpp (6 additions, 2 deletions)
@@ -145,6 +145,7 @@ void Thread::split(Position& pos, Stack* ss, Value alpha, Value beta, Value* bes
  SplitPoint& sp = splitPoints[splitPointsSize];

  sp.spinlock.acquire(); // No contention here until we don't increment splitPointsSize
+ spinlock.acquire();

  sp.master = this;
  sp.parentSplitPoint = activeSplitPoint;
@@ -167,6 +168,7 @@ void Thread::split(Position& pos, Stack* ss, Value alpha, Value beta, Value* bes
  ++splitPointsSize;
  activeSplitPoint = &sp;
  activePosition = nullptr;
+ spinlock.release();

  // Try to allocate available threads
  Thread* slave;
@@ -194,6 +196,9 @@ void Thread::split(Position& pos, Stack* ss, Value alpha, Value beta, Value* bes

  Thread::idle_loop(); // Force a call to base class idle_loop()

+ sp.spinlock.acquire();
+ spinlock.acquire();
+
  // In the helpful master concept, a master can help only a sub-tree of its
  // split point and because everything is finished here, it's not possible
  // for the master to be booked.
@@ -205,15 +210,14 @@ void Thread::split(Position& pos, Stack* ss, Value alpha, Value beta, Value* bes
  // We have returned from the idle loop, which means that all threads are
  // finished. Note that decreasing splitPointsSize must be done under lock
  // protection to avoid a race with Thread::can_join().
- sp.spinlock.acquire();
-
  --splitPointsSize;
  activeSplitPoint = sp.parentSplitPoint;
  activePosition = &pos;
  pos.set_nodes_searched(pos.nodes_searched() + sp.nodes);
  *bestMove = sp.bestMove;
  *bestValue = sp.bestValue;

+ spinlock.release();
  sp.spinlock.release();
  }

