Spanner
- millions of machines
- hundreds of datacenters
- trillions of database rows
If a transaction T1 commits before another transaction T2 starts, then T1's timestamp is smaller than T2's.
Without sharding, if timestamps are assigned as specified in section 4.1.2, external consistency is achieved: when only a single node is involved, that node's clock can serve as TrueTime without uncertainty.
With sharding, TrueTime spanning all the nodes is needed to ensure external consistency.
The zonemaster, location proxy, and spanserver are like the tablet controller, router, and storage unit in [PNUTS]. The placement driver does load balancing at the universe level.
The number of replicas in a Paxos group is not fixed; it is dynamically adjusted by the placement driver. One piece of evidence is that the paper says "The placement driver periodically communicates with the spanservers to find data that needs to be moved, either to meet updated replication constraints or to balance load."
Paxos in Spanner is Multi-Paxos.
RSM re-configuration is not supported. Witnesses are mentioned; my guess is that they serve a similar purpose as in Harp and Cheap Paxos.
The paper says:
On top of the bag of key-value mappings, the Spanner implementation supports a bucketing abstraction called a directory, which is a set of contiguous keys that share a common prefix.
But in Section 2.3, the paper says:
Each row in a directory table with key K, together with all of the rows in descendant tables that start with K in lexicographic order, forms a directory.
Does a directory contain one key K or multiple Ks? The former paragraph and the latter paragraph give different answers. I think the former is correct: a directory is 64MB, and usually a single directory-table row plus its descendant rows can't occupy that much storage. Besides, that granularity would be too fine.
2PL and 2PC ensure serializability. As an MVCC system, Spanner associates a timestamp with each write. Spanner could actually just assign each write the local machine's physical time; the timestamps have nothing to do with serializability. Everything beyond 2PL and 2PC aims to ensure:
a whole-database audit read at a timestamp t will see exactly the effects of every transaction that has committed as of t.
Such a read will observe all writes whose commit timestamps are before s_read in absolute time.
                |   s_read
   v            |            v
[----]          |         [----]
                |v
             [----]
               v|
            [----]
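A minimal sketch of this visibility rule, using a toy in-memory version store of my own (not Spanner's API): a snapshot read at s_read returns, per key, the latest version whose commit timestamp is at or before s_read.

```python
# Toy MVCC version store; only illustrates the visibility rule quoted above.
class VersionedStore:
    def __init__(self):
        self.versions = {}  # key -> list of (commit_ts, value), kept sorted

    def commit_write(self, key, commit_ts, value):
        self.versions.setdefault(key, []).append((commit_ts, value))
        self.versions[key].sort()

    def snapshot_read(self, key, s_read):
        # Latest version whose commit timestamp is <= s_read.
        visible = [v for ts, v in self.versions.get(key, []) if ts <= s_read]
        return visible[-1] if visible else None

store = VersionedStore()
store.commit_write("x", 5, "v1")
store.commit_write("x", 12, "v2")
print(store.snapshot_read("x", 10))  # v1 -- the write at 12 is after s_read
print(store.snapshot_read("x", 12))  # v2 -- the write at 12 is visible
```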
Spanner's lease management is much more complicated than GFS's. Spanner's leader-lease management not only ensures the disjointness of leaders' lease intervals (at most one live leader at a time) but also the disjointness of timestamps across lease intervals. In GFS, all lease management goes through the single master.
Why this requirement: "As a result, they can be assigned timestamps at any time when all locks have been acquired, but before any locks have been released"?

t_abs(e^all-locks-acquired) <= s <= t_abs(e^any-lock-released)

because

t_abs(e^all-locks-acquired) <= t_abs(e^computing-s)
t_abs(e^computing-s) <= s
s < t_abs(e^commit)
t_abs(e^commit) <= t_abs(e^any-lock-released)

Without this condition, the following situation might occur: even though T1 can see T2's writes, T1's timestamp s1 is less than T2's timestamp s2.

      lk1   lk2              lk3
T1 ---|-----|--|------------|-----------
               s1

            lk3       commit
T2 ---------|-----|---|---------------
                  s2

Requirement for the commit timestamp: s >= TT.now().latest, computed after e_i^server. Actually, Spanner picks s = TT.now().latest after acquiring all locks, which satisfies the above requirement.
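A sketch of that ordering under a toy TrueTime model of my own (a fixed, assumed error bound around time.time()): the commit timestamp is picked only after all locks are held, and commit wait keeps the locks until s is guaranteed to be in the past.

```python
import time

EPSILON = 0.004  # assumed TrueTime error bound in seconds (made-up value)

def tt_now():
    """Toy TrueTime: absolute time lies within [earliest, latest]."""
    t = time.time()
    return (t - EPSILON, t + EPSILON)

def tt_after(ts):
    """True once ts is guaranteed to be in the past."""
    earliest, _ = tt_now()
    return earliest > ts

def commit(acquire_locks, apply_writes, release_locks):
    acquire_locks()                 # 2PL: all locks held from here on
    _, s = tt_now()                 # s = TT.now().latest, after all locks acquired
    apply_writes(s)                 # log the writes with commit timestamp s
    while not tt_after(s):          # commit wait: the commit is not visible
        time.sleep(EPSILON / 4)     # until s is guaranteed to be in the past
    release_locks()                 # only now release locks / ack the client
    return s

print("committed at", commit(lambda: None, lambda ts: None, lambda: None))
```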
tsafe = min(tsafePaxos, tsafeTM)
Imagine a Paxos leader with the following state. It cannot satisfy a read with t = 10,
since a commit may happen at 9.
Paxos write: 2 4 6 10
Prepared transaction timestamp: 9
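A small helper of my own (not Spanner code) that computes t_safe for this state and checks whether the read can be served:

```python
# t_safe = min(t_safe_paxos, t_safe_tm), where t_safe_tm is one less than the
# smallest prepared-but-not-committed transaction timestamp.
def t_safe(paxos_write_timestamps, prepared_timestamps):
    t_safe_paxos = max(paxos_write_timestamps)          # highest applied Paxos write
    t_safe_tm = (min(prepared_timestamps) - 1
                 if prepared_timestamps else float("inf"))
    return min(t_safe_paxos, t_safe_tm)

state_t_safe = t_safe([2, 4, 6, 10], [9])
print(state_t_safe)            # 8
print(10 <= state_t_safe)      # False: a read at t = 10 must wait
```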
Wound-wait prefers old transactions. Usually, old transactions have smaller ids, since ids are assigned in increasing order.
Read-write transactions happen in two phases: a read phase and a write phase.
Spanner automatically infers the read locks (see the first paragraph of 4.2.2).
After the read phase is finished, Spanner infers the write locks. The
difficult transaction in [Case] can be handled well by Spanner:
y = read(x); write(y)
Are there general transactions for which Spanner can't do the lock inference? It is possible that Spanner can't handle some transactions written in an extremely dynamic way. Imagine a transaction containing the following operations:

Write1
Read1
Write2

Read1 reads the result written by Write1, and Write2 depends on the result of Read1. Such a transaction can't simply be split into a read phase followed by a write phase. One example of Read1 is getting the latest value of an auto_increment column in MySQL. A sketch of this shape is given below.
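A minimal Python sketch of this shape, with made-up key names and a plain dict standing in for the database: the key touched by Write2 is only known after Read1, which in turn has to observe Write1.

```python
# A toy transaction with the shape discussed above (hypothetical key names).
# Read1 depends on Write1, and Write2's key depends on Read1's result, so the
# operations cannot simply be reordered into "all reads, then all writes".
def dynamic_txn(db):
    db["counter"] = db.get("counter", 0) + 1   # Write1
    n = db["counter"]                          # Read1: must see Write1's result
    db["row_%d" % n] = "payload"               # Write2: key depends on Read1

db = {}
dynamic_txn(db)
print(db)   # {'counter': 1, 'row_1': 'payload'}
```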
Spanner chooses the timestamp at the commit point as the timestamp for a transaction.
Spanner uses SS2PL (strong strict two-phase locking). Understand why the following sentence from 4.1.2 holds for read-write transactions.
Why "greater than any timestamps the leader has assigned to previous
transactions?"
The coordinator leader (CL) receives T1. e is 7ms when computing
s1 = TT.now().latest = 15ms at clock time 8ms. s1 is assigned to T1. Then CL
synchronizes its clock with a time master, and e drops to 1ms. CL receives T2
and computes s2 = TT.now().latest = 10ms at clock time 9ms. s2 is assigned to
T2. s1 > s2, which breaks "A single leader replica can trivially assign
timestamps in monotonically increasing order."
      1       8       15
T1 ---|-------+-------|-----------

               8 9 10
T2 ------------|-+-|---------------
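One way (my own fix-up, not something the paper spells out) for a leader to keep assigned timestamps monotonic despite TT.now().latest shrinking after a clock sync is to remember the last timestamp it handed out:

```python
# Toy leader clock: never assign a timestamp below the last one handed out,
# even if TT.now().latest goes backwards (timestamps are integer ms ticks here).
class LeaderClock:
    def __init__(self):
        self.last_assigned = 0

    def assign(self, tt_now_latest):
        s = max(tt_now_latest, self.last_assigned + 1)
        self.last_assigned = s
        return s

clock = LeaderClock()
print(clock.assign(15))  # s1 = 15 (e = 7ms at clock time 8ms)
print(clock.assign(10))  # 16, not 10: monotonicity is preserved
```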
The phrase "TT.now().latest at the time the coordinator received its commit message" refers to the e_i^server event.
As a result, they can be assigned timestamps at any time when all locks have been acquired, but before any locks have been released.
Define the instantaneous error bound as e. If the clock time is t, the absolute time is in [t-e, t+e]. The maximum difference between any two clocks is less than or equal to 2e.
the slowest clock
    |
|-------|
|<--2e->|
        |
        | absolute time
        |
        |<--2e->|
        |-------|
            |
            the fastest clock
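A numeric check of the 2e bound with made-up values: each clock stays within e of absolute time, so any two clocks differ by at most 2e.

```python
# Each clock reads within e of absolute time, so the slowest and fastest
# possible readings differ by at most 2e (values are made up).
e = 3.5                                # instantaneous error bound in ms
absolute_time = 100.0

slowest_clock = absolute_time - e      # reads 96.5
fastest_clock = absolute_time + e      # reads 103.5

assert fastest_clock - slowest_clock <= 2 * e
print(fastest_clock - slowest_clock)   # 7.0 == 2e
```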
Picking TT.now().latest ensures that the commit timestamp is an absolute time within the phase when all locks are held. If the local clock were chosen as the commit timestamp, external consistency would still be ensured. Imagine a system with 2e = 7ms; in such a system, the maximum difference between any two clocks is less than 7ms. Use Range A to denote the clock range when T1 picks its commit timestamp, and Range B to denote the clock range when T2 starts, after 7ms have elapsed. Since Range A and Range B do not overlap (if local clock drift is ignored), T2's commit timestamp is larger than T1's.
Range A
|-------|     Range B
        |     |-------|
        |     |
        v     v
==========================> absolute time
But such a commit timestamp is not an absolute time during the all-locks-held phase. Imagine a clock that is 3.5ms slower than absolute time:

clock time:     1    8
absolute time:  4.5  11.5

Picking the clock time at the beginning of the all-locks-held phase gives 1ms, which is not an absolute time during the all-locks-held phase (absolute times 4.5ms to 11.5ms).
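A numeric check of this example (values from the table above): the local-clock pick at the start of the all-locks-held phase falls outside the absolute-time window in which the locks are held, while TT.now().latest (assuming e = 3.5ms here) does not.

```python
# Clock runs 3.5ms behind absolute time; locks are held for clock times 1..8,
# i.e. absolute times 4.5..11.5.
clock_drift = 3.5
clock_start, clock_end = 1.0, 8.0
abs_start, abs_end = clock_start + clock_drift, clock_end + clock_drift

local_pick = clock_start                       # local-clock choice: 1ms
print(abs_start <= local_pick <= abs_end)      # False: 1 is outside [4.5, 11.5]

tt_latest_pick = clock_start + 3.5             # TT.now().latest with assumed e = 3.5ms
print(abs_start <= tt_latest_pick <= abs_end)  # True: 4.5 falls inside the window
```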
- Single-site RO transaction: use LastTS().
- Multi-site RO transaction: the client picks TT.now().latest.
- tsafeTM: a fine-grained mapping from key ranges to prepared-transaction timestamps. For the example in 4.1.3, if we know that the read does not conflict with the transaction whose prepared timestamp is 9, we can serve the read with tsafe = 10.
- LastTS(): a fine-grained mapping from key ranges to commit timestamps.
Imagine an RO transaction that involves key1, key2 and key3. The mapping has the following entries:
key1: 6
key2: 4
key3: 8
max(6, 4, 8) is 8. If there is no conflicting prepared transaction, the RO transaction can be assigned timestamp 8. But if there is a prepared transaction that involves any of key1, key2 and key3, this optimization is not feasible.
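A sketch of this check with a toy LastTS mapping (my own helper, hypothetical key names):

```python
# Assign a read-only transaction the max commit timestamp over the keys it
# touches, provided none of those keys has a prepared (uncommitted) transaction.
def ro_timestamp(keys, last_ts, prepared_keys):
    if any(k in prepared_keys for k in keys):
        return None          # conflicting prepared transaction: fall back
    return max(last_ts[k] for k in keys)

last_ts = {"key1": 6, "key2": 4, "key3": 8}
print(ro_timestamp(["key1", "key2", "key3"], last_ts, prepared_keys=set()))     # 8
print(ro_timestamp(["key1", "key2", "key3"], last_ts, prepared_keys={"key2"}))  # None
```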
- MinNextTS(). If MinNextTS(10) = 100, then tsafePaxos can be advanced to 99 since the next Paxos write will have a timestamp no less than 100.
I think that the value of MinNextTS() could be set in the following way. For any event after the read,

t_abs(read) < t_abs(any event after the read)

so

MinNextTS() = (TT.now().latest when the read happens) - local_clock_drift_error
            < TT.now().latest at any event after the read
            <= s_{a new commit}, the timestamp of any new commit

But this approach would advance s_max eagerly, which is not acceptable.
The paper says "As the number of replicas increases, the latency stays roughly constant with less standard deviation because Paxos executes in parallel at a group’s replicas.". Why the latency stays roughly constant? If only 1 replica is involved, no Paxos network latency. So as the number replica increases, the Paxos latency should increase. But since Paxos latency is lower-bounded by commit wait. And replicas in a Paxos group execute in parallel. So the latency stays roughly the same. Why less standard deviation? As the number of replicas increases, a Paxos write involves more machines. So the latency is decided by more machines stead of one. Why "As the number of replicas increases, the latency to achieve a quorum becomes less sensitive to slowness at one slave replica."? As the number of replicas increases, the number of concurrent failures n to fail Paxos increases. Assume that the probability of a single machine failure is p, then the probability to fail a Paxos is p^n. 6.824-2012 lab7 returns to a client request only if all replicas succeed. But Spanner only requires a quorum of replica successes.
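A small Monte-Carlo sketch of the quorum-latency argument, with a made-up per-replica latency distribution: the commit latency is the time of the majority ack, i.e. an order statistic over more independent samples as the group grows, so its mean stays roughly flat while its standard deviation shrinks.

```python
import random
import statistics

# Commit latency is modeled as the k-th smallest of n replica latencies,
# where k is the majority quorum size (the latency model is made up).
def quorum_latency(n, trials=20000):
    samples = []
    for _ in range(trials):
        lats = sorted(random.expovariate(1.0) for _ in range(n))
        samples.append(lats[n // 2])      # (n//2 + 1)-th ack completes the quorum
    return statistics.mean(samples), statistics.stdev(samples)

for n in (1, 3, 5):
    mean, stdev = quorum_latency(n)
    print(f"replicas={n}: mean={mean:.2f} stdev={stdev:.2f}")
```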
Why is the standard deviation even larger in read latencies than in write latencies? The reason is that a write involves more servers while a read involves fewer servers. What is the replication level of Colossus? The paper does not specify it, but I think the replication level is 1: F1 is already replicated 5 ways at the tablet level, and Colossus uses Reed–Solomon error correction.
- OSDI12
- DSRG Post
- 6.824 Notes
- 6.830 Notes
- PNUTS
- [The Case for Determinism in Database Systems](http://db.cs.yale.edu/determinism-vldb10.pdf)
- InfoQ Talk
- Spanner's Concurrency Control