Spanner
- millions of machines
- hundreds of datacenters
- trillions of database rows
If a transaction T1 commits before another transaction T2 starts, then T1's timestamp is smaller than T2's.
Without sharding, if timestamps are assigned as specified in section 4.1.2, external consistency is achieved: when only a single node is involved, that node's clock can serve as TrueTime without uncertainty.
With sharding, TrueTime spanning all the nodes is needed to ensure external consistency.
The zonemaster, location proxy, and spanserver are like the tablet controller, router, and storage unit in [PNUTS]. The placement driver does load balancing at the universe level.
The number of replicas in a Paxos group is not fixed; it is dynamically adjusted by the placement driver. One piece of evidence is that the paper says "The placement driver periodically communicates with the spanservers to find data that needs to be moved, either to meet updated replication constraints or to balance load."
Paxos in Spanner is Multi-Paxos.
RSM re-configuration is not supported. Witnesses are mentioned; my guess is that they serve a similar purpose as in Harp and Cheap Paxos.
The paper says:
On top of the bag of key-value mappings, the Spanner implementation supports a bucketing abstraction called a directory, which is a set of contiguous keys that share a common prefix.
But in Section 2.3, the paper says:
Each row in a directory table with key K, together with all of the rows in descendant tables that start with K in lexicographic order, forms a directory.
Does a directory contain one key K or multiple Ks? The former paragraph and the latter paragraph give different answers. I think the former is correct: a directory is 64MB, and usually a single directory-table row plus its descendant rows can't occupy that much storage. Besides, that granularity would be too fine.
2PL and 2PC ensure serializability. As an MVCC system, Spanner associates a timestamp with each write. Spanner could actually just assign each write the local machine's physical time; the timestamps have nothing to do with serializability. Everything beyond 2PL and 2PC aims to ensure:
a whole-database audit read at a timestamp t will see exactly the effects of every transaction that has committed as of t.
Such a read will observe all writes whose commit timestamps are before s_read in absolute time.
                |   s_read
   v            |            v
[----]          |         [----]
                |v
             [----]
               v|
            [----]
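A minimal sketch of this visibility rule, using a toy in-memory version store of my own (not Spanner's API): a snapshot read at s_read returns, per key, the latest version whose commit timestamp is at or before s_read.

```python
# Toy MVCC version store; only illustrates the visibility rule quoted above.
class VersionedStore:
    def __init__(self):
        self.versions = {}  # key -> list of (commit_ts, value), kept sorted

    def commit_write(self, key, commit_ts, value):
        self.versions.setdefault(key, []).append((commit_ts, value))
        self.versions[key].sort()

    def snapshot_read(self, key, s_read):
        # Latest version whose commit timestamp is <= s_read.
        visible = [v for ts, v in self.versions.get(key, []) if ts <= s_read]
        return visible[-1] if visible else None

store = VersionedStore()
store.commit_write("x", 5, "v1")
store.commit_write("x", 12, "v2")
print(store.snapshot_read("x", 10))  # v1 -- the write at 12 is after s_read
print(store.snapshot_read("x", 12))  # v2 -- the write at 12 is visible
```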
Spanner's lease management is much more complicated than GFS's. Spanner's leader-lease management not only ensures the disjointness of leaders' lease intervals (at most one live leader at a time) but also the disjointness of timestamps across lease intervals. In GFS, all lease management goes through the single master.
Why this requirement: "As a result, they can be assigned timestamps at any time when all locks have been acquired, but before any locks have been released"?

t_abs(e^all-locks-acquired) <= s <= t_abs(e^any-lock-released)

because

t_abs(e^all-locks-acquired) <= t_abs(e^computing-s)
t_abs(e^computing-s) <= s
s < t_abs(e^commit)
t_abs(e^commit) <= t_abs(e^any-lock-released)

Without this condition, the following situation might occur: even though T1 can see T2's writes, T1's timestamp s1 is less than T2's timestamp s2.

      lk1   lk2              lk3
T1 ---|-----|--|------------|-----------
               s1

            lk3       commit
T2 ---------|-----|---|---------------
                  s2

Requirement for the commit timestamp: s >= TT.now().latest, computed after e_i^server. Actually, Spanner picks s = TT.now().latest after acquiring all locks, which satisfies the above requirement.
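A sketch of that ordering under a toy TrueTime model of my own (a fixed, assumed error bound around time.time()): the commit timestamp is picked only after all locks are held, and commit wait keeps the locks until s is guaranteed to be in the past.

```python
import time

EPSILON = 0.004  # assumed TrueTime error bound in seconds (made-up value)

def tt_now():
    """Toy TrueTime: absolute time lies within [earliest, latest]."""
    t = time.time()
    return (t - EPSILON, t + EPSILON)

def tt_after(ts):
    """True once ts is guaranteed to be in the past."""
    earliest, _ = tt_now()
    return earliest > ts

def commit(acquire_locks, apply_writes, release_locks):
    acquire_locks()                 # 2PL: all locks held from here on
    _, s = tt_now()                 # s = TT.now().latest, after all locks acquired
    apply_writes(s)                 # log the writes with commit timestamp s
    while not tt_after(s):          # commit wait: the commit is not visible
        time.sleep(EPSILON / 4)     # until s is guaranteed to be in the past
    release_locks()                 # only now release locks / ack the client
    return s

print("committed at", commit(lambda: None, lambda ts: None, lambda: None))
```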
tsafe = min(tsafePaxos, tsafeTM)
Imagine a Paxos leader with the following state. It cannot satisfy a read with t = 10,
since a commit may happen at 9.
Paxos write: 2 4 6 10
Prepared transaction timestamp: 9
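A small helper of my own (not Spanner code) that computes t_safe for this state and checks whether the read can be served:

```python
# t_safe = min(t_safe_paxos, t_safe_tm), where t_safe_tm is one less than the
# smallest prepared-but-not-committed transaction timestamp.
def t_safe(paxos_write_timestamps, prepared_timestamps):
    t_safe_paxos = max(paxos_write_timestamps)          # highest applied Paxos write
    t_safe_tm = (min(prepared_timestamps) - 1
                 if prepared_timestamps else float("inf"))
    return min(t_safe_paxos, t_safe_tm)

state_t_safe = t_safe([2, 4, 6, 10], [9])
print(state_t_safe)            # 8
print(10 <= state_t_safe)      # False: a read at t = 10 must wait
```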
Wound-wait prefers old transactions. Usually, old transactions have smaller ids, since ids are assigned in increasing order.
Read-write transactions happen in two phases: a read phase and a write phase.
Spanner automatically infers the read locks (see the first paragraph of 4.2.2).
After the read phase is finished, Spanner infers the write locks. The
difficult transaction in [Case] can be handled well by Spanner:
y = read(x); write(y)
Are there general transactions for which Spanner can't do the lock inference? It is possible that Spanner can't handle some transactions written in an extremely dynamic way. Imagine a transaction containing the following operations:

Write1
Read1
Write2

Read1 reads the result written by Write1, and Write2 depends on the result of Read1. Such a transaction can't simply be split into a read phase followed by a write phase. One example of Read1 is getting the latest value of an auto_increment column in MySQL. A sketch of this shape is given below.
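A minimal Python sketch of this shape, with made-up key names and a plain dict standing in for the database: the key touched by Write2 is only known after Read1, which in turn has to observe Write1.

```python
# A toy transaction with the shape discussed above (hypothetical key names).
# Read1 depends on Write1, and Write2's key depends on Read1's result, so the
# operations cannot simply be reordered into "all reads, then all writes".
def dynamic_txn(db):
    db["counter"] = db.get("counter", 0) + 1   # Write1
    n = db["counter"]                          # Read1: must see Write1's result
    db["row_%d" % n] = "payload"               # Write2: key depends on Read1

db = {}
dynamic_txn(db)
print(db)   # {'counter': 1, 'row_1': 'payload'}
```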
Spanner chooses the timestamp at the commit point as the timestamp for a transaction.
Spanner uses SS2PL (strong strict two-phase locking). Understand why the following sentence from 4.1.2 holds for read-write transactions.
Why "greater than any timestamps the leader has assigned to previous
transactions?"
The coordinator leader (CL) receives T1. e is 7ms when computing
s1 = TT.now().latest = 15ms at clock time 8ms. s1 is assigned to T1. Then CL
synchronizes its clock with a time master, and e drops to 1ms. CL receives T2
and computes s2 = TT.now().latest = 10ms at clock time 9ms. s2 is assigned to
T2. s1 > s2, which breaks "A single leader replica can trivially assign
timestamps in monotonically increasing order."
      1       8       15
T1 ---|-------+-------|-----------

               8 9 10
T2 ------------|-+-|---------------
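One way (my own fix-up, not something the paper spells out) for a leader to keep assigned timestamps monotonic despite TT.now().latest shrinking after a clock sync is to remember the last timestamp it handed out:

```python
# Toy leader clock: never assign a timestamp below the last one handed out,
# even if TT.now().latest goes backwards (timestamps are integer ms ticks here).
class LeaderClock:
    def __init__(self):
        self.last_assigned = 0

    def assign(self, tt_now_latest):
        s = max(tt_now_latest, self.last_assigned + 1)
        self.last_assigned = s
        return s

clock = LeaderClock()
print(clock.assign(15))  # s1 = 15 (e = 7ms at clock time 8ms)
print(clock.assign(10))  # 16, not 10: monotonicity is preserved
```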
The phrase "TT.now().latest at the time the coordinator received its commit message" refers to the e_i^server event.
As a result, they can be assigned timestamps at any time when all locks have been acquired, but before any locks have been released.
Define the instantaneous error bound as e. If the clock time is t, the absolute time is in [t-e, t+e]. The maximum difference between any two clocks is less than or equal to 2e.
the slowest clock
    |
|-------|
|<--2e->|
        |
        | absolute time
        |
        |<--2e->|
        |-------|
            |
            the fastest clock
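A numeric check of the 2e bound with made-up values: each clock stays within e of absolute time, so any two clocks differ by at most 2e.

```python
# Each clock reads within e of absolute time, so the slowest and fastest
# possible readings differ by at most 2e (values are made up).
e = 3.5                                # instantaneous error bound in ms
absolute_time = 100.0

slowest_clock = absolute_time - e      # reads 96.5
fastest_clock = absolute_time + e      # reads 103.5

assert fastest_clock - slowest_clock <= 2 * e
print(fastest_clock - slowest_clock)   # 7.0 == 2e
```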
Picking TT.now().latest ensures that the commit timestamp is an absolute time within the phase when all locks are held. If the local clock were chosen as the commit timestamp, external consistency would still be ensured. Imagine a system with 2e = 7ms; in such a system, the maximum difference between any two clocks is less than 7ms. Use Range A to denote the clock range when T1 picks its commit timestamp, and Range B to denote the clock range when T2 starts, after 7ms have elapsed. Since Range A and Range B do not overlap (if local clock drift is ignored), T2's commit timestamp is larger than T1's.
Range A
|-------|     Range B
        |     |-------|
        |     |
        v     v
==========================> absolute time
But such a commit timestamp is not an absolute time during the all-locks-held phase. Imagine a clock that is 3.5ms slower than absolute time:

clock time:     1    8
absolute time:  4.5  11.5

Picking the clock time at the beginning of the all-locks-held phase gives 1ms, which is not an absolute time during the all-locks-held phase (absolute times 4.5ms to 11.5ms).
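A numeric check of this example (values from the table above): the local-clock pick at the start of the all-locks-held phase falls outside the absolute-time window in which the locks are held, while TT.now().latest (assuming e = 3.5ms here) does not.

```python
# Clock runs 3.5ms behind absolute time; locks are held for clock times 1..8,
# i.e. absolute times 4.5..11.5.
clock_drift = 3.5
clock_start, clock_end = 1.0, 8.0
abs_start, abs_end = clock_start + clock_drift, clock_end + clock_drift

local_pick = clock_start                       # local-clock choice: 1ms
print(abs_start <= local_pick <= abs_end)      # False: 1 is outside [4.5, 11.5]

tt_latest_pick = clock_start + 3.5             # TT.now().latest with assumed e = 3.5ms
print(abs_start <= tt_latest_pick <= abs_end)  # True: 4.5 falls inside the window
```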
- Single-site RO transaction: use LastTS().
- Multi-site RO transaction: the client picks TT.now().latest.
- tsafeTM: a fine-grained mapping from key ranges to prepared-transaction timestamps. For the example in 4.1.3, if we know that the read does not conflict with the transaction whose prepared timestamp is 9, we can serve the read with tsafe = 10.
- LastTS(): a fine-grained mapping from key ranges to commit timestamps.
Imagine an RO transaction that involves key1, key2 and key3. The mapping has the following entries:
key1: 6
key2: 4
key3: 8
max(6, 4, 8) is 8. If there is no conflicting prepared transaction, the RO transaction can be assigned timestamp 8. But if there is a prepared transaction that involves any of key1, key2 and key3, this optimization is not feasible.
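A sketch of this check with a toy LastTS mapping (my own helper, hypothetical key names):

```python
# Assign a read-only transaction the max commit timestamp over the keys it
# touches, provided none of those keys has a prepared (uncommitted) transaction.
def ro_timestamp(keys, last_ts, prepared_keys):
    if any(k in prepared_keys for k in keys):
        return None          # conflicting prepared transaction: fall back
    return max(last_ts[k] for k in keys)

last_ts = {"key1": 6, "key2": 4, "key3": 8}
print(ro_timestamp(["key1", "key2", "key3"], last_ts, prepared_keys=set()))     # 8
print(ro_timestamp(["key1", "key2", "key3"], last_ts, prepared_keys={"key2"}))  # None
```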
- MinNextTS(). If MinNextTS(10) = 100, then tsafePaxos can be advanced to 99 since the next Paxos write will have a timestamp no less than 100.
I think that the value of MinNextTS() could be set in the following way. For any event after the read,

t_abs(read) < t_abs(any event after the read)

so

MinNextTS() = (TT.now().latest when the read happens) - local_clock_drift_error
            < TT.now().latest at any event after the read
            <= s_{a new commit}, the timestamp of any new commit

But this approach would advance s_max eagerly, which is not acceptable.
The paper says "As the number of replicas increases, the latency stays roughly constant with less standard deviation because Paxos executes in parallel at a group’s replicas.". Why the latency stays roughly constant? If only 1 replica is involved, no Paxos network latency. So as the number replica increases, the Paxos latency should increase. But since Paxos latency is lower-bounded by commit wait. And replicas in a Paxos group execute in parallel. So the latency stays roughly the same. Why less standard deviation? As the number of replicas increases, a Paxos write involves more machines. So the latency is decided by more machines stead of one. Why "As the number of replicas increases, the latency to achieve a quorum becomes less sensitive to slowness at one slave replica."? As the number of replicas increases, the number of concurrent failures n to fail Paxos increases. Assume that the probability of a single machine failure is p, then the probability to fail a Paxos is p^n. 6.824-2012 lab7 returns to a client request only if all replicas succeed. But Spanner only requires a quorum of replica successes.
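A small Monte-Carlo sketch of the quorum-latency argument, with a made-up per-replica latency distribution: the commit latency is the time of the majority ack, i.e. an order statistic over more independent samples as the group grows, so its mean stays roughly flat while its standard deviation shrinks.

```python
import random
import statistics

# Commit latency is modeled as the k-th smallest of n replica latencies,
# where k is the majority quorum size (the latency model is made up).
def quorum_latency(n, trials=20000):
    samples = []
    for _ in range(trials):
        lats = sorted(random.expovariate(1.0) for _ in range(n))
        samples.append(lats[n // 2])      # (n//2 + 1)-th ack completes the quorum
    return statistics.mean(samples), statistics.stdev(samples)

for n in (1, 3, 5):
    mean, stdev = quorum_latency(n)
    print(f"replicas={n}: mean={mean:.2f} stdev={stdev:.2f}")
```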
Why is the standard deviation even larger in read latencies than in write latencies? The reason is that a write involves more servers while a read involves fewer servers. What is the replication level of Colossus? The paper does not specify it, but I think the replication level is 1: F1 is already replicated 5 ways at the tablet level, and Colossus uses Reed–Solomon error correction.
- OSDI12
- DSRG Post
- 6.824 Notes
- 6.830 Notes
- PNUTS
- [The Case for Determinism in Database Systems](http://db.cs.yale.edu/determinism-vldb10.pdf)
- InfoQ Talk
- Spanner's Concurrency Control