- Feature Name: replica_batch
- Status: completed
- Start Date: 2015-08-11
- RFC PR: #2340
- Cockroach Issue:
# Summary

Assuming #1998 is in place, replace `roachpb.Request` by `roachpb.BatchRequest`
throughout most of the main execution path, starting in `Store.ExecuteCmd()`.
# Motivation

#1998 introduces gateway changes after which `(*Store).ExecuteCmd()` receives
only `BatchRequest`s. The changes described here allow a `BatchRequest` to be
submitted to Raft and executed in bulk, which should yield significant
performance improvements, in particular due to the smaller number of
Raft-related round-trip delays.
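
For orientation, the sketch below shows a simplified shape of the batch types
this RFC revolves around. The real definitions live in the protobuf-generated
`roachpb` package; the field and type names here are illustrative assumptions,
not the actual API.

```go
package sketch

// Request stands in for roachpb.Request: a single KV operation such as a
// Get, Put or Scan (illustration only).
type Request interface {
	// Method would distinguish reads from writes, admin commands, etc.
	Method() string
}

// BatchRequest stands in for roachpb.BatchRequest: an ordered bundle of
// requests that is verified, proposed to Raft and executed as one unit.
type BatchRequest struct {
	// Header fields (timestamp, range ID, transaction, ...) elided.
	Requests []Request // executed in Batch order
}
```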
# Detailed design

The required code changes are numerous. The sections below follow the main
request path and outline the necessary changes at each stage.
## Store

The `Store` carries out

- request verification and clock updates, which generalize in a relatively
  straightforward manner.
- the store retry loop and, in particular, write intent error handling. The
  handling depends on the request which ran into the intent (and on whether
  it is a read or a write), so `Replica` needs to return not only an error,
  but also, for example, the index of the offending request within the
  `Batch` (see the sketch after this list).
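
A minimal sketch of such an error type, assuming hypothetical names
(`batchError` and its fields); the actual error plumbing in `roachpb` may
differ.

```go
package sketch

import "fmt"

// batchError is a hypothetical wrapper pairing an error with the index of
// the request within the Batch that produced it, so the store retry loop
// can tell which element (and hence whether a read or a write) ran into
// the intent.
type batchError struct {
	Index int   // position of the failing request within the Batch
	Cause error // e.g. the write intent error itself
}

func (e *batchError) Error() string {
	return fmt.Sprintf("request %d failed: %v", e.Index, e.Cause)
}
```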
## Replica

Control flow currently splits up into read, write and admin paths. For
simplicity, we keep the admin path intact by allowing `Admin` commands only
as single elements of a `Batch`. For the read/write path, there are two
options:
- splitting the `Batch` into sub-batches which are completely read-only or
  write-only. This has the advantage of requiring fewer changes in the read
  and write paths, but requires multiple Raft proposals when reads and
  writes mix (in the worst case, `len(Batch)-1` of them). Having to bookkeep
  multiple Raft proposals for a single `Batch` is a disadvantage and raises
  questions about atomicity and response cache handling.
- keeping the `Batch` whole, but merging `(*Replica).add{ReadOnly,Write}Cmd`.
  The idea is that if we need to go through Raft anyway (i.e. if the `Batch`
  contains at least one write), we propose the whole `Batch` and satisfy the
  reads through Raft. If the `Batch` is read-only, it executes directly. It
  should be possible to refactor such that the code which executes reads is
  shared.
Overall, option two seems preferable. As a byproduct, it would make
`INCONSISTENT` reads consistent for free when they're part of a mutating
batch anyway, and (almost) implement `CONSENSUS` reads.
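
A rough sketch of option two's dispatch, under assumed names (`addCmd`,
`isReadOnly`, `proposeRaft`); it is meant to illustrate the control flow,
not the real `Replica` API.

```go
package sketch

// Stand-ins for the real types (illustration only).
type Request interface{ IsWrite() bool }
type Response interface{}

type BatchRequest struct {
	Requests []Request
}

// isReadOnly reports whether no request in the batch writes.
func (ba BatchRequest) isReadOnly() bool {
	for _, req := range ba.Requests {
		if req.IsWrite() {
			return false
		}
	}
	return true
}

type Replica struct{}

// addCmd sketches the merged entry point replacing add{ReadOnly,Write}Cmd:
// read-only batches execute directly; a batch containing any write is
// proposed to Raft as a whole, reads included.
func (r *Replica) addCmd(ba BatchRequest) ([]Response, error) {
	if ba.isReadOnly() {
		return r.executeBatch(ba) // local read path, no Raft round trip
	}
	return r.proposeRaft(ba) // assumed: proposes and waits for the apply
}

// Placeholders for the two execution paths.
func (r *Replica) executeBatch(ba BatchRequest) ([]Response, error) { return nil, nil }
func (r *Replica) proposeRaft(ba BatchRequest) ([]Response, error)  { return nil, nil }
```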
## Command queue

`(*Replica).{begin,end}Cmd` are changed to operate on a `Batch` (instead of
a `roachpb.RequestHeader`), obviating the `readOnly` flag (which is
determined from the request type). The entries are added to the command
queue in bulk so that overlaps are resolved gracefully: reading `[a,c)` and
then writing `b` should add `[a,b)` and `[b\x00,c)` for reading, and `b` for
writing. There is likely some potential for refactoring with
`intersectIntents()`.
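
As an illustration, here is a self-contained sketch (the type and helper are
hypothetical, not the actual command queue code) of subtracting a written key
from a previously added read span:

```go
package sketch

type span struct {
	start, end string // half-open interval [start, end)
}

// subtractKey removes a single key from a read span, yielding the spans
// that remain readable: [a,c) minus b becomes [a,b) and [b\x00,c).
func subtractKey(read span, key string) []span {
	if key < read.start || key >= read.end {
		return []span{read} // no overlap; nothing to split
	}
	var out []span
	if read.start < key {
		out = append(out, span{read.start, key})
	}
	if next := key + "\x00"; next < read.end {
		out = append(out, span{next, read.end})
	}
	return out
}
```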
## Timestamp cache

Timestamp cache handling is straightforward, except when commands within
the same `Batch` overlap: in that case, if the former is a read and the
latter a write, the latter command's timestamp must be moved past the
former's.

Note that there is some special-casing in the write timestamp cache for
`Transaction`s: transactional writes are still carried out even if they're
incompatible with prior writes' timestamps. This allows `Txn`s to write
over their own data, and to attempt to push in more cases.
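
A small sketch of the intra-batch rule, using a simplified stand-in for
`hlc.Timestamp` (the real type and its methods may differ):

```go
package sketch

// Timestamp is a simplified stand-in for hlc.Timestamp (illustration only).
type Timestamp struct {
	WallTime int64
	Logical  int32
}

func (t Timestamp) Less(s Timestamp) bool {
	return t.WallTime < s.WallTime ||
		(t.WallTime == s.WallTime && t.Logical < s.Logical)
}

// Next returns the smallest timestamp strictly greater than t.
func (t Timestamp) Next() Timestamp {
	return Timestamp{WallTime: t.WallTime, Logical: t.Logical + 1}
}

// bumpIntraBatch applies the rule above: a write overlapping an earlier
// read in the same batch must execute at a timestamp past the read's.
func bumpIntraBatch(readTS, writeTS Timestamp) Timestamp {
	if !readTS.Less(writeTS) {
		return readTS.Next()
	}
	return writeTS
}
```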
## proposeRaftCommand

No noteworthy changes.
## processRaftCommand

`roachpb.ResponseWithError` changes to `roachpb.ResponsesWithError`, which
also contains the index of the first error, if any (or, alternatively, by
convention the error occurred at index `len(rwe.Responses)`).
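
A minimal sketch of the widened type (field names assumed):

```go
package sketch

// Response stands in for roachpb.Response (illustration only).
type Response interface{}

// ResponsesWithError pairs the responses produced before any failure with
// the error itself. Following the convention above, a non-nil Err is taken
// to have occurred at index len(Responses).
type ResponsesWithError struct {
	Responses []Response
	Err       error
}
```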
## applyRaftCommand

Returns `[]roachpb.Response`, one for each successfully executed request
(in `Batch` order).
## applyRaftCommandInBatch

Same as `applyRaftCommand`. This actually unwinds the `Batch`, calling
`(*Replica).executeCmd` sequentially until done or until an error occurs.
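
A rough sketch of that unwinding loop, with hypothetical stand-ins for the
real replica types and a trivialized `executeCmd` signature:

```go
package sketch

// Minimal stand-ins for the real roachpb/replica types (illustration only).
type Request interface{}
type Response interface{}

type Replica struct{}

// executeCmd is assumed to execute a single request; the real method
// carries more arguments (engine, transaction state, ...).
func (r *Replica) executeCmd(req Request) (Response, error) {
	return nil, nil // placeholder
}

// applyBatchSketch unwinds the batch: execute requests in order, collect
// the responses, and stop at the first error. The failing index is implied
// by len(responses), matching the convention in processRaftCommand above.
func (r *Replica) applyBatchSketch(reqs []Request) ([]Response, error) {
	var resps []Response
	for _, req := range reqs {
		resp, err := r.executeCmd(req)
		if err != nil {
			return resps, err
		}
		resps = append(resps, resp)
	}
	return resps, nil
}
```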