forked from linkedin/goavro
-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Explain why schema evolution is not really supported by Avro 1.x.
- Loading branch information
Karrick S. McDermott
committed
Jun 13, 2019
1 parent
c81d420
commit 2b87e8a
Showing
1 changed file
with
84 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,84 @@ | ||
From the Avro specification: | ||
|
||
default: A default value for this field, used when reading instances | ||
that lack this field (optional). Permitted values depend on the | ||
field's schema type, according to the table below. Default values for | ||
union fields correspond to the first schema in the union. Default | ||
values for bytes and fixed fields are JSON strings, where Unicode code | ||
points 0-255 are mapped to unsigned 8-bit byte values 0-255. I read | ||
the above to mean that the purpose of default values are to allow | ||
reading Avro data that was written without the fields, and not | ||
necessarily augmentation of data being serialized. So in general I | ||
agree with you in terms of purpose. | ||
|
||
One very important aspect of Avro is that the schema used to serialize | ||
the data should always remain with the data, so that a reader would | ||
always be able to read the schema and then be able to consume the | ||
data. I think most people still agree so far. | ||
|
||
However, this is where things get messy. Schema evolution is | ||
frequently cited when folks want to use a new version of the schema to | ||
read data that was once written using an older version of that schema. | ||
I do not believe the Avro specification properly handles schema | ||
evolution. Here's a simple example: | ||
|
||
Record v0: | ||
name: string | ||
nickname: string, default: "" | ||
|
||
Record v1: | ||
name: string | ||
nickname: string, default: "" | ||
title: string, default: "" | ||
|
||
Okay, now a binary stream of records is just a bunch of strings. Let's | ||
do that now. | ||
|
||
0x0A, A, l, i, c, e, 0x06, B, o, b, 0x0A, B, r, u, c, e, 0x0A, S, a, l, l, y, 0x06, A, n, n | ||
|
||
How many records is that? It could be as many as 5 records, each of a | ||
single name and no nicknames. It could be as few as 2 records, one of | ||
them with a nickname and a title, and one with only a nickname, or a | ||
title. | ||
|
||
Now to drive home the nail that Avro schema evolution is broken, even | ||
if each record had a header that indicated how many bytes it would | ||
consume, we could know where one record began and ended, and how many | ||
records there are. But if we were to read a record with two strings | ||
in it, is the second string the nickname or the title? | ||
|
||
The Avro specification has no answer to that question, so neither do I. | ||
|
||
Effectively, Avro could be a great tool for serializing complex data, | ||
but it's broken in its current form, and to fix it would require it to | ||
break compatibility with itself, effectively rendering any binary data | ||
serialized in a previous version of Avro unreadable by new versions, | ||
unless it had some sort of version marker on the data so a library | ||
could branch. | ||
|
||
One great solution would be augmenting the binary encoding with a | ||
simple field number identifier. Let's imagine an Avro 2.x that had | ||
this feature, and would support schema evolution. Here's an example | ||
stream of bytes that could be unambiguously decoded using the new | ||
schema: | ||
|
||
0x02, 0x0A, A, l, i, c, e, 0x02, 0x06, B, o, B, 0x04, 0x0A, B, r, u, c, e, 0x02, 0x0C, C, h, a, r, l, i, e, 0x06, 0x04, M, r | ||
|
||
In the above example of my fake Avro 2.0, this can be | ||
deterministically decoded because 0x02 indicates the following is | ||
field number 1 (name), followed by string length 5, followed by | ||
Alice. | ||
|
||
Then the decoder would see 0x02, marking field number 1 again, | ||
which means, "next record", followed by string length 3, followed by | ||
Bob, followed by 0x04, which means field number 2 (nickname), followed | ||
by string length 5, followed by Bruce. | ||
|
||
Followed by field number 1 (next record), followed by string length 6, | ||
followed by Charlie, followed by field number 3 (title), followed by | ||
string length 2, followed by Mr. | ||
|
||
In my hypothetical version of Avro 2, Avro can cope with schema | ||
evolution using record defaults and such. Sadly, Avro 1.x cannot and | ||
thus we should avoid using it if your use-case requires schema | ||
evolution. |