You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
[edit] I've hijacked the "Refactor VortexFileWriter" issue to cover the broader work towards vortex layouts.
The description here is WIP
Vortex Layouts
Vortex Layouts should be seen as symmetrical to Vortex Arrays, except that the data is stored externally and they can be lazily queried with push-down pruning. They are fully independent of the storage system, requiring only a key(u32)/value(bytes) API.
A secondary benefit of separating the layout logic from storage, is that we remove all I/O from this code path allowing us to have much better control over when, where and how we interact with I/O (as well as removing all traces of async Rust from the layout logic).
Built-in Layouts
Our initial prototype ships with three layouts:
Flat - an atomic array (of any DType). The serialized form does not include the dtype (to prevent duplication).
Chunked - row-based partitioning of an array.
Struct - col-based partitioning of a struct array.
Additional layouts for the future:
Dictionary - pull out a shared dictionary across another child layout, e.g. store values as flat layout, store codes in a chunked layout to use one dictionary across all chunks of a column.
FlatStruct - same as Struct, except with flattened nested fields.
As with everything in Vortex, these will be extensible by consumers of the Rust library.
Layout Strategies
The power of layouts is in how we choose them. I think by default Vortex will ship with a StructOfChunks layout strategy that first splits into columns, then each column is independently chunked based on a target byte size (rather than row-count). This can be pretty useful when storing columns with large values. It means the returned batches are sized based on the data that's scanned, and doesn't suffer the write-time skew that e.g. Parquet might introduce with its "ChunksOfStruct" (row-groups of columns) strategy.
A LayoutStrategy is fed a series of ArrayData batches that represent a row-wise chunk of an array. Each strategy may manipulate, buffer, split, this batch however it likes and pass it to some child strategies, eventually culminating in a FlatStrategy that simply serializes the data. At the end of this process, the layout strategy returns the LayoutData (remember the analogy to ArrayData?).
Layout Scan
Scanning a layout is sort of analogous to an array compute function. It's probably the only "compute function" we support on layouts for a while (maybe we support sum/min/max etc?).
From a LayoutData, we construct a LayoutScan, and from this we construct a RangeScan per horizontal "split" of the file that we wish to read. This two-level approach is so for a single scan operator layouts can share some state across ranges, e.g. DictLayout would want to share the canonicalized values array.
The old issue:
Tracking a bunch of breaking changes to the format and refactoring of the writer
VortexFileWriter to pass chunks into an abstract LayoutWriter.
We ship a default LayoutWriter that is struct-of-chunked.
Impl LayoutWriter for struct, chunked, flat, and later dict layouts.
Layout flat buffer to hold BlockIds instead of offset/length. This allows Layout to be abstract over the storage format and not assume linear bytes (e.g. can store in an arbitrary block device).
Postscript to point to DType + FileLayout and not assume contiguous bytes.
Ensure we don't write extra padding / framing, e.g. we sometimes use IPCMessage and MessageWriter for writing to the file, even though this adds framing for a streaming format.
Allow configurable padding in the file to support mem-mapping with zero-copy.
Initial implementation of the new structure of vortex layouts per #1676
* Only flat layout works.
* I'm not 100% sure on the trait APIs, these will evolve as we pad out
the implementation.
* StructLayout will be worked on as part of #1782 so will probably come
last.
Up next:
* Implementation of ChunkedLayout
Open Questions:
* What is the API that e.g. Python users have to precisely configure
layout strategies? Can I override a layout writer for a specific field?
* Similarly, how can we configure the layout scanners? Can I configure a
level 0 chunked layout differently from level 2 in a
chunk-of-struct-of-chunk world?
[edit] I've hijacked the "Refactor VortexFileWriter" issue to cover the broader work towards vortex layouts.
The description here is WIP
Vortex Layouts
Vortex Layouts should be seen as symmetrical to Vortex Arrays, except that the data is stored externally and they can be lazily queried with push-down pruning. They are fully independent of the storage system, requiring only a key(u32)/value(bytes) API.
A secondary benefit of separating the layout logic from storage, is that we remove all I/O from this code path allowing us to have much better control over when, where and how we interact with I/O (as well as removing all traces of async Rust from the layout logic).
Built-in Layouts
Our initial prototype ships with three layouts:
Additional layouts for the future:
As with everything in Vortex, these will be extensible by consumers of the Rust library.
Layout Strategies
The power of layouts is in how we choose them. I think by default Vortex will ship with a
StructOfChunks
layout strategy that first splits into columns, then each column is independently chunked based on a target byte size (rather than row-count). This can be pretty useful when storing columns with large values. It means the returned batches are sized based on the data that's scanned, and doesn't suffer the write-time skew that e.g. Parquet might introduce with its "ChunksOfStruct" (row-groups of columns) strategy.A
LayoutStrategy
is fed a series ofArrayData
batches that represent a row-wise chunk of an array. Each strategy may manipulate, buffer, split, this batch however it likes and pass it to some child strategies, eventually culminating in a FlatStrategy that simply serializes the data. At the end of this process, the layout strategy returns theLayoutData
(remember the analogy toArrayData
?).Layout Scan
Scanning a layout is sort of analogous to an array compute function. It's probably the only "compute function" we support on layouts for a while (maybe we support sum/min/max etc?).
From a
LayoutData
, we construct aLayoutScan
, and from this we construct aRangeScan
per horizontal "split" of the file that we wish to read. This two-level approach is so for a single scan operator layouts can share some state across ranges, e.g. DictLayout would want to share the canonicalized values array.The old issue:
Tracking a bunch of breaking changes to the format and refactoring of the writer
The text was updated successfully, but these errors were encountered: