Audit serde of array metadata #396

Open
2 of 21 tasks
gatesn opened this issue Jun 20, 2024 · 6 comments

@gatesn
Contributor

gatesn commented Jun 20, 2024

We currently implement naive serde using Rust serde + flexbuffers by default.
Many arrays can pack their metadata much more tightly.
This is an overview issue to track auditing each one:

  • Bool
  • Chunked
  • Constant
  • Datetime
  • Extension
  • Primitive
  • Sparse
  • Struct
  • VarBin
  • VarBinView
  • ALP
  • Datetime Parts
  • Dict
  • BitPacking
  • FoR
  • FSST
  • Delta
  • RoaringInt
  • RoaringBool
  • RunEnd
  • ZigZag
@lwwmanning lwwmanning changed the title Fix Array Metadata Audit Array Metadata SerDe Jun 20, 2024
@lwwmanning lwwmanning changed the title Audit Array Metadata SerDe Audit serde of array metadata Jun 20, 2024
@danking danking self-assigned this Sep 20, 2024
@danking
Member

danking commented Sep 20, 2024

I'm gonna try making the Validity metadata for Structs much smaller.

@danking
Member

danking commented Sep 20, 2024

We might eventually want to squeeze all metadata into 32 bits. We can reserve 0xffffffff to indicate that the metadata has spilled into a buffer.
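
A minimal sketch of the sentinel idea, assuming a single `u32` metadata slot per array. The names `SPILLED`, `MetadataSlot`, `encode_inline`, and `decode_slot` are hypothetical and only illustrate the proposal; they are not existing Vortex APIs.

```rust
/// Reserved word meaning "the metadata did not fit inline; read it from a buffer".
/// (Hypothetical constant, taken from the 0xffffffff suggestion above.)
const SPILLED: u32 = 0xffff_ffff;

/// Hypothetical decoded view of the 32-bit metadata slot.
#[derive(Debug, PartialEq, Eq)]
enum MetadataSlot {
    /// The metadata is fully contained in these 32 bits.
    Inline(u32),
    /// The metadata lives in a separate buffer.
    Spilled,
}

/// Encode inline metadata. Returns `None` if the bit pattern collides with the
/// sentinel, in which case the caller would have to spill to a buffer instead.
fn encode_inline(bits: u32) -> Option<u32> {
    if bits == SPILLED {
        None
    } else {
        Some(bits)
    }
}

/// Interpret a stored 32-bit word.
fn decode_slot(word: u32) -> MetadataSlot {
    if word == SPILLED {
        MetadataSlot::Spilled
    } else {
        MetadataSlot::Inline(word)
    }
}
```

The cost of the reservation is one unusable bit pattern out of 2^32, so any encoding whose packed metadata leaves at least one high bit unused can never collide with it.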

@robert3005
Member

I think we can spare 64 bits per encoding

@gatesn
Contributor Author

gatesn commented Sep 21, 2024

For most arrays, validity metadata is just a single bit for whether or not a validity child is defined.
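
As a rough illustration, that single bit could live in a flags byte. The constant `HAS_VALIDITY_CHILD` and both helper functions are hypothetical names for this sketch, and the choice of bit 0 is an assumption.

```rust
/// Hypothetical flag: bit 0 of a flags byte records whether a validity child exists.
const HAS_VALIDITY_CHILD: u8 = 0b0000_0001;

/// Pack the "validity child is defined" fact into a flags byte.
fn pack_validity(has_child: bool) -> u8 {
    if has_child {
        HAS_VALIDITY_CHILD
    } else {
        0
    }
}

/// Read the flag back out of a flags byte, ignoring any other bits.
fn has_validity_child(flags: u8) -> bool {
    (flags & HAS_VALIDITY_CHILD) != 0
}
```

The remaining seven bits of the byte stay free for other per-encoding flags.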

@danking
Member

danking commented Sep 30, 2024

RunEnd

remove length, dtype => ptype (it has to be an int).

pub struct RunEndMetadata {
    validity: ValidityMetadata,
    ends_dtype: DType,
    num_runs: usize,
    offset: usize,
    length: usize,
}
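
To illustrate why dtype => ptype saves space when the type has to be an integer: eight integer types fit in a 3-bit tag. This is only a sketch; `IntPType`, its variant numbering, and its methods are hypothetical and do not reflect how Vortex's `PType` is actually defined or serialized.

```rust
/// Hypothetical integer-only primitive type tag. Eight variants fit in 3 bits.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
enum IntPType {
    U8 = 0,
    U16 = 1,
    U32 = 2,
    U64 = 3,
    I8 = 4,
    I16 = 5,
    I32 = 6,
    I64 = 7,
}

impl IntPType {
    /// The compact tag to store in metadata.
    fn to_tag(self) -> u8 {
        self as u8
    }

    /// Recover the type from a stored tag; `None` for out-of-range tags.
    fn from_tag(tag: u8) -> Option<Self> {
        Some(match tag {
            0 => Self::U8,
            1 => Self::U16,
            2 => Self::U32,
            3 => Self::U64,
            4 => Self::I8,
            5 => Self::I16,
            6 => Self::I32,
            7 => Self::I64,
            _ => return None,
        })
    }

    /// Width in bytes; with this numbering the low two tag bits are log2(width).
    fn byte_width(self) -> usize {
        1usize << (self as u8 & 0b11)
    }
}
```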

ALP

pub struct ALPMetadata {
    exponents: Exponents,
    encoded_dtype: DType,
    patches_dtype: Option<DType>,
}

RunEndBool

remove length, dtype => ptype.

pub struct RunEndBoolMetadata {
    start: bool,
    validity: ValidityMetadata,
    ends_dtype: DType,
    num_runs: usize,
    offset: usize,
    length: usize,
}

RoaringInt

pub struct RoaringIntMetadata {
    ptype: PType,
}

FoR

Scalar => ScalarValue, use self.dtype(). Buffer, BufferString, List should go into the Array buffer.

pub struct FoRMetadata {
    reference: Scalar,
    shift: u8,
}

Dict

DType => PType

pub struct DictMetadata {
    codes_dtype: DType,
    values_len: usize,
}

DateTimeParts

DType => PType.

pub struct DateTimePartsMetadata {
    days_dtype: DType,
    seconds_dtype: DType,
    subseconds_dtype: DType,
}

FSST

DType => PType.

pub struct FSSTMetadata {
    symbols_len: usize,
    codes_dtype: DType,
    uncompressed_lengths_dtype: DType,
}

Null

remove len.

pub struct NullMetadata {
    len: usize,
}

Primitive

pub struct PrimitiveMetadata {
    validity: ValidityMetadata,
}

VarBin

DType => PType

pub struct VarBinMetadata {
    validity: ValidityMetadata,
    offsets_dtype: DType,
    bytes_len: usize,
}

Delta

pub struct DeltaMetadata {
    validity: ValidityMetadata,
    deltas_len: usize,
    offset: usize, // must be <1024
}

RoaringBool

Remove length

pub struct RoaringBoolMetadata {
    length: usize,
}

BitPacked

Remove length.

pub struct BitPackedMetadata {
    validity: ValidityMetadata,
    bit_width: usize,
    offset: usize, // Known to be <1024
    length: usize, // Store end padding instead <1024
    has_patches: bool,
}
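
With the bounds noted in the comments, these fields fit comfortably in 32 bits. The sketch below assumes bit_width is at most 64 (7 bits), offset and end padding are each below 1024 (10 bits each), and has_patches takes 1 bit, for 28 bits in total. The struct `PackedBitPacked`, the bit layout, and the `pack`/`unpack` functions are hypothetical, and validity is left out for brevity.

```rust
/// Hypothetical slimmed-down BitPacked metadata, storing end padding instead of length.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct PackedBitPacked {
    bit_width: u8,    // assumed 0..=64, needs 7 bits
    offset: u16,      // < 1024, needs 10 bits
    end_padding: u16, // < 1024, needs 10 bits
    has_patches: bool,
}

/// Pack into a u32 using an assumed layout:
/// bits 0..7 bit_width, bits 7..17 offset, bits 17..27 end_padding, bit 27 has_patches.
/// Returns `None` if any field is out of range.
fn pack(m: PackedBitPacked) -> Option<u32> {
    if m.bit_width > 64 || m.offset >= 1024 || m.end_padding >= 1024 {
        return None;
    }
    Some(
        u32::from(m.bit_width)
            | (u32::from(m.offset) << 7)
            | (u32::from(m.end_padding) << 17)
            | (u32::from(m.has_patches) << 27),
    )
}

/// Inverse of `pack` for words produced by it.
fn unpack(word: u32) -> PackedBitPacked {
    PackedBitPacked {
        bit_width: (word & 0x7f) as u8,
        offset: ((word >> 7) & 0x3ff) as u16,
        end_padding: ((word >> 17) & 0x3ff) as u16,
        has_patches: ((word >> 27) & 1) == 1,
    }
}
```

Because the top four bits are always zero in this layout, a packed word can never equal 0xffffffff, so it stays compatible with reserving that value as a spill marker.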

ByteBool

pub struct ByteBoolMetadata {
    validity: ValidityMetadata,
}

ZigZag

pub struct ZigZagMetadata;

Extension

DType => PType

pub struct ExtensionMetadata {
    storage_dtype: DType,
}

Struct

Remove length.

pub struct StructMetadata {
    length: usize,
    validity: ValidityMetadata,
}

Chunked

pub struct ChunkedMetadata {
    num_chunks: usize,
}

Sparse

remove len, DType => PType, Scalar => ScalarValue.

pub struct SparseMetadata {
    indices_dtype: DType,
    // Offset value for patch indices as a result of slicing
    indices_offset: usize,
    indices_len: usize,
    len: usize,
    fill_value: Scalar,
}

Constant

Scalar => ScalarValue, remove length.

pub struct ConstantMetadata {
    scalar: Scalar,
    length: usize,
}

Bool

Remove length.

pub struct BoolMetadata {
    validity: ValidityMetadata,
    length: usize,
    bit_offset: usize,
}

VarBinView

pub struct VarBinViewMetadata {
    validity: ValidityMetadata,
    data_lens: Vec<usize>,
}
