This repository has been archived by the owner on May 10, 2024. It is now read-only.

PARQUET-979: Limit size of min, max or disable stats for long binary types #465

Closed
wants to merge 4 commits into from

majetideepak

No description provided.

@majetideepak majetideepak changed the title PARQUET-979: [C++] Limit size of min, max or disable stats for long binary types PARQUET-979: Limit size of min, max or disable stats for long binary types May 17, 2018
@@ -33,6 +33,8 @@

namespace parquet {

static constexpr int MAX_STATS_SIZE = 4096; // limit stats to 4k
Member

Is this defined by the Parquet standard or arbitrarily chosen?

Author

I followed parquet-mr.
https://github.com/apache/parquet-mr/blob/0d55abd05b0e5027c18e60d1ac3b22998dd00951/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L88
I don't see this specified in the spec. We should probably add it there. I will open a JIRA to do this.

Member

The alternative would be to use this as a sensible default and add it as an option to WriterProperties.

Author

I added it to ColumnProperties since it gives more flexibility.
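
The outcome of this thread (a 4096-byte default that each column can override) could be sketched roughly as below. This is only an illustration: `ColumnStatisticsOptions`, `EncodedMinMax`, and `ShouldWriteMinMax` are hypothetical names, not the parquet-cpp API, and checking min and max individually against the limit is an assumption about how the limit is applied.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-ins for illustration; not the actual parquet-cpp types.
// Default mirrors the 4k limit from the diff above.
constexpr std::size_t kDefaultMaxStatisticsSize = 4096;

// Per-column settings, so each column can override the default limit.
struct ColumnStatisticsOptions {
  bool statistics_enabled = true;
  std::size_t max_statistics_size = kDefaultMaxStatisticsSize;
};

// Encoded min and max values as raw bytes.
struct EncodedMinMax {
  std::string min;
  std::string max;
};

// Returns true if min/max should be written for this column.
// Assumption: each value is compared against the limit on its own.
inline bool ShouldWriteMinMax(const ColumnStatisticsOptions& options,
                              const EncodedMinMax& stats) {
  if (!options.statistics_enabled) return false;
  return stats.min.size() <= options.max_statistics_size &&
         stats.max.size() <= options.max_statistics_size;
}
```

With the defaults, a 5000-byte minimum would cause min/max to be skipped for that column, while raising `max_statistics_size` for that column alone would keep them.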

Member

@xhochy xhochy left a comment

In one place I would like to see statistics instead of stats for better code readability; otherwise this is fine.

For the member variables that you have suffixed with _, I think that this is correct, but in the future we should make them private and expose them through const accessors.
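
A minimal sketch of the private-members-with-const-accessors pattern suggested here, assuming a simplified class. `ColumnPropertiesSketch` is a hypothetical name and does not reflect the real `ColumnProperties` layout; the defaults are assumptions taken from the diff context.

```cpp
#include <cstddef>

// Hypothetical class for illustration only; not the parquet-cpp definition.
class ColumnPropertiesSketch {
 public:
  ColumnPropertiesSketch() = default;
  ColumnPropertiesSketch(bool statistics_enabled, std::size_t max_statistics_size)
      : statistics_enabled_(statistics_enabled),
        max_statistics_size_(max_statistics_size) {}

  // Const accessors: callers can read the settings but not modify them.
  bool statistics_enabled() const { return statistics_enabled_; }
  std::size_t max_statistics_size() const { return max_statistics_size_; }

 private:
  // Trailing underscore marks private data members.
  bool statistics_enabled_ = true;
  std::size_t max_statistics_size_ = 4096;  // assumed default, per the 4k limit
};
```

Call sites would then read `column_properties(path).statistics_enabled()` rather than touching the member directly.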

return column_properties(path).statistics_enabled_;
}

size_t max_stats_size(const std::shared_ptr<schema::ColumnPath>& path) const {
Member

Please use max_statistics_size

Member

@xhochy xhochy left a comment

+1, LGTM
