-
Notifications
You must be signed in to change notification settings - Fork 192
PARQUET-979: Limit size of min, max or disable stats for long binary types #465
Conversation
src/parquet/column_writer.cc
Outdated
@@ -33,6 +33,8 @@ | |||
|
|||
namespace parquet { | |||
|
|||
static constexpr int MAX_STATS_SIZE = 4096; // limit stats to 4k |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this defined by the Parquet standard or arbitrarily chosen?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I followed parquet-mr.
https://github.com/apache/parquet-mr/blob/0d55abd05b0e5027c18e60d1ac3b22998dd00951/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L88
I don't see this specified in the spec. We should probably add it there. I will open a JIRA to do this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The alternative would be to use this as sensible default and add it as an option to WriterProperties
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added it to ColumnProperties
since it gives more flexibility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At one place I would like to see statistics
instead of stats
for a better code readability, otherwise this is fine.
For the member variables that you have suffixed with _
, I think that this is correct but we should actually make them private and call const accessors on them in future.
src/parquet/properties.h
Outdated
return column_properties(path).statistics_enabled_; | ||
} | ||
|
||
size_t max_stats_size(const std::shared_ptr<schema::ColumnPath>& path) const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use max_statistics_size
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM
No description provided.