It is printing the size of the dictionary (in terms of the number of keys) twice, and labeling the second occurrence the 'compressed byte count'. An accurate value for that number would be very helpful when accounting for disk space usage. The actual compressed byte count is calculated nearby, so I am guessing this is a simple mistake.
see:
https://github.com/apache/incubator-parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/ColumnChunkPageWriteStore.java#L152
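For reference, the two quantities being conflated live on the DictionaryPage object itself. The following is a minimal sketch against the public DictionaryPage API (using the pre-Apache-rename parquet.* package names that the linked file uses); the entry count, byte size, and class name are placeholder values for illustration, not output from the writer at the linked line.

```java
import parquet.bytes.BytesInput;
import parquet.column.Encoding;
import parquet.column.page.DictionaryPage;

public class DictionaryCounts {
    public static void main(String[] args) throws Exception {
        // A dictionary page holding 100 entries encoded into 1234 bytes
        // (placeholder values; a real page comes from the dictionary encoder).
        DictionaryPage page = new DictionaryPage(
                BytesInput.from(new byte[1234]), 100, Encoding.PLAIN_DICTIONARY);

        // Two different quantities: the number of keys vs. the bytes written.
        System.out.println(page.getDictionarySize());   // 100  -> number of keys
        System.out.println(page.getUncompressedSize()); // 1234 -> byte count relevant to disk usage
    }
}
```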
So compressedLength is compressedPageSize + headerSize. The header is only a few bytes, and compressedPageSize is just dictionaryPage.getBytes().size().
DictionaryPage.getUncompressedSize returns bytes.size(), which is the same thing as compressedPageSize.
So really there's almost no difference between the compressed and uncompressed size for the dictionary – there is no special compression applied. And while you are right that there's a typo, we should print out dictionaryPage.getUncompressedSize() instead!
Now, this does raise the question of why we aren't feeding the dictionary page through a compressor like Snappy or LZO, or whatever codec the user set for this particular Parquet file.
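To make those size relationships concrete, here is a small sketch under the same assumptions as above (pre-rename parquet.* packages, placeholder sizes); the header size is a hypothetical constant, since the real value comes from serializing the page header, and this is not the writer's actual code.

```java
import parquet.bytes.BytesInput;
import parquet.column.Encoding;
import parquet.column.page.DictionaryPage;

public class DictionaryPageSizes {
    public static void main(String[] args) throws Exception {
        DictionaryPage dictionaryPage = new DictionaryPage(
                BytesInput.from(new byte[1234]), 100, Encoding.PLAIN_DICTIONARY);

        // No codec is applied to the dictionary page, so the "compressed"
        // page size and the uncompressed size are the same number.
        long compressedPageSize = dictionaryPage.getBytes().size();     // 1234
        long uncompressedSize   = dictionaryPage.getUncompressedSize(); // 1234

        // compressedLength only adds the serialized page header on top,
        // which is just a handful of bytes (hypothetical constant here).
        long headerSize = 14;
        long compressedLength = compressedPageSize + headerSize;

        System.out.println("uncompressed size: " + uncompressedSize);
        System.out.println("compressedLength:  " + compressedLength);
    }
}
```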
Ian Barfield:
That doesn't sound consistent with the large differences I saw while stepping through the code to observe the actual byte counts. I'll have to take another look and get back to you.
Reporter: Ian Barfield
Note: This issue was originally created as PARQUET-71. Please see the migration documentation for further details.