
column chunk page write store log message displays incorrect information #1399

Open · asfimport opened this issue Aug 13, 2014 · 3 comments


The log message prints the size of the dictionary (the number of keys) twice, labeling the second occurrence the 'compressed byte count'. An accurate value there would be very helpful for accounting for disk space usage. The actual compressed byte count is calculated at a point near there, so I am guessing this is a simple mistake.

see:
https://github.com/apache/incubator-parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/ColumnChunkPageWriteStore.java#L152
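
For illustration, the pattern being reported is roughly the following; this is a paraphrase of the linked line, not the exact source:

    // Paraphrase of the reported bug (not the exact source): the number of
    // dictionary entries is logged twice, and the second occurrence is
    // mislabeled as the compressed byte count.
    LOG.debug("written dictionary page: " + dictionaryPage.getDictionarySize()
        + " values, " + dictionaryPage.getDictionarySize() + " compressed bytes");

    // An accurate message would report the page's byte size instead, e.g.:
    LOG.debug("written dictionary page: " + dictionaryPage.getDictionarySize()
        + " values, " + (int) dictionaryPage.getBytes().size() + " compressed bytes");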

Reporter: Ian Barfield

Note: This issue was originally created as PARQUET-71. Please see the migration documentation for further details.


Dmitriy V. Ryaboy / @dvryaboy:
Here is the "writeDictionaryPage" code from ParquetFileWriter:

  /**
   * writes a dictionary page
   * @param dictionaryPage the dictionary page
   */
  public void writeDictionaryPage(DictionaryPage dictionaryPage) throws IOException {
    state = state.write();
    if (DEBUG) LOG.debug(out.getPos() + ": write dictionary page: " + dictionaryPage.getDictionarySize() + " values");
    currentChunkDictionaryPageOffset = out.getPos();
    int uncompressedSize = dictionaryPage.getUncompressedSize();
    int compressedPageSize = (int)dictionaryPage.getBytes().size(); // TODO: fix casts
    metadataConverter.writeDictionaryPageHeader(
        uncompressedSize,
        compressedPageSize,
        dictionaryPage.getDictionarySize(),
        dictionaryPage.getEncoding(),
        out);
    long headerSize = out.getPos() - currentChunkDictionaryPageOffset;
    this.uncompressedLength += uncompressedSize + headerSize;
    this.compressedLength += compressedPageSize + headerSize;
    if (DEBUG) LOG.debug(out.getPos() + ": write dictionary page content " + compressedPageSize);
    dictionaryPage.getBytes().writeAllTo(out);
    currentEncodings.add(dictionaryPage.getEncoding());
  }

So compressedLength is compressedPageSize + headerSize. The header is only a few bytes. compressedPageSize is just dictionaryPage.getBytes().size().

DictionaryPage.getUncompressedSize returns bytes.size(), which is the same thing as compressedPageSize.
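
For reference, the DictionaryPage constructor chain makes this explicit; this is a paraphrased sketch, not a verbatim copy of parquet-column:

    // Paraphrased sketch of the DictionaryPage constructors: the short form
    // defaults uncompressedSize to bytes.size(), so for a page that was never
    // compressed, getUncompressedSize() == getBytes().size().
    public DictionaryPage(BytesInput bytes, int dictionarySize, Encoding encoding) {
      this(bytes, (int) bytes.size(), dictionarySize, encoding);
    }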

So really there's almost no difference between the compressed and uncompressed size of the dictionary: no special compression is applied. You are right that there's a typo, but we should print dictionaryPage.getUncompressedSize() instead!

Now, this does raise the question of why we aren't feeding the dictionary through a compressor like Snappy or LZO, or whatever the user set as the compression for this particular Parquet file.
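
A minimal sketch of what that could look like, reusing the BytesCompressor abstraction parquet-hadoop already applies to data pages (illustrative only; `compressor` and `writer` are assumed to be in scope, this is not an actual patch):

    // Hypothetical: run the dictionary bytes through the column's configured
    // codec (Snappy, LZO, ...) before handing the page to the file writer.
    BytesInput raw = dictionaryPage.getBytes();
    BytesInput compressed = compressor.compress(raw);
    DictionaryPage compressedPage = new DictionaryPage(
        BytesInput.copy(compressed),   // copy: the compressor may reuse its buffers
        (int) raw.size(),              // the true uncompressed size, now meaningful
        dictionaryPage.getDictionarySize(),
        dictionaryPage.getEncoding());
    writer.writeDictionaryPage(compressedPage);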


Ian Barfield:
That doesn't sound consistent with the large differences I saw while stepping through the code to observe the actual byte counts. I'll have to take another look and get back to you.


Julien Le Dem / @julienledem:
Thanks for reporting.
The log message is incorrect (pull request welcome), but the metadata written to the file is correct:
https://github.com/apache/incubator-parquet-mr/blob/7a105068e60b7dd6e9f28dd0ccdb9b696a9bc941/parquet-hadoop/src/main/java/parquet/hadoop/ColumnChunkPageWriteStore.java#L152
Those values are used to allocate buffers of the right size when reading, so they have to be accurate.
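
To see why, here is a hypothetical reader-side fragment (loosely modeled on parquet-hadoop; the header getters follow the parquet-format Thrift PageHeader, everything else is assumed):

    // Buffer sizes come straight from the page header metadata, which is why
    // those fields must be accurate even when a debug log line is not.
    byte[] buf = new byte[pageHeader.getCompressed_page_size()];
    in.readFully(buf);
    BytesInput page = decompressor.decompress(
        BytesInput.from(buf), pageHeader.getUncompressed_page_size());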
