
column chunk page write store log message displays incorrect information #1399

Open · asfimport opened this issue Aug 13, 2014 · 3 comments


The log message prints the size of the dictionary (the number of keys) twice, labeling the second occurrence the 'compressed byte count'. An accurate value there would be very helpful for accounting for disk space usage. The actual compressed byte count is calculated at a point near there, so I am guessing this is a simple mistake.

see:
https://github.com/apache/incubator-parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/ColumnChunkPageWriteStore.java#L152
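
For illustration, the pattern being reported is roughly the following; this is a paraphrase of the linked line, not the exact source:

    // Paraphrase of the reported bug (not the exact source): the number of
    // dictionary entries is logged twice, and the second occurrence is
    // mislabeled as the compressed byte count.
    LOG.debug("written dictionary page: " + dictionaryPage.getDictionarySize()
        + " values, " + dictionaryPage.getDictionarySize() + " compressed bytes");

    // An accurate message would report the page's byte size instead, e.g.:
    LOG.debug("written dictionary page: " + dictionaryPage.getDictionarySize()
        + " values, " + (int) dictionaryPage.getBytes().size() + " compressed bytes");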

Reporter: Ian Barfield

Note: This issue was originally created as PARQUET-71. Please see the migration documentation for further details.


Dmitriy V. Ryaboy / @dvryaboy:
Here is the "writeDictionaryPage" code from ParquetFileWriter:

  /**
   * writes a dictionary page
   * @param dictionaryPage the dictionary page
   */
  public void writeDictionaryPage(DictionaryPage dictionaryPage) throws IOException {
    state = state.write();
    if (DEBUG) LOG.debug(out.getPos() + ": write dictionary page: " + dictionaryPage.getDictionarySize() + " values");
    currentChunkDictionaryPageOffset = out.getPos();
    int uncompressedSize = dictionaryPage.getUncompressedSize();
    int compressedPageSize = (int)dictionaryPage.getBytes().size(); // TODO: fix casts
    metadataConverter.writeDictionaryPageHeader(
        uncompressedSize,
        compressedPageSize,
        dictionaryPage.getDictionarySize(),
        dictionaryPage.getEncoding(),
        out);
    long headerSize = out.getPos() - currentChunkDictionaryPageOffset;
    this.uncompressedLength += uncompressedSize + headerSize;
    this.compressedLength += compressedPageSize + headerSize;
    if (DEBUG) LOG.debug(out.getPos() + ": write dictionary page content " + compressedPageSize);
    dictionaryPage.getBytes().writeAllTo(out);
    currentEncodings.add(dictionaryPage.getEncoding());
  }

So compressedLength is compressedPageSize + headerSize. The header is only a few bytes. compressedPageSize is just dictionaryPage.getBytes().size().

DictionaryPage.getUncompressedSize returns bytes.size(), which is the same thing as compressedPageSize.
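
For reference, the DictionaryPage constructor chain makes this explicit; this is a paraphrased sketch, not a verbatim copy of parquet-column:

    // Paraphrased sketch of the DictionaryPage constructors: the short form
    // defaults uncompressedSize to bytes.size(), so for a page that was never
    // compressed, getUncompressedSize() == getBytes().size().
    public DictionaryPage(BytesInput bytes, int dictionarySize, Encoding encoding) {
      this(bytes, (int) bytes.size(), dictionarySize, encoding);
    }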

So really there's almost no difference between the compressed and uncompressed size of the dictionary: no special compression is applied. You are right that there's a typo, but we should print dictionaryPage.getUncompressedSize() instead!

Now, this does raise the question of why we aren't feeding the dictionary through a compressor like Snappy or LZO, or whatever the user set as the compression for this particular Parquet file.
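
A minimal sketch of what that could look like, reusing the BytesCompressor abstraction parquet-hadoop already applies to data pages (illustrative only; `compressor` and `writer` are assumed to be in scope, this is not an actual patch):

    // Hypothetical: run the dictionary bytes through the column's configured
    // codec (Snappy, LZO, ...) before handing the page to the file writer.
    BytesInput raw = dictionaryPage.getBytes();
    BytesInput compressed = compressor.compress(raw);
    DictionaryPage compressedPage = new DictionaryPage(
        BytesInput.copy(compressed),   // copy: the compressor may reuse its buffers
        (int) raw.size(),              // the true uncompressed size, now meaningful
        dictionaryPage.getDictionarySize(),
        dictionaryPage.getEncoding());
    writer.writeDictionaryPage(compressedPage);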


Ian Barfield:
That doesn't sound consistent with the large differences I saw while stepping through the code to observe the actual byte counts. I'll have to take another look and get back to you.


Julien Le Dem / @julienledem:
Thanks for reporting.
The log message is incorrect (pull request welcome), but the metadata written to the file is correct:
https://github.com/apache/incubator-parquet-mr/blob/7a105068e60b7dd6e9f28dd0ccdb9b696a9bc941/parquet-hadoop/src/main/java/parquet/hadoop/ColumnChunkPageWriteStore.java#L152
Those values are used to allocate buffers of the right size when reading, so they have to be accurate.
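
To see why, here is a hypothetical reader-side fragment (loosely modeled on parquet-hadoop; the header getters follow the parquet-format Thrift PageHeader, everything else is assumed):

    // Buffer sizes come straight from the page header metadata, which is why
    // those fields must be accurate even when a debug log line is not.
    byte[] buf = new byte[pageHeader.getCompressed_page_size()];
    in.readFully(buf);
    BytesInput page = decompressor.decompress(
        BytesInput.from(buf), pageHeader.getUncompressed_page_size());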
