Is your feature request related to a problem or challenge?
We're using ListingTable with an object_store. Sometimes our input dataset contains empty Parquet files (literally empty, as in 0 bytes long). Our Spark-based code paths succeed in "reading" those files (they simply skip them), but DataFusion fails hard:
ParquetError(EOF("file size of 0 is less than footer"))
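For context, here is roughly how we trigger it (a minimal sketch, not our actual code: the `./data/` path, the `empty.parquet` file it contains, and the table name are made up for illustration):

```rust
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();

    // `./data/` stands in for our object_store prefix; it holds valid
    // Parquet files plus a 0-byte `empty.parquet`.
    ctx.register_parquet("t", "./data/", ParquetReadOptions::default())
        .await?;

    // Somewhere along the way (schema inference or the actual scan)
    // DataFusion tries to read the Parquet footer of the empty file and
    // fails with ParquetError(EOF("file size of 0 is less than footer")).
    ctx.sql("SELECT count(*) FROM t").await?.show().await?;

    Ok(())
}
```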
This error is presumably thrown by https://github.com/apache/arrow-rs/blob/06a015770098a569b67855dfaa18bdfa7c18ff92/parquet/src/file/metadata/reader.rs#L543.
I think a possible fix/improvement would be to filter out empty files, for example in
datafusion/core/src/datasource/listing/helpers.rs (line 173 at commit 28e4c64) or
datafusion/core/src/datasource/listing/url.rs (line 265 at commit 28e4c64),
with something like the sketch below:
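This is a rough sketch rather than a concrete patch; `skip_empty_files` is a made-up helper name, and the stream types just mirror what an object_store listing hands back:

```rust
use futures::stream::{BoxStream, StreamExt};
use object_store::{Error, ObjectMeta};

/// Drop zero-byte objects from a listing stream so they never reach the
/// Parquet footer reader. Listing errors are passed through untouched.
fn skip_empty_files(
    files: BoxStream<'static, Result<ObjectMeta, Error>>,
) -> BoxStream<'static, Result<ObjectMeta, Error>> {
    files
        .filter(|res| {
            let keep = match res {
                // Keep only objects that actually contain bytes.
                Ok(meta) => meta.size > 0,
                // Let errors surface exactly as they do today.
                Err(_) => true,
            };
            futures::future::ready(keep)
        })
        .boxed()
}
```

Doing the check at listing time means the Parquet reader never even sees the empty object, and it would slot naturally into either of the two call sites above.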
This would align with Spark: https://github.com/apache/spark/blob/b2c8b3069ef4f5288a5964af0da6f6b23a769e6b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala#L82C9-L82C23
Thoughts? If this isn't something we want upstream, I can fork ListingTable internally, and I'm also open to other ideas 😄
Describe the solution you'd like
No response
Describe alternatives you've considered
No response
Additional context
No response
> We're using ListingTable with an object_store. Sometimes our input dataset contains empty Parquet files (literally empty, as in 0 bytes long). Our Spark-based code paths succeed in "reading" those files (they simply skip them)

I don't know whether the Parquet spec allows such files, but I've seen Hive doing this trick (for ORC and probably Parquet as well). IIRC, the easiest way to get such files is to have a bucketed table and force Hive to create all buckets while providing data only for some of them. I don't remember whether other engines create such files too.

TL;DR: yes, I believe we should skip those, not fail.