Is your feature request related to a problem or challenge?
We're using ListingTable with an object_store. Sometimes our input dataset contains empty Parquet files (literally empty, as in 0 bytes long). Our Spark-based code paths succeed in "reading" those files (they simply skip them), but DataFusion fails hard:
ParquetError(EOF("file size of 0 is less than footer"))
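For context, here is roughly how we trigger it (a minimal sketch, not our actual code: the `./data/` path, the `empty.parquet` file it contains, and the table name are made up for illustration):

```rust
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();

    // `./data/` stands in for our object_store prefix; it holds valid
    // Parquet files plus a 0-byte `empty.parquet`.
    ctx.register_parquet("t", "./data/", ParquetReadOptions::default())
        .await?;

    // Somewhere along the way (schema inference or the actual scan)
    // DataFusion tries to read the Parquet footer of the empty file and
    // fails with ParquetError(EOF("file size of 0 is less than footer")).
    ctx.sql("SELECT count(*) FROM t").await?.show().await?;

    Ok(())
}
```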
This error is presumably thrown by https://github.com/apache/arrow-rs/blob/06a015770098a569b67855dfaa18bdfa7c18ff92/parquet/src/file/metadata/reader.rs#L543.
I think a possible fix/improvement would be to filter out empty files, for example in
datafusion/core/src/datasource/listing/helpers.rs (line 173 at commit 28e4c64) or
datafusion/core/src/datasource/listing/url.rs (line 265 at commit 28e4c64),
with something like the sketch below:
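This is a rough sketch rather than a concrete patch; `skip_empty_files` is a made-up helper name, and the stream types just mirror what an object_store listing hands back:

```rust
use futures::stream::{BoxStream, StreamExt};
use object_store::{Error, ObjectMeta};

/// Drop zero-byte objects from a listing stream so they never reach the
/// Parquet footer reader. Listing errors are passed through untouched.
fn skip_empty_files(
    files: BoxStream<'static, Result<ObjectMeta, Error>>,
) -> BoxStream<'static, Result<ObjectMeta, Error>> {
    files
        .filter(|res| {
            let keep = match res {
                // Keep only objects that actually contain bytes.
                Ok(meta) => meta.size > 0,
                // Let errors surface exactly as they do today.
                Err(_) => true,
            };
            futures::future::ready(keep)
        })
        .boxed()
}
```

Doing the check at listing time means the Parquet reader never even sees the empty object, and it would slot naturally into either of the two call sites above.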
This would align with Spark: https://github.com/apache/spark/blob/b2c8b3069ef4f5288a5964af0da6f6b23a769e6b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala#L82C9-L82C23
Thoughts? If this isn't something we want upstream, I can fork ListingTable internally, and I'm also open to other ideas 😄
Describe the solution you'd like
No response
Describe alternatives you've considered
No response
Additional context
No response
> We're using ListingTable with an object_store. Sometimes our input dataset contains empty Parquet files (literally empty, as in 0 bytes long). Our Spark-based code paths succeed in "reading" those files (they simply skip them)

I don't know whether the Parquet spec allows such files, but I've seen Hive doing this trick (for ORC and probably Parquet as well). IIRC, the easiest way to get such files is to have a bucketed table and force Hive to create all buckets while providing data only for some of them. I don't remember whether other engines create such files too.

TL;DR: yes, I believe we should skip those, not fail.