
Ignore empty (parquet) files when using ListingTable #13737

Closed
Blizzara opened this issue Dec 11, 2024 · 2 comments · Fixed by #13750
Labels
enhancement New feature or request

Comments

@Blizzara
Contributor

Is your feature request related to a problem or challenge?

We're using ListingTable with an object_store. Sometimes our input dataset contains empty Parquet files (literally empty, i.e. 0 bytes in length). Our Spark-based codepaths succeed in "reading" those files by skipping them, but DataFusion fails hard:

ParquetError(EOF("file size of 0 is less than footer"))

This error is presumably thrown by https://github.com/apache/arrow-rs/blob/06a015770098a569b67855dfaa18bdfa7c18ff92/parquet/src/file/metadata/reader.rs#L543.

I think a possible fix/improvement would be to filter out empty files, for example where the listing results are produced:

let result = store.list_with_delimiter(prefix).await?;

and

.map_err(DataFusionError::ObjectStore)

with something like

.try_filter(|object_meta| object_meta.size > 0)

This would align with Spark: https://github.com/apache/spark/blob/b2c8b3069ef4f5288a5964af0da6f6b23a769e6b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala#L82C9-L82C23
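To illustrate the proposed behavior, here is a minimal, self-contained sketch of the idea: drop zero-byte objects from a listing before handing them to a reader, analogous to adding `.try_filter(|meta| meta.size > 0)` to the file-discovery stream. The `ObjectMeta` struct below is a hypothetical stand-in for `object_store::ObjectMeta`, not the actual DataFusion code.

```rust
// Hypothetical stand-in for object_store::ObjectMeta, reduced to the
// fields relevant here.
#[derive(Debug, Clone, PartialEq)]
struct ObjectMeta {
    location: String,
    size: usize,
}

// Drop 0-byte files so downstream readers never see a Parquet file that
// is too small to contain a footer.
fn skip_empty_files(objects: Vec<ObjectMeta>) -> Vec<ObjectMeta> {
    objects.into_iter().filter(|meta| meta.size > 0).collect()
}

fn main() {
    let listing = vec![
        ObjectMeta { location: "data/part-0.parquet".into(), size: 1024 },
        ObjectMeta { location: "data/part-1.parquet".into(), size: 0 },
    ];
    let usable = skip_empty_files(listing);
    assert_eq!(usable.len(), 1);
    assert_eq!(usable[0].location, "data/part-0.parquet");
    println!("{} usable file(s)", usable.len());
}
```

In the real code the listing is an async stream rather than a `Vec`, so the same predicate would be applied via `TryStreamExt::try_filter` as suggested above.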

Thoughts? Alternatively, I can fork ListingTable internally if this isn't something we want upstream, and I'm also open to other ideas! 😄

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

@Blizzara Blizzara added the enhancement New feature or request label Dec 11, 2024
@findepi
Member

findepi commented Dec 12, 2024

We're using ListingTable with an object_store. Sometimes our input dataset may contain empty parquet files (like literally empty as in being 0 bytes in length). Our spark-based codepaths succeed in "reading" those files (skipping them),

I don't know whether the Parquet spec allows such files, but I've seen Hive doing this trick (for ORC, and probably Parquet as well). IIRC, the easiest way to get such files is to have a bucketed table and force Hive to create all buckets while providing data for only some of them. I don't remember whether other engines create such files too.

TL;DR yes, I believe we should skip those, not fail.

@Blizzara
Contributor Author

Thanks @findepi , I filed a PR here: #13750!
