Support source/sink for plain Parquet/ORC/Avro Tables #166

anoopj · 2023-11-03T00:03:34Z

Supporting plain Parquet/ORC/Avro (partitioned as well as unpartitioned) may be useful for "upgrading" legacy data to table formats. Sink may be useful for exporting a specific snapshot for interoperability reasons.

This feature is lower priority, as Iceberg, Delta, etc. have native support for metadata-only conversions and offer Spark procedures.

the-other-tim-brown · 2023-11-04T22:05:40Z

@anoopj what would the metadata look like for a sink export?

I like the idea of a generic bootstrap so that users could take existing data and try out all 3 formats if they want to do some testing with other tools.

anoopj · 2023-11-06T16:39:47Z

> @anoopj what would the metadata look like for a sink export?

Sink could be based on manifest files in SymlinkTextInputFormat. BigQuery also now supports manifest files.
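To make the manifest idea concrete, here is a minimal sketch, assuming the layout commonly used for symlink manifests: a plain text file listing one data file URI per line, with one manifest per partition directory. The function name `write_symlink_manifests` and its parameters are hypothetical and not part of XTable; the list of data files for the exported snapshot is assumed to be already known.

```python
from collections import defaultdict
from pathlib import Path, PurePosixPath


def write_symlink_manifests(table_root, data_files, manifest_root):
    """Write one text manifest per partition directory (hypothetical helper).

    Each manifest lists the data file URIs of that partition, one per line,
    which is the shape a SymlinkTextInputFormat-style reader expects.
    """
    root = table_root.rstrip("/")
    groups = defaultdict(list)
    for uri in data_files:
        if not uri.startswith(root + "/"):
            raise ValueError(f"{uri} is not under {root}")
        relative = uri[len(root) + 1:]
        # "." for files directly under the table root (unpartitioned table)
        partition_dir = str(PurePosixPath(relative).parent)
        groups[partition_dir].append(uri)

    written = []
    for partition_dir, uris in sorted(groups.items()):
        target = Path(manifest_root) / partition_dir / "manifest"
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("\n".join(sorted(uris)) + "\n")
        written.append(target)
    return written
```

Since the manifest only points at existing files, an export like this would be metadata-only: no data files are copied or rewritten.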

> I like the idea of a generic bootstrap so that users could take existing data and try out all 3 formats if they want to do some testing with other tools.

Yes, bootstrap is probably higher priority than sink.

the-other-tim-brown · 2023-12-20T03:47:03Z

@jackwener any interest in looking into something like this?

marqub · 2024-04-30T14:36:59Z

@the-other-tim-brown I'm trying to find a good first issue to ramp up on XTable. Can I take a look at this one? Perhaps we can split it into separate issues. An initial task could be, for example, adding support for Parquet as an input data format. I'm not sure what the code looks like, but ultimately we can create something modular enough to extend to Avro or other formats later, if that hasn't been done already. I would be interested in discussing possible approaches to filling in the partitioning and statistics info...
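As one possible starting point for the partitioning info, here is a minimal sketch that derives partition values from file paths. It assumes Hive-style `key=value` directory names, which this thread does not confirm as the intended layout, and the helper name `partition_values` is hypothetical. Column statistics are not covered; those would more likely come from the Parquet file footers.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote

# Directory name Hive uses for rows whose partition value is null.
HIVE_DEFAULT_PARTITION = "__HIVE_DEFAULT_PARTITION__"


def partition_values(table_root, file_path):
    """Return {column: value} parsed from Hive-style directories (hypothetical helper)."""
    # relative_to raises ValueError if the file is not under the table root.
    relative = PurePosixPath(file_path).relative_to(PurePosixPath(table_root))
    values = {}
    for segment in relative.parent.parts:
        key, sep, raw = segment.partition("=")
        if not sep:
            raise ValueError(f"not a key=value partition directory: {segment}")
        # Hive percent-encodes special characters in partition values.
        values[key] = None if raw == HIVE_DEFAULT_PARTITION else unquote(raw)
    return values
```

For example, `partition_values("/data/tbl", "/data/tbl/year=2024/month=04/part-0.parquet")` returns `{"year": "2024", "month": "04"}`, and a file directly under the table root yields an empty dict. Values come back as strings, so a real implementation would still need to infer or be told the partition column types.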

However, just to check that I understand the scenario correctly: if today I wanted to bootstrap two different systems, Hudi and Iceberg, with existing Parquet files, couldn't I use the native capabilities of either system for the initial import, and then use the current XTable to generate the metadata files for the other system?

the-other-tim-brown · 2024-05-01T05:35:09Z

> @the-other-tim-brown I'm trying to find a good first issue to ramp up on XTable. Can I take a look at this one? Perhaps we can split it into separate issues. An initial task could be, for example, adding support for Parquet as an input data format. I'm not sure what the code looks like, but ultimately we can create something modular enough to extend to Avro or other formats later, if that hasn't been done already. I would be interested in discussing possible approaches to filling in the partitioning and statistics info...

I think it makes sense to start with just one of the file formats like Parquet. We can discuss how to get the info you would need.

> However, just to check that I understand the scenario correctly: if today I wanted to bootstrap two different systems, Hudi and Iceberg, with existing Parquet files, couldn't I use the native capabilities of either system for the initial import, and then use the current XTable to generate the metadata files for the other system?

Yes, you could do that as well.

There is another issue I had my eye on that I could guide you through as well if you are interested: #411

marqub · 2024-05-02T08:02:14Z

> I think it makes sense to start with just one of the file formats like Parquet. We can discuss how to get the info you would need.

> > However, just to check that I understand the scenario correctly: if today I wanted to bootstrap two different systems, Hudi and Iceberg, with existing Parquet files, couldn't I use the native capabilities of either system for the initial import, and then use the current XTable to generate the metadata files for the other system?

> Yes, you could do that as well.

OK, if you agree that we want to move away from this workaround, then I think supporting Parquet is a good first issue for me to ease the learning curve.

> There is another issue I had my eye on that I could guide you through as well if you are interested: #411

OK, this one could be a good next step, but for now I prefer to limit the amount of novelty.

I should have some time to start on the Parquet issue next week.
How do you prefer to communicate? Is there a Slack channel?

the-other-tim-brown · 2024-05-04T01:10:50Z

@marqub we do not have a Slack set up for the project yet. I can shoot you an email to connect and discuss any of the details in the meantime.

Reactor11 · 2024-10-10T05:32:21Z

Hi, is someone working on this? I am new to this project and would like to get started.

the-other-tim-brown added the "enhancement" (New feature or request) and "good first issue" (Good for newcomers) labels · Nov 6, 2023
