
Output dataset in WebDataset format #29

Open

galv opened this issue Jun 23, 2021 · 1 comment
galv commented Jun 23, 2021

The WebDataset format is preferable for data distribution for a few reasons:

  • Easy to use without installing dependencies, because it's just .tar files (see the sketch after this list).
  • Natively supports sharding: each shard is a single .tar file.
  • Battle-tested in NeMo. Several published results were trained on datasets stored in WebDataset format.
  • Not as crazy as TFRecord: no funkiness with serializing your half-precision float data as 8-bit chars.
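For concreteness, here is a minimal sketch of a WebDataset shard written with Python's built-in tarfile module. The key/extension naming convention is how WebDataset groups files into samples (files sharing a basename form one sample); the sample names and payload bytes below are made up:

```python
import io
import tarfile

def write_webdataset_shard(shard_path, samples):
    """Write (key, {extension: bytes}) samples as one WebDataset shard.

    WebDataset groups a sample's files by a shared basename ("key");
    the extension tells the loader which field each file holds.
    """
    with tarfile.open(shard_path, "w") as tar:
        for key, fields in samples:
            for extension, payload in fields.items():
                info = tarfile.TarInfo(name=f"{key}.{extension}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))

# Hypothetical sample: one utterance with audio bytes and its transcript.
write_webdataset_shard(
    "shard-000000.tar",
    [("utt_000000", {"flac": b"...audio bytes...", "txt": b"hello world"})],
)
```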

This is all great, but there's a pickle: Spark has no built-in support for reading or writing this format!

A straightforward way to get around this sort of issue is to "repartition()" or "coalesce()" the dataframe so that each partition is a reasonable size (we probably want 2-4 GiB per tar file). Then we can call foreachPartition to convert each partition to a tar file in Python: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.foreachPartition.html We could then save each file to remote storage using tf.io.gfile, which doesn't stall the way gcsfuse does. It also isn't clear to me from the documentation whether I can use pandas_udfs with foreachPartition (although the existing pickle serialization format may be fine for this use case).
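A hedged sketch of that plan, assuming a dataframe with hypothetical key, audio, and transcript columns and a hypothetical gs://my-bucket output path. Since foreachPartition doesn't hand us a partition index, the sketch names each shard with a UUID, and it uses tarfile's streaming "w|" mode so we can write straight into a tf.io.gfile handle without buffering the whole shard in memory:

```python
import io
import tarfile
import uuid

import tensorflow as tf  # only used for tf.io.gfile
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical input: one row per utterance.
df = spark.read.parquet("gs://my-bucket/dataset.parquet")

NUM_SHARDS = 512  # pick so each shard lands around 2-4 GiB

def write_partition_as_shard(rows):
    """Turn one Spark partition into one WebDataset .tar shard on GCS."""
    shard_path = f"gs://my-bucket/shards/shard-{uuid.uuid4().hex}.tar"
    with tf.io.gfile.GFile(shard_path, "wb") as f:
        # "w|" is tarfile's non-seeking stream mode, so we can write
        # directly into the remote file object.
        with tarfile.open(fileobj=f, mode="w|") as tar:
            for row in rows:
                for extension, payload in (
                    ("flac", bytes(row.audio)),
                    ("txt", row.transcript.encode("utf-8")),
                ):
                    info = tarfile.TarInfo(name=f"{row.key}.{extension}")
                    info.size = len(payload)
                    tar.addfile(info, io.BytesIO(payload))

df.repartition(NUM_SHARDS).foreachPartition(write_partition_as_shard)
```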

I'm not 100% certain that will work, though. We may run out of memory converting data from Spark's UnsafeRow format to Arrow format for Python. One may wonder why I don't propose writing the foreachPartition function in Java. I would like to, but Java doesn't have good support for the tar file format for some reason. All I can find is this: https://github.com/kamranzafar/jtar Python, meanwhile, has built-in support: https://docs.python.org/3/library/tarfile.html

Maybe the tar file format is simple enough to write a parser and serializer for from scratch in Java or Scala anyway, but I doubt it: the tarfile implementation in Python is about 2,500 lines long.

The other alternative is to create a new "DataSource" for tar files in Spark. Since Spark is commonly used for machine learning, support for the WebDataset format seems like something we might want to package as a publicly distributed Spark plugin and contribute back to the community.


galv commented Jun 23, 2021

Okay, it looks like the Apache Commons Compress library supports the tar file format, so I would prefer to use that and go the Java route. This actually seems pretty reasonable to do. The binary jar is only 632KB in total (!): http://commons.apache.org/proper/commons-compress/javadocs/api-release/index.html
