This script is an example of preparing a WebDataset for an image/video + text dataset using distributed processing with the Cosmos Tokenizer. It processes each sample by generating a continuous image/video latent with the Cosmos video tokenizer and a T5 embedding from the text caption, then stores the processed data in a WebDataset-compatible format.
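A minimal sketch of that per-sample flow is below. The encoder functions (`encode_video_latent`, `embed_caption`) and the dummy `samples` list are illustrative stand-ins, not the script's actual API; only the `webdataset.ShardWriter` usage reflects a real library call.

```python
import io

import torch
import webdataset as wds


def encode_video_latent(video: torch.Tensor) -> torch.Tensor:
    # Stand-in for the Cosmos video tokenizer, which maps pixel-space
    # video to a continuous latent tensor.
    return torch.randn(16, 8, 8)


def embed_caption(caption: str) -> torch.Tensor:
    # Stand-in for the T5 text encoder.
    return torch.randn(512, 1024)


def tensor_bytes(t: torch.Tensor) -> bytes:
    # Serialize a tensor so it can be stored as a .pth entry in the tar shard.
    buf = io.BytesIO()
    torch.save(t, buf)
    return buf.getvalue()


# Dummy (video, caption) pairs standing in for the real dataset.
samples = [(torch.randn(8, 3, 64, 64), "a cat playing piano")]

# Write WebDataset shards; each tar holds up to `maxcount` samples.
with wds.ShardWriter("shard-%06d.tar", maxcount=1000) as sink:
    for idx, (video, caption) in enumerate(samples):
        sink.write(
            {
                "__key__": f"{idx:08d}",
                "video_latent.pth": tensor_bytes(encode_video_latent(video)),
                "t5_embedding.pth": tensor_bytes(embed_caption(caption)),
                "caption.txt": caption,
            }
        )
```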
- Dependencies:
  - Please use the latest NeMo dev container: `nvcr.io/nvidia/nemo:dev`.
  - You may also need to install `jammy` and `mediapy`, depending on your dev container version.
- Data:
  - The script uses an example dataset that comes in parquet format. To use a custom dataset, you will need to write a custom `process_func` and create a new factory recipe that uses your new `process_func` (see the sketch after this list).
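As a rough illustration of that customization, here is a hedged sketch of a `process_func` for a parquet source. The signature and return contract expected by `prepare_energon_dataset.py` are not documented here, so the row-to-dict mapping, the column names (`id`, `video`, `caption`), and the placeholder encoders are all assumptions for illustration.

```python
import pandas as pd
import torch


def cosmos_encode(video) -> torch.Tensor:
    # Placeholder for the Cosmos tokenizer's continuous latent encoder.
    return torch.randn(16, 8, 8)


def t5_encode(caption: str) -> torch.Tensor:
    # Placeholder for the T5 embedding step.
    return torch.randn(512, 1024)


def my_process_func(row: pd.Series) -> dict:
    # Map one parquet row to a WebDataset-style sample dict. The column
    # names below are hypothetical; rename them to match your schema.
    return {
        "__key__": str(row["id"]),
        "video_latent.pth": cosmos_encode(row["video"]),
        "t5_embedding.pth": t5_encode(row["caption"]),
        "caption.txt": row["caption"],
    }


# Tiny in-memory stand-in for a real file read via pd.read_parquet(...).
df = pd.DataFrame(
    {
        "id": [0],
        "video": [torch.randn(8, 3, 64, 64).numpy()],
        "caption": ["a cat playing piano"],
    }
)
sample = my_process_func(df.iloc[0])
```

You would then register a function like this in a new factory recipe so each distributed worker applies it per sample.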
1. Set up your environment: pull and launch the NeMo dev container to run your script.
2. Customize the cache path: set the T5 cache directory in the script via the `t5_cache_dir` variable (a hedged example follows the launch command below).
3. Run the script: to run the script on 8 GPUs, use the following command:
```bash
torchrun --nproc_per_node=8 nemo/collections/diffusion/data/prepare_energon_dataset.py
```
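For step 2, the cache path is a plain assignment inside the script; the path below is illustrative only.

```python
# Illustrative only: pick any writable path visible to all ranks so the
# T5 weights are downloaded once and reused across processes.
t5_cache_dir = "/workspace/cache/t5"
```

`torchrun` launches one process per GPU, so adjust `--nproc_per_node` to match the number of GPUs on your node.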