A tiny video generation model with a microdiffusion-type architecture.
Dataset used => tensorkelechi/tiny_webvid_latents (a small test set; a larger one will be processed once more compute is available)
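
A minimal sketch of streaming the test latents from the Hugging Face Hub with the `datasets` library; the split name and column layout are assumptions here, so inspect a sample before wiring it into a dataloader.

```python
# Minimal sketch: stream the test latent dataset from the Hugging Face Hub.
# The "train" split and column names are assumptions, not a confirmed schema.
from datasets import load_dataset

ds = load_dataset("tensorkelechi/tiny_webvid_latents", split="train", streaming=True)

sample = next(iter(ds))
print(sample.keys())  # check the available columns before assuming a latent shape
```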
References:
- MicroDiffusion => the research paper ("Stretching Each Dollar: Diffusion Training from Scratch on a Micro-Budget") that introduces MicroDiT.
- SwayStar123/microdiffusion => unofficial but working PyTorch MicroDiT implementation for images.
- mochi-1 and HunyuanVideo => video-specific components adapted from their codebases (rough sketch below).
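
To illustrate the kind of video-specific adaptation involved, here is a rough sketch of extending an image-style patch embed to video latents with a 3D convolution. The channel count, patch size, and class name are illustrative assumptions, not this repo's actual configuration.

```python
# Sketch: a DiT-style patch embed extended from images (Conv2d over H, W)
# to video latents (Conv3d over frames, H, W). Shapes are illustrative only.
import torch
import torch.nn as nn

class VideoPatchEmbed(nn.Module):
    def __init__(self, in_channels=4, embed_dim=384, patch_size=(1, 2, 2)):
        super().__init__()
        # Conv3d both patchifies and projects latent patches to transformer tokens.
        self.proj = nn.Conv3d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):  # x: (batch, channels, frames, height, width)
        x = self.proj(x)                     # (B, D, F', H', W')
        return x.flatten(2).transpose(1, 2)  # (B, F'*H'*W', D) token sequence

latents = torch.randn(2, 4, 8, 32, 32)  # dummy VAE-latent video batch
tokens = VideoPatchEmbed()(latents)
print(tokens.shape)  # torch.Size([2, 2048, 384])
```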