Stars
NeurIPS 2024 Paper: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
The official implementation of the paper "ColorFlow: Retrieval-Augmented Image Sequence Colorization"
A replication of DeepSeek-R1-Zero and DeepSeek-R1 training on small models with limited data
[ICLR 2025] Reconstructive Visual Instruction Tuning
Code and models for NExT-GPT: Any-to-Any Multimodal Large Language Model
Official PyTorch and Diffusers Implementation of "LinFusion: 1 GPU, 1 Minute, 16K Image"
VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model
A fork to add multimodal model training to open-r1
Janus-Series: Unified Multimodal Understanding and Generation Models
[ICLR'25] ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation
[NeurIPS 2024] Generalizable and Animatable Gaussian Head Avatar
[ArXiv 2024] X-Dyna: Expressive Dynamic Human Image Animation
PyTorch implementation of Transfusion, "Predict the Next Token and Diffuse Images with One Multi-Modal Model", from MetaAI
FastVideo is a lightweight framework for accelerating large video diffusion models.
PyTorch implementation of MAR+DiffLoss https://arxiv.org/abs/2406.11838
VideoLLM-online: Online Video Large Language Model for Streaming Video (CVPR 2024)
Magic Mirror: ID-Preserved Video Generation in Video Diffusion Transformers
Blending Custom Photos with Video Diffusion Transformers
An 8-step inversion and 8-step editing process that works effectively with the FLUX-dev model (3x speedup, with results comparable or even superior to baseline methods).
Cosmos is a world model development platform that consists of world foundation models, tokenizers and video processing pipeline to accelerate the development of Physical AI at Robotics & AV labs. C…
[Survey] Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey
20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.
StreamDiffusion: A Pipeline-Level Solution for Real-Time Interactive Generation
Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Diffusion Transformer Networks
Hallo2: Long-Duration and High-Resolution Audio-driven Portrait Image Animation