Stars
Everything you need to build state-of-the-art foundation models, end-to-end.
Align Anything: Training All-modality Model with Feedback
GUI Odyssey is a comprehensive dataset for training and evaluating cross-app navigation agents. GUI Odyssey consists of 7,735 episodes from 6 mobile devices, spanning 6 types of cross-app tasks, 20…
MambaOut: Do We Really Need Mamba for Vision?
✨✨VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks
verl: Volcano Engine Reinforcement Learning for LLMs
Official inference repo for FLUX.1 models
FastVideo is a lightweight framework for accelerating large video diffusion models.
Tile primitives for speedy kernels
fanshiqing / grouped_gemm
Forked from tgale96/grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
[NeurIPS 2024 Best Paper][GPT beats diffusion🔥] [scaling laws in visual generation📈] Official impl. of "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction". An *ult…
The paper collections for the autoregressive models in vision.
PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
[ICML 2024 Spotlight] FiT: Flexible Vision Transformer for Diffusion Model
nnScaler: Compiling DNN models for Parallel Training
Open-source multimodal large language model that can hear and talk while thinking, featuring real-time end-to-end speech input and streaming audio output for conversation.
✨✨Latest Advances on Multimodal Large Language Models
Get up and running with Llama 3.3, DeepSeek-R1, Phi-4, Gemma 2, and other large language models.
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilizatio…
The image prompt adapter is designed to enable a pretrained text-to-image diffusion model to generate images from an image prompt.
Efficient Triton Kernels for LLM Training
SGLang is a fast serving framework for large language models and vision language models.
🚀 Efficient implementations of state-of-the-art linear attention models in Torch and Triton