Stars
Audiocraft is a library for audio processing and generation with deep learning. It features the state-of-the-art EnCodec audio compressor / tokenizer, along with MusicGen, a simple and controllable…
AAAI 2025: Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model
YuE: Open Full-song Music Generation Foundation Model, something similar to Suno.ai but open
Autoregressive Model Beats Diffusion: 🦙 Llama for Scalable Image Generation
Janus-Series: Unified Multimodal Understanding and Generation Models
✨✨Latest Advances on Multimodal Large Language Models
A curated list of balanced multimodal learning methods.
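This is a project for training a large language model from scratch, including pretraining, fine-tuning, and direct preference optimization (DPO). The model has 1B parameters and supports Chinese and English.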
📖 This is a repository for organizing papers, codes and other resources related to unified multimodal models.
Use PEFT or Full-parameter to CPT/SFT/DPO/GRPO 500+ LLMs (Qwen2.5, Llama4, InternLM3, GLM4, Mistral, Yi1.5, DeepSeek-R1, ...) and 200+ MLLMs (Qwen2.5-VL, Qwen2.5-Omni, Qwen2-Audio, Ovis2, InternVL3…
Qwen2.5-Omni is an end-to-end multimodal model by Qwen team at Alibaba Cloud, capable of understanding text, audio, vision, video, and performing real-time speech generation.
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
Fully open reproduction of DeepSeek-R1
Video-R1: Reinforcing Video Reasoning in MLLMs [🔥the first paper to explore R1 for video]
Solve Visual Understanding with Reinforced VLMs
Official implementation of the paper "OED: Towards One-stage End-to-End Dynamic Scene Graph Generation".
A video database bridging human actions and human-object relationships
This is my attempt to create a Self-Correcting-LLM based on the paper "Training Language Models to Self-Correct via Reinforcement Learning" by Google
Research Code for "ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL"
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
PyTorch implementation of Transfusion, "Predict the Next Token and Diffuse Images with One Multi-Modal Model", from Meta AI
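Dive into Deep Learning (《动手学深度学习》): written for Chinese readers, runnable, and open to discussion. The Chinese and English editions are used for teaching at more than 500 universities in over 70 countries.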