- The University of Hong Kong
- China
- https://scholar.google.com/citations?hl=zh-CN&user=1euA66EAAAAJ&view_op=list_works&sortby=pubdate
Stars
State-of-the-art Image & Video CLIP, Multimodal Large Language Models, and More!
MAGI-1: Autoregressive Video Generation at Scale
Kimi-VL: Mixture-of-Experts Vision-Language Model for Multimodal Reasoning, Long-Context Understanding, and Strong Agent Capabilities
EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework based on veRL
This repository provides a valuable reference for researchers in the field of multimodality; start your exploration of RL-based reasoning MLLMs here!
verl: Volcano Engine Reinforcement Learning for LLMs
A Unified Tokenizer for Visual Generation and Understanding
Video Generation Foundation Models: https://saiyan-world.github.io/goku/
New repo collection for NVIDIA Cosmos: https://github.com/nvidia-cosmos
A generative world for general-purpose robotics & embodied AI learning.
Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks
A high-performance Python-based I/O system for large (and small) deep learning problems, with strong support for PyTorch.
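This entry is the WebDataset library. A minimal sketch of its streaming pipeline, assuming the package is installed and using a hypothetical shard URL pattern:

```python
import torch
import webdataset as wds

# Hypothetical shard URL pattern; WebDataset expands the brace range itself.
url = "https://example.com/shards/train-{000000..000099}.tar"

dataset = (
    wds.WebDataset(url)
    .decode("pil")           # decode images inside the tar shards to PIL objects
    .to_tuple("jpg", "cls")  # yield (image, label) pairs, keyed by file extension
    .batched(32)             # collate into batches inside the pipeline
)
# batch_size=None because batching already happened in the pipeline above.
loader = torch.utils.data.DataLoader(dataset, batch_size=None, num_workers=4)
```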
HunyuanVideo: A Systematic Framework For Large Video Generation Model
Qwen2.5-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud.
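A minimal sketch of image Q&A with Qwen2.5-VL through Hugging Face transformers, following the model card; the image URL below is an assumption, and the 7B-Instruct checkpoint is one of several sizes.

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{"role": "user", "content": [
    {"type": "image", "image": "https://example.com/photo.jpg"},  # hypothetical URL
    {"type": "text", "text": "Describe this image."},
]}]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
images, videos = process_vision_info(messages)
inputs = processor(text=[text], images=images, videos=videos,
                   padding=True, return_tensors="pt").to(model.device)
generated = model.generate(**inputs, max_new_tokens=128)
# Strip the prompt tokens before decoding the answer.
answer = processor.batch_decode(
    generated[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```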
LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve speech capabilities at the GPT-4o level.
Official Implementation of "Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining"
The repository provides code for running inference with the Meta Segment Anything Model 2 (SAM 2), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.
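A minimal sketch of point-prompted image segmentation with SAM 2, based on the repo's example notebooks; the checkpoint id, input file, and click coordinates are assumptions.

```python
import numpy as np
import torch
from PIL import Image
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Load a pretrained checkpoint from the Hugging Face Hub.
predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-large")

image = np.array(Image.open("example.jpg").convert("RGB"))  # hypothetical input
with torch.inference_mode():
    predictor.set_image(image)
    # Prompt with one click at pixel (x=500, y=375); label 1 marks foreground.
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[500, 375]]),
        point_labels=np.array([1]),
    )
```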
A high-throughput and memory-efficient inference and serving engine for LLMs
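This entry is vLLM. A minimal sketch of its offline-inference API, assuming the package is installed and the (assumed) model id fits in local GPU memory:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # any HF-compatible model id
sampling = SamplingParams(temperature=0.8, max_tokens=128)
outputs = llm.generate(["What is multimodal learning?"], sampling)
for out in outputs:
    print(out.outputs[0].text)
```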
SEED-Voken: A Series of Powerful Visual Tokenizers
[ICLR 2025 Spotlight] OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
Autoregressive Model Beats Diffusion: 🦙 Llama for Scalable Image Generation
MuLan: Adapting Multilingual Diffusion Models for 110+ Languages (adds multilingual support to any diffusion model without additional training)
Lumina-T2X is a unified framework for Text to Any Modality Generation
[ECCV2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization
[NeurIPS 2024 Best Paper][GPT beats diffusion🔥] [scaling laws in visual generation📈] Official impl. of "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction". An *ultra-simple, user-friendly yet state-of-the-art* codebase for autoregressive image generation!
[CVPR2024] Generative Region-Language Pretraining for Open-Ended Object Detection