Stars
HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation
Janus-Series: Unified Multimodal Understanding and Generation Models
MiniCPM-o 2.6: A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone
Clean, minimal, accessible reproduction of DeepSeek R1-Zero
An open-source implementation for fine-tuning the Qwen2-VL and Qwen2.5-VL series by Alibaba Cloud.
[ICCV2023] Official Implementation of "UniTR: A Unified and Efficient Multi-Modal Transformer for Bird’s-Eye-View Representation"
Official PyTorch implementation of "BEVFusion: A Simple and Robust LiDAR-Camera Fusion Framework"
[ICRA'23] BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation
SECOND for KITTI/NuScenes object detection
[ECCV 2024] OPEN: Object-wise Position Embedding for Multi-view 3D Object Detection
The official GitHub page for the review paper "Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models".
[AAAI-25] Cobra: Extending Mamba to Multi-modal Large Language Model for Efficient Inference
[ECCV 2022] This is the official implementation of BEVFormer, a camera-only framework for autonomous driving perception, e.g., 3D object detection and semantic map segmentation.
[CVPR 2022] PointCLIP: Point Cloud Understanding by CLIP
[CVPR 2023] Hierarchical Supervision and Shuffle Data Augmentation for 3D Semi-Supervised Object Detection
[AAAI 2024] BEV-MAE: Bird's Eye View Masked Autoencoders for Point Cloud Pre-training in Autonomous Driving Scenarios
[ECCV 2024] Embodied Understanding of Driving Scenarios
HEDNet (NeurIPS 2023) & SAFDNet (CVPR 2024 Oral)
CLIP (Contrastive Language-Image Pretraining): predict the most relevant text snippet given an image
🤖 PaddleViT: State-of-the-art Visual Transformer and MLP Models for PaddlePaddle 2.0+
[ICCV 2021] Official PyTorch implementation of "Discriminative Region-based Multi-Label Zero-Shot Learning"; SOTA results on NUS-WIDE and OpenImages
LAVIS - A One-stop Library for Language-Vision Intelligence
Official implementation of "Open-Vocabulary Multi-Label Classification via Multi-Modal Knowledge Transfer".
[ICCV 2023] SparseFusion: Fusing Multi-Modal Sparse Representations for Multi-Sensor 3D Object Detection