Stars
Code for the paper "Improving Vision-and-Language Navigation with Image-Text Pairs from the Web" (ECCV 2020)
Grounded SAM: Marrying Grounding DINO with Segment Anything & Stable Diffusion & Recognize Anything - Automatically Detect, Segment, and Generate Anything
Official implementation of Layout-aware Dreamer for Embodied Referring Expression Grounding (AAAI'23).
VMamba: Visual State Space Models; code is based on Mamba
[CVPR 2025] Official PyTorch Implementation of MambaVision: A Hybrid Mamba-Transformer Vision Backbone
Visualizer for neural network, deep learning and machine learning models
A curated list of research papers in Vision-Language Navigation (VLN)
Reading list for research topics in embodied vision
Fully open reproduction of DeepSeek-R1
[ICCV 2023 Oral]: Scaling Data Generation in Vision-and-Language Navigation
Official Code for "Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents"
This is the official repository for MAGIC: Meta-Ability Guided Interactive Chain-of-Distillation Learning towards Efficient Vision-and-Language Navigation
[ECCV 2024] Official implementation of NavGPT-2: Unleashing Navigational Reasoning Capability for Large Vision-Language Models
Implementation of CoCa, Contrastive Captioners are Image-Text Foundation Models, in Pytorch
[CVPR 2024] The code for the paper "Towards Learning a Generalist Model for Embodied Navigation"
[CVPR 2025] RoomTour3D - Geometry-aware, cheap, and automatic data from web videos for embodied navigation
OpenMMLab Detection Toolbox and Benchmark
Repository for Vision-and-Language Navigation via Causal Learning (Accepted by CVPR 2024)
[CVPR'24] UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition
[CVPR 2023 Highlight] InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions
This is the official implementation of Deep Orthogonal Hypersphere Compression for Anomaly Detection, ICLR 2024 (Spotlight).
Official repository for "Revisiting Weakly Supervised Pre-Training of Visual Perception Models" (https://arxiv.org/abs/2201.08371)
[IEEE SPL 2023] CPR-CLIP: Multimodal Pre-training for Composite Error Recognition in CPR Training.
Code and Data for Paper: PanoGen: Text-Conditioned Panoramic Environment Generation for Vision-and-Language Navigation