# Awesome-Generalist-Agents

A curated list of papers for generalist AI agents in both virtual and physical worlds.


## Generalist Agents in Both Virtual and Physical Worlds

| Date | Keywords | Paper | Publication | Others |
| --- | --- | --- | --- | --- |
| May 2022 | Gato | A Generalist Agent | TMLR'22 | Report |
| Feb 2024 | Interactive Agent Foundation Model | An Interactive Agent Foundation Model | ArXiv'24 | Report |

## Generalist Embodied Agents

### Large Vision-Language (Action) Models

| Date | Keywords | Paper | Publication | Others |
| --- | --- | --- | --- | --- |
| Dec 2022 | RT-1 | RT-1: Robotics Transformer for Real-World Control at Scale | RSS'23 | Project |
| Mar 2023 | PaLM-E | PaLM-E: An Embodied Multimodal Language Model | ArXiv'23 | Project |
| Jul 2023 | RT-2 | RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control | ArXiv'23 | Project |
| Nov 2023 | LEO | An Embodied Generalist Agent in 3D World | ICML'24 | Project |
| Nov 2023 | RoboFlamingo | Vision-Language Foundation Models as Effective Robot Imitators | ArXiv'23 | Project |
| Dec 2023 | GR-1 | Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation | ArXiv'23 | Project |
| Mar 2024 | 3D-VLA | 3D-VLA: A 3D Vision-Language-Action Generative World Model | ICML'24 | Project |
| May 2024 | Octo | Octo: An Open-Source Generalist Robot Policy | ArXiv'24 | Project |
| Jun 2024 | OpenVLA | OpenVLA: An Open-Source Vision-Language-Action Model | CoRL'24 | Project |
| Jun 2024 | RoboUniView | RoboUniView: Visual-Language Model with Unified View Representation for Robotic Manipulation | ArXiv'24 | Project |
| Jun 2024 | LLARVA | LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning | ArXiv'24 | Project |
| Jul 2024 | Embodied-CoT | Robotic Control via Embodied Chain-of-Thought Reasoning | ArXiv'24 | Project |
| Sep 2024 | TinyVLA | TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation | ArXiv'24 | Project |
| Oct 2024 | GR-2 | GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation | ArXiv'24 | Project |
| Oct 2024 | LAPA | Latent Action Pretraining from Videos | ArXiv'24 | Project |
| Oct 2024 | π0 | π0: A Vision-Language-Action Flow Model for General Robot Control | ArXiv'24 | Project |
| Oct 2024 | RDT-1B | RDT-1B: A Diffusion Foundation Model for Bimanual Manipulation | ArXiv'24 | Project |
| Nov 2024 | CogACT | CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation | ArXiv'24 | Project |
| Nov 2024 | DeeR-VLA | DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution | ArXiv'24 | Project |
| Nov 2024 | RT-Affordance | RT-Affordance: Affordances are Versatile Intermediate Representations for Robot Manipulation | ArXiv'24 | Project |
| Dec 2024 | Diffusion-VLA | Diffusion-VLA: Scaling Robot Foundation Models via Unified Diffusion and Autoregression | ArXiv'24 | Project |
| Dec 2024 | RoboVLMs | Towards Generalist Robot Policies: What Matters in Building Vision-Language-Action Models | ArXiv'24 | Project |
| Dec 2024 | Moto | Moto: Latent Motion Token as the Bridging Language for Robot Manipulation | ArXiv'24 | Project |
| Dec 2024 | TraceVLA | TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies | ArXiv'24 | Project |
| Dec 2024 | NaVILA | NaVILA: Legged Robot Vision-Language-Action Model for Navigation | ArXiv'24 | Project |
| Jan 2025 | FAST | FAST: Efficient Action Tokenization for Vision-Language-Action Models | ArXiv'25 | Project |

### Generalist Robotics Policies

| Date | Keywords | Paper | Publication | Others |
| --- | --- | --- | --- | --- |
| Apr 2021 | Mt-Opt | Mt-Opt: Continuous Multi-Task Robotic Reinforcement Learning at Scale | ArXiv'21 | Project |
| Jan 2023 | UniPi | Learning Universal Policies via Text-Guided Video Generation | NeurIPS'23 | Project |
| Mar 2023 | MOO | Open-World Object Manipulation using Pre-trained Vision-Language Models | CoRL'23 | Project |
| Jun 2023 | RoboCat | RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation | ArXiv'23 | Report |
| Sep 2023 | RoboAgent | RoboAgent: Generalization and Efficiency in Robot Manipulation via Semantic Augmentations and Action Chunking | ICRA'24 | Project |
| Feb 2024 | Extreme Cross-Embodiment | Pushing the Limits of Cross-Embodiment Learning for Manipulation and Navigation | RSS'24 | Project |
| Jun 2024 | RoboPoint | RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics | CoRL'24 | Project |
| Aug 2024 | Crossformer | Scaling Cross-Embodied Learning: One Policy for Manipulation, Navigation, Locomotion and Aviation | CoRL'24 | Project |
| Sep 2024 | HPT | Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers | NeurIPS'24 | Project |
| Sep 2024 | RUMs | Robot Utility Models: General Policies for Zero-Shot Deployment in New Environments | ArXiv'24 | Project |
| Sep 2024 | FLaRe | FLaRe: Achieving Masterful and Adaptive Robot Policies with Large-Scale Reinforcement Learning Fine-Tuning | ArXiv'24 | Project |
| Sep 2024 | Neural MP | Neural MP: A Generalist Neural Motion Planner | ArXiv'24 | Project |
| Oct 2024 | Law in IL | Data Scaling Laws in Imitation Learning for Robotic Manipulation | ArXiv'24 | Project |
| Dec 2024 | RING | The One RING: a Robotic Indoor Navigation Generalist | ArXiv'24 | Project |
| Jan 2025 | FUSE | Beyond Sight: Finetuning Generalist Robot Policies with Heterogeneous Sensors via Language Grounding | ArXiv'25 | Project |

### Multimodal World Models

| Date | Keywords | Paper | Publication | Others |
| --- | --- | --- | --- | --- |
| Mar 2018 | World Models | World Models | ArXiv'18 | Project |
| Jan 2023 | DreamerV3 | Mastering Diverse Domains through World Models | ArXiv'23 | Project |
| Aug 2023 | Human World Model | Structured World Models from Human Videos | RSS'23 | Project |
| Feb 2024 | World Models | The Essential Role of Causality in Foundation World Models for Embodied AI | ArXiv'24 | Project |
| Nov 2024 | WHALE | WHALE: Towards Generalizable and Scalable World Models for Embodied Decision-making | ArXiv'24 | Project |

## Generalist Web Agents

### Generalist Agents for Simulated Worlds

| Date | Keywords | Paper | Publication | Others |
| --- | --- | --- | --- | --- |
| Dec 2023 | LARP | LARP: Language-Agent Role Play for Open-World Games | ArXiv'23 | Project |
| Feb 2024 | Agent-Pro | Agent-Pro: Learning to Evolve via Policy-Level Reflection and Optimization | ACL'24 | Project |
| Mar 2024 | SIMA | Scaling Instructable Agents Across Many Simulated Worlds | ArXiv'24 | Report |
| Aug 2024 | Optimus-1 | Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in Long-Horizon Tasks | ArXiv'24 | Project |

### Generalist Agents for Realistic Tasks

| Date | Keywords | Paper | Publication | Others |
| --- | --- | --- | --- | --- |
| Feb 2023 | Toolformer | Toolformer: Language Models Can Teach Themselves to Use Tools | NeurIPS'23 | Project |
| Mar 2023 | RCI | Language Models can Solve Computer Tasks | ArXiv'23 | Project |
| Mar 2023 | HuggingGPT | HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face | ArXiv'23 | Project |
| May 2023 | Pix2Act | From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces | NeurIPS'23 | Project |
| Jul 2023 | WebAgent | A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis | ICLR'24 | Project |
| Sep 2023 | LASER | LLM Agent with State-Space Exploration for Web Navigation | ArXiv'23 | Project |
| Sep 2023 | Auto-GUI | You Only Look at Screens: Multimodal Chain-of-Action Agents | ACL'24 | Project |
| Sep 2023 | Agents | Agents: An Open-source Framework for Autonomous Language Agents | ArXiv'23 | Project |
| Oct 2023 | AgentTuning | AgentTuning: Enabling Generalized Agent Abilities for LLMs | ArXiv'23 | Project |
| Dec 2023 | CogAgent | CogAgent: A Visual Language Model for GUI Agents | CVPR'24 | Project |
| Dec 2023 | AppAgent | AppAgent: Multimodal Agents as Smartphone Users | ArXiv'23 | Project |
| Dec 2023 | CLOVA | CLOVA: A Closed-LOop Visual Assistant with Tool Usage and Update | CVPR'24 | Project |
| Jan 2024 | SeeAct | GPT-4V(ision) is a Generalist Web Agent, if Grounded | ICML'24 | Project |
| Jan 2024 | Mobile-Agent | Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception | ArXiv'24 | Project |
| Jan 2024 | WebVoyager | WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models | ACL'24 | Project |
| Jan 2024 | SeeClick | SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents | ArXiv'24 | Project |
| Feb 2024 | OS-Copilot | OS-Copilot: Towards Generalist Computer Agents with Self-Improvement | ArXiv'24 | Project |
| Feb 2024 | ScreenAgent | ScreenAgent: A Vision Language Model-driven Computer Control Agent | ArXiv'24 | Project |
| Feb 2024 | Middleware | Middleware for LLMs: Tools Are Instrumental for Language Agents in Complex Environments | EMNLP'24 | Project |
| Apr 2024 | WILBUR | WILBUR: Adaptive In-Context Learning for Robust and Accurate Web Agents | ArXiv'24 | Project |
| Jul 2024 | OmniParser | OmniParser for Pure Vision Based GUI Agent | ArXiv'24 | Project |
| Aug 2024 | Agent Q | Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents | ArXiv'24 | Project |
| Oct 2024 | OS-ATLAS | OS-ATLAS: A Foundation Action Model for Generalist GUI Agents | ArXiv'24 | Project |
| Nov 2024 | ShowUI | ShowUI: One Vision-Language-Action Model for GUI Visual Agent | ArXiv'24 | Project |
| Jan 2025 | InfiGUIAgent | InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection | ArXiv'25 | Project |
| Jan 2025 | UI-TARS | UI-TARS: Pioneering Automated GUI Interaction with Native Agents | ArXiv'25 | Project |

## Datasets & Benchmarks

### For Embodied Agents

| Date | Keywords | Paper | Publication | Others |
| --- | --- | --- | --- | --- |
| Jun 2023 | LIBERO | LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning | NeurIPS'23 | Project |
| Oct 2023 | Open X-Embodiment | Open X-Embodiment: Robotic Learning Datasets and RT-X Models | ArXiv'24 | Project |
| Oct 2023 | GenSim | GenSim: Generating Robotic Simulation Tasks via Large Language Models | ICLR'24 | Project |
| May 2024 | Simpler | Evaluating Real-World Robot Manipulation Policies in Simulation | ArXiv'24 | Project |
| Jun 2024 | ManiSkill3 | ManiSkill3: GPU Parallelized Robotics Simulation and Rendering for Generalizable Embodied AI | ArXiv'24 | Project |
| Jul 2024 | RoboCasa | RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots | ArXiv'24 | Project |
| Jul 2024 | GRUtopia | GRUtopia: Dream General Robots in a City at Scale | ArXiv'24 | Project |
| Aug 2024 | ARIO | All Robots in One: A New Standard and Unified Dataset for Versatile, General-Purpose Embodied Agents | ArXiv'24 | Project |
| Oct 2024 | Genesis | Genesis: A Generative and Universal Physics Engine for Robotics and Beyond | ArXiv'24 | Project |
| Oct 2024 | GenSim2 | GenSim2: Scaling Robot Data Generation with Multi-modal and Reasoning LLMs | CoRL'24 | Project |
| Dec 2024 | RoboMIND | RoboMIND: Benchmark on Multi-embodiment Intelligence Normative Data for Robot Manipulation | ArXiv'24 | Project |
| Dec 2024 | VLABench | VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks | ArXiv'24 | Project |
| Jan 2025 | MuJoCo Playground | MuJoCo Playground | Report'25 | Project |

### For Web Agents

| Date | Keywords | Paper | Publication | Others |
| --- | --- | --- | --- | --- |
| Jul 2022 | WebShop | Towards Scalable Real-World Web Interaction with Grounded Language Agents | NeurIPS'22 | Project |
| May 2023 | Mobile-Env | Mobile-Env: An Evaluation Platform and Benchmark for Interactive Agents in LLM Era | ArXiv'23 | Project |
| Jun 2023 | Mind2Web | Mind2Web: Towards a Generalist Agent for the Web | NeurIPS'23 | Project |
| Jul 2023 | WebArena | WebArena: A Realistic Web Environment for Building Autonomous Agents | ICLR'24 | Project |
| Jul 2023 | ToolBench | ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs | ICLR'24 | Project |
| Jul 2023 | AITW | Android in the Wild: A Large-Scale Dataset for Android Device Control | ArXiv'23 | Project |
| Aug 2023 | AgentBench | AgentBench: Evaluating LLMs as Agents | ArXiv'23 | Project |
| Jan 2024 | VWA | VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks | ACL'24 | Project |
| Jan 2024 | A3 | A3: Android Agent Arena for Mobile GUI Agents | ArXiv'24 | Project |
| Feb 2024 | TravelPlanner | TravelPlanner: A Benchmark for Real-World Planning with Language Agents | ICML'24 | Project |
| Feb 2024 | OmniACT | OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web | ArXiv'24 | Dataset |
| Mar 2024 | WorkArena | WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks? | ArXiv'24 | Project |
| Apr 2024 | OSWorld | OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments | ArXiv'24 | Project |
| Jul 2024 | MMAU | MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains | ArXiv'24 | Project |
| Sep 2024 | WindowsAgentArena | Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale | ArXiv'24 | Project |

### General Benchmarks

| Date | Keywords | Paper | Publication | Others |
| --- | --- | --- | --- | --- |
| Aug 2024 | VisualAgentBench | VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents | ArXiv'24 | Project |

🌷

This list is actively maintained, and contributions are always welcome. If you find any interesting papers that are not included in this collection, feel free to open a pull request.

For any questions or suggestions, please contact Yongyuan Liang or Ruihan Yang.
