# Awesome-Generalist-Agents

A curated list of papers for generalist AI agents in both virtual and physical worlds.


## Generalist Agents in Both Virtual and Physical Worlds

| Date | Keywords | Paper | Publication | Others |
| --- | --- | --- | --- | --- |
| May 2022 | Gato | A Generalist Agent | TMLR'22 | Report |
| Feb 2024 | Interactive Agent Foundation Model | An Interactive Agent Foundation Model | ArXiv'24 | Report |

## Generalist Embodied Agents

### Large Vision-Language (Action) Models

| Date | Keywords | Paper | Publication | Others |
| --- | --- | --- | --- | --- |
| Dec 2022 | RT-1 | RT-1: Robotics Transformer for Real-World Control at Scale | RSS'23 | Project |
| Mar 2023 | PaLM-E | PaLM-E: An Embodied Multimodal Language Model | ArXiv'23 | Project |
| Jul 2023 | RT-2 | RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control | ArXiv'23 | Project |
| Nov 2023 | LEO | An Embodied Generalist Agent in 3D World | ICML'24 | Project |
| Nov 2023 | RoboFlamingo | Vision-Language Foundation Models as Effective Robot Imitators | ArXiv'23 | Project |
| Dec 2023 | GR-1 | Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation | ArXiv'23 | Project |
| Mar 2024 | 3D-VLA | 3D-VLA: A 3D Vision-Language-Action Generative World Model | ICML'24 | Project |
| May 2024 | Octo | Octo: An Open-Source Generalist Robot Policy | ArXiv'24 | Project |
| Jun 2024 | OpenVLA | OpenVLA: An Open-Source Vision-Language-Action Model | CoRL'24 | Project |
| Jun 2024 | RoboUniView | RoboUniView: Visual-Language Model with Unified View Representation for Robotic Manipulation | ArXiv'24 | Project |
| Jun 2024 | LLARVA | LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning | ArXiv'24 | Project |
| Jul 2024 | Embodied-CoT | Robotic Control via Embodied Chain-of-Thought Reasoning | ArXiv'24 | Project |
| Sep 2024 | TinyVLA | TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation | ArXiv'24 | Project |
| Oct 2024 | GR-2 | GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation | ArXiv'24 | Project |
| Oct 2024 | LAPA | Latent Action Pretraining from Videos | ArXiv'24 | Project |
| Oct 2024 | π0 | π0: A Vision-Language-Action Flow Model for General Robot Control | ArXiv'24 | Project |
| Oct 2024 | RDT-1B | RDT-1B: A Diffusion Foundation Model for Bimanual Manipulation | ArXiv'24 | Project |
| Nov 2024 | CogACT | CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation | ArXiv'24 | Project |
| Nov 2024 | DeeR-VLA | DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution | ArXiv'24 | Project |
| Nov 2024 | RT-Affordance | RT-Affordance: Affordances are Versatile Intermediate Representations for Robot Manipulation | ArXiv'24 | Project |
| Dec 2024 | Diffusion-VLA | Diffusion-VLA: Scaling Robot Foundation Models via Unified Diffusion and Autoregression | ArXiv'24 | Project |
| Dec 2024 | RoboVLMs | Towards Generalist Robot Policies: What Matters in Building Vision-Language-Action Models | ArXiv'24 | Project |
| Dec 2024 | Moto | Moto: Latent Motion Token as the Bridging Language for Robot Manipulation | ArXiv'24 | Project |
| Dec 2024 | TraceVLA | TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies | ArXiv'24 | Project |
| Dec 2024 | NaVILA | NaVILA: Legged Robot Vision-Language-Action Model for Navigation | ArXiv'24 | Project |
| Jan 2025 | FAST | FAST: Efficient Action Tokenization for Vision-Language-Action Models | ArXiv'25 | Project |

### Generalist Robotics Policies

| Date | Keywords | Paper | Publication | Others |
| --- | --- | --- | --- | --- |
| Apr 2021 | Mt-Opt | Mt-Opt: Continuous Multi-Task Robotic Reinforcement Learning at Scale | ArXiv'21 | Project |
| Jan 2023 | UniPi | Learning Universal Policies via Text-Guided Video Generation | NeurIPS'23 | Project |
| Mar 2023 | MOO | Open-World Object Manipulation using Pre-trained Vision-Language Models | CoRL'23 | Project |
| Jun 2023 | RoboCat | RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation | ArXiv'23 | Report |
| Sep 2023 | RoboAgent | RoboAgent: Generalization and Efficiency in Robot Manipulation via Semantic Augmentations and Action Chunking | ICRA'24 | Project |
| Feb 2024 | Extreme Cross-Embodiment | Pushing the Limits of Cross-Embodiment Learning for Manipulation and Navigation | RSS'24 | Project |
| Jun 2024 | RoboPoint | RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics | CoRL'24 | Project |
| Aug 2024 | Crossformer | Scaling Cross-Embodied Learning: One Policy for Manipulation, Navigation, Locomotion and Aviation | CoRL'24 | Project |
| Sep 2024 | HPT | Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers | NeurIPS'24 | Project |
| Sep 2024 | RUMs | Robot Utility Models: General Policies for Zero-Shot Deployment in New Environments | ArXiv'24 | Project |
| Sep 2024 | FLaRe | FLaRe: Achieving Masterful and Adaptive Robot Policies with Large-Scale Reinforcement Learning Fine-Tuning | ArXiv'24 | Project |
| Sep 2024 | Neural MP | Neural MP: A Generalist Neural Motion Planner | ArXiv'24 | Project |
| Oct 2024 | Law in IL | Data Scaling Laws in Imitation Learning for Robotic Manipulation | ArXiv'24 | Project |
| Dec 2024 | RING | The One RING: a Robotic Indoor Navigation Generalist | ArXiv'24 | Project |
| Jan 2025 | FUSE | Beyond Sight: Finetuning Generalist Robot Policies with Heterogeneous Sensors via Language Grounding | ArXiv'25 | Project |

### Multimodal World Models

| Date | Keywords | Paper | Publication | Others |
| --- | --- | --- | --- | --- |
| Mar 2018 | World Models | World Models | ArXiv'18 | Project |
| Jan 2023 | DreamerV3 | Mastering Diverse Domains through World Models | ArXiv'23 | Project |
| Aug 2023 | Human World Model | Structured World Models from Human Videos | RSS'23 | Project |
| Feb 2024 | World Models | The Essential Role of Causality in Foundation World Models for Embodied AI | ArXiv'24 | Project |
| Nov 2024 | WHALE | WHALE: Towards Generalizable and Scalable World Models for Embodied Decision-making | ArXiv'24 | Project |

## Generalist Web Agents

### Generalist Agents for Simulated Worlds

| Date | Keywords | Paper | Publication | Others |
| --- | --- | --- | --- | --- |
| Dec 2023 | LARP | LARP: Language-Agent Role Play for Open-World Games | ArXiv'23 | Project |
| Feb 2024 | Agent-Pro | Agent-Pro: Learning to Evolve via Policy-Level Reflection and Optimization | ACL'24 | Project |
| Mar 2024 | SIMA | Scaling Instructable Agents Across Many Simulated Worlds | ArXiv'24 | Report |
| Aug 2024 | Optimus-1 | Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in Long-Horizon Tasks | ArXiv'24 | Project |

### Generalist Agents for Realistic Tasks

| Date | Keywords | Paper | Publication | Others |
| --- | --- | --- | --- | --- |
| Feb 2023 | Toolformer | Toolformer: Language Models Can Teach Themselves to Use Tools | NeurIPS'23 | Project |
| Mar 2023 | RCI | Language Models can Solve Computer Tasks | ArXiv'23 | Project |
| Mar 2023 | HuggingGPT | HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face | ArXiv'23 | Project |
| May 2023 | Pix2Act | From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces | NeurIPS'23 | Project |
| Jul 2023 | WebAgent | A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis | ICLR'24 | Project |
| Sep 2023 | LASER | LLM Agent with State-Space Exploration for Web Navigation | ArXiv'23 | Project |
| Sep 2023 | Auto-GUI | You Only Look at Screens: Multimodal Chain-of-Action Agents | ACL'24 | Project |
| Sep 2023 | Agents | Agents: An Open-source Framework for Autonomous Language Agents | ArXiv'23 | Project |
| Oct 2023 | AgentTuning | AgentTuning: Enabling Generalized Agent Abilities for LLMs | ArXiv'23 | Project |
| Dec 2023 | CogAgent | CogAgent: A Visual Language Model for GUI Agents | CVPR'24 | Project |
| Dec 2023 | AppAgent | AppAgent: Multimodal Agents as Smartphone Users | ArXiv'23 | Project |
| Dec 2023 | CLOVA | CLOVA: A Closed-LOop Visual Assistant with Tool Usage and Update | CVPR'24 | Project |
| Jan 2024 | SeeAct | GPT-4V(ision) is a Generalist Web Agent, if Grounded | ICML'24 | Project |
| Jan 2024 | Mobile-Agent | Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception | ArXiv'24 | Project |
| Jan 2024 | WebVoyager | WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models | ACL'24 | Project |
| Jan 2024 | SeeClick | SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents | ArXiv'24 | Project |
| Feb 2024 | OS-Copilot | OS-Copilot: Towards Generalist Computer Agents with Self-Improvement | ArXiv'24 | Project |
| Feb 2024 | ScreenAgent | ScreenAgent: A Vision Language Model-driven Computer Control Agent | ArXiv'24 | Project |
| Feb 2024 | Middleware | Middleware for LLMs: Tools Are Instrumental for Language Agents in Complex Environments | EMNLP'24 | Project |
| Apr 2024 | WILBUR | WILBUR: Adaptive In-Context Learning for Robust and Accurate Web Agents | ArXiv'24 | Project |
| Jul 2024 | OmniParser | OmniParser for Pure Vision Based GUI Agent | ArXiv'24 | Project |
| Aug 2024 | Agent Q | Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents | ArXiv'24 | Project |
| Oct 2024 | OS-ATLAS | OS-ATLAS: A Foundation Action Model for Generalist GUI Agents | ArXiv'24 | Project |
| Nov 2024 | ShowUI | ShowUI: One Vision-Language-Action Model for GUI Visual Agent | ArXiv'24 | Project |
| Jan 2025 | InfiGUIAgent | InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection | ArXiv'25 | Project |
| Jan 2025 | UI-TARS | UI-TARS: Pioneering Automated GUI Interaction with Native Agents | ArXiv'25 | Project |

## Datasets & Benchmarks

### For Embodied Agents

| Date | Keywords | Paper | Publication | Others |
| --- | --- | --- | --- | --- |
| Jun 2023 | LIBERO | LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning | NeurIPS'23 | Project |
| Oct 2023 | Open X-Embodiment | Open X-Embodiment: Robotic Learning Datasets and RT-X Models | ArXiv'24 | Project |
| Oct 2023 | GenSim | GenSim: Generating Robotic Simulation Tasks via Large Language Models | ICLR'24 | Project |
| May 2024 | Simpler | Evaluating Real-World Robot Manipulation Policies in Simulation | ArXiv'24 | Project |
| Jun 2024 | ManiSkill3 | ManiSkill3: GPU Parallelized Robotics Simulation and Rendering for Generalizable Embodied AI | ArXiv'24 | Project |
| Jul 2024 | RoboCasa | RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots | ArXiv'24 | Project |
| Jul 2024 | GRUtopia | GRUtopia: Dream General Robots in a City at Scale | ArXiv'24 | Project |
| Aug 2024 | ARIO | All Robots in One: A New Standard and Unified Dataset for Versatile, General-Purpose Embodied Agents | ArXiv'24 | Project |
| Oct 2024 | Genesis | Genesis: A Generative and Universal Physics Engine for Robotics and Beyond | ArXiv'24 | Project |
| Oct 2024 | GenSim2 | GenSim2: Scaling Robot Data Generation with Multi-modal and Reasoning LLMs | CoRL'24 | Project |
| Dec 2024 | RoboMIND | RoboMIND: Benchmark on Multi-embodiment Intelligence Normative Data for Robot Manipulation | ArXiv'24 | Project |
| Dec 2024 | VLABench | VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks | ArXiv'24 | Project |
| Jan 2025 | MuJoCo Playground | MuJoCo Playground | Report'25 | Project |

### For Web Agents

| Date | Keywords | Paper | Publication | Others |
| --- | --- | --- | --- | --- |
| Jul 2022 | WebShop | Towards Scalable Real-World Web Interaction with Grounded Language Agents | NeurIPS'22 | Project |
| May 2023 | Mobile-Env | Mobile-Env: An Evaluation Platform and Benchmark for Interactive Agents in LLM Era | ArXiv'23 | Project |
| Jun 2023 | Mind2Web | Mind2Web: Towards a Generalist Agent for the Web | NeurIPS'23 | Project |
| Jul 2023 | WebArena | WebArena: A Realistic Web Environment for Building Autonomous Agents | ICLR'24 | Project |
| Jul 2023 | ToolBench | ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs | ICLR'24 | Project |
| Jul 2023 | AITW | Android in the Wild: A Large-Scale Dataset for Android Device Control | ArXiv'23 | Project |
| Aug 2023 | AgentBench | AgentBench: Evaluating LLMs as Agents | ArXiv'23 | Project |
| Jan 2024 | VWA | VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks | ACL'24 | Project |
| Jan 2024 | A3 | A3: Android Agent Arena for Mobile GUI Agents | ArXiv'24 | Project |
| Feb 2024 | TravelPlanner | TravelPlanner: A Benchmark for Real-World Planning with Language Agents | ICML'24 | Project |
| Feb 2024 | OmniACT | OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web | ArXiv'24 | Dataset |
| Mar 2024 | WorkArena | WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks? | ArXiv'24 | Project |
| Apr 2024 | OSWorld | OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments | ArXiv'24 | Project |
| Jul 2024 | MMAU | MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains | ArXiv'24 | Project |
| Sep 2024 | WindowsAgentArena | Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale | ArXiv'24 | Project |

### General Benchmarks

| Date | Keywords | Paper | Publication | Others |
| --- | --- | --- | --- | --- |
| Aug 2024 | VisualAgentBench | VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents | ArXiv'24 | Project |

🌷

This list is actively maintained, and contributions are always welcome. If you find any interesting papers that are not included in this collection, feel free to open a pull request.

For any questions or suggestions, please contact Yongyuan Liang or Ruihan Yang.
