Official implementation of
Policy Decorator: Model-Agnostic Online Refinement for Large Policy Model by
Xiu Yuan*, Tongzhou Mu*, Stone Tao, Yunhao Fang, Mengke Zhang, Hao Su (UC San Diego)
*Equal Contribution
[Webpage] [Paper] [Video] [Slides]
Large policy models learned by imitation learning are often limited by the quantity, quality, and diversity of demonstrations. We introduce Policy Decorator, which uses a model-agnostic residual policy to refine large policy models during online interactions. By implementing controlled exploration strategies, Policy Decorator enables stable, sample-efficient online learning.
Our evaluation spans eight tasks across two benchmarks—ManiSkill and Adroit—and involves two state-of-the-art imitation learning models (Behavior Transformer and Diffusion Policy). The results show Policy Decorator effectively improves the offline-trained policies and preserves the smooth motion of imitation learning models, avoiding the erratic behaviors of pure RL policies.
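At a high level, the refined policy executes the base policy's action plus a small, bounded correction from the residual policy. The sketch below is only illustrative; `base_policy`, `residual_policy`, and `res_scale` are placeholder names and do not mirror the actual implementation in this repo.

```python
def decorated_action(obs, base_policy, residual_policy, res_scale=0.1):
    """Illustrative sketch of the residual-policy idea behind Policy Decorator.

    The frozen base policy (e.g., Behavior Transformer or Diffusion Policy)
    proposes an action, and a small learned residual policy adds a bounded
    correction on top of it, keeping the refined behavior close to the
    smooth imitation-learned motion.
    """
    a_base = base_policy(obs)          # action from the frozen, offline-trained model
    a_res = residual_policy(obs)       # residual action, assumed squashed to [-1, 1]
    return a_base + res_scale * a_res  # bounded correction controlled by res_scale
```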
- Install all dependencies via `mamba` or `conda` by running the following commands:

  ```bash
  mamba env create -f environment.yml
  mamba activate pi-dec
  ```
  Note: `mamba` is a drop-in replacement for `conda`. Feel free to use `conda` if you prefer it.
- Download and link the necessary assets for ManiSkill:

  ```bash
  python -m mani_skill2.utils.download_asset partnet_mobility_faucet
  python -m mani_skill2.utils.download_asset partnet_mobility_chair
  ```

  This downloads assets to `./data`. You may move these assets to any location. Then, add the following line to your `~/.bashrc` or `~/.zshrc`:

  ```bash
  export MS2_ASSET_DIR=<path>/<to>/<data>
  ```

  and restart your terminal.
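Optionally, you can sanity-check the setup from Python before training. This is only a quick sketch; the exact folder names created by the download script may differ from what you expect, so inspect the output rather than relying on it.

```python
import os
from pathlib import Path

# Optional check: confirm MS2_ASSET_DIR is visible to Python and list the
# asset folders it contains (the folder names depend on what the download
# script actually created under ./data).
asset_dir = os.environ.get("MS2_ASSET_DIR")
assert asset_dir is not None, "MS2_ASSET_DIR is not set; restart your terminal after editing ~/.bashrc"
print("MS2_ASSET_DIR =", asset_dir)
print("Contents:", sorted(p.name for p in Path(asset_dir).iterdir()))
```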
Policy Decorator improves an offline-trained base policy via online interactions. Below, we provide examples of how to improve a base policy checkpoint online, as well as how to train your own base policies.
You can skip the base policy training and directly use the pre-trained base policy checkpoints (Behavior Transformer and Diffusion Policy) we provide. The base policy checkpoints can be downloaded here and should be put under `./checkpoints`. See the examples below on how to improve pre-trained base policy checkpoints with Policy Decorator.
The following commands should be run under the repo root dir.
Use Diffusion Policy as the base policy:
```bash
python online/pi_dec_diffusion_maniskill2.py --env-id PegInsertionSide-v2 --base-policy-ckpt checkpoints/diffusion_PegInsertionSide/checkpoints/best.pt --res-scale 0.1 --prog-explore 30_000
python online/pi_dec_diffusion_maniskill2.py --env-id TurnFaucet-v2 --base-policy-ckpt checkpoints/diffusion_TurnFaucet/checkpoints/best.pt --res-scale 0.1 --prog-explore 100_000 --total-timesteps 2_000_000
python online/pi_dec_diffusion_maniskill2.py --env-id PushChair-v2 --base-policy-ckpt checkpoints/diffusion_PushChair/checkpoints/best.pt --res-scale 0.2 --prog-explore 300_000 --gamma 0.9 --total-timesteps 2_000_000
```
Use Behavior Transformer as the base policy:
```bash
python online/pi_dec_bet_maniskill2.py --env-id StackCube-v0 --base-policy-ckpt checkpoints/bet_StackCube/checkpoints/best.pt --res-scale 0.03 --prog-explore 1_000_000
python online/pi_dec_bet_maniskill2.py --env-id PegInsertionSide-v2 --base-policy-ckpt checkpoints/bet_PegInsertionSide/checkpoints/best.pt --res-scale 0.3 --prog-explore 8_000_000 --policy-lr 3e-4 --q-lr 3e-4 --total-timesteps 10_000_000
python online/pi_dec_bet_maniskill2.py --env-id TurnFaucet-v2 --base-policy-ckpt checkpoints/bet_TurnFaucet/checkpoints/best.pt --res-scale 0.2 --prog-explore 500_000 --policy-lr 3e-4 --q-lr 3e-4
python online/pi_dec_bet_maniskill2.py --env-id PushChair-v2 --base-policy-ckpt checkpoints/bet_PushChair/checkpoints/best.pt --res-scale 0.2 --prog-explore 4_000_000 --total-timesteps 6_000_000
```
Note:
- If you want to use Weights and Biases (`wandb`) to track learning progress, please add `--track` to your commands.
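In these commands, `--res-scale` bounds the magnitude of the residual action and `--prog-explore` sets the length of the progressive exploration schedule. The snippet below sketches one plausible way such a schedule could combine the two; it is illustrative only, and the exact rule used here lives in `online/pi_dec_*.py` and the paper.

```python
import numpy as np

def maybe_apply_residual(a_base, a_res, step, prog_explore, res_scale, rng=np.random):
    """Hedged sketch of a progressive-exploration schedule.

    The probability of adding the learned residual grows linearly from 0 to 1
    over the first `prog_explore` environment steps, so early rollouts stay
    close to the base policy while later ones increasingly explore the
    refined behavior. This mirrors the --prog-explore flag only at a high
    level; consult the code for the exact schedule.
    """
    p_residual = min(step / prog_explore, 1.0)
    if rng.random() < p_residual:
        return a_base + res_scale * a_res  # bounded residual correction
    return a_base                          # fall back to the base policy's action
```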
Instead of using our pre-trained base policy checkpoints, you can also train the base policies yourself. The demonstration datasets can be downloaded here and should be put under `./data`.
The following commands should be run under the repo root dir.
Train Diffusion Policy:
```bash
python offline/diffusion_policy_unet_maniskill2.py --env-id PegInsertionSide-v2 --demo-path data/PegInsertionSide/trajectory.h5
python offline/diffusion_policy_unet_maniskill2.py --env-id TurnFaucet-v2 --demo-path data/TurnFaucet/trajectory.h5
python offline/diffusion_policy_unet_maniskill2.py --env-id PushChair-v2 --demo-path data/PushChair/trajectory.h5 --control-mode base_pd_joint_vel_arm_pd_joint_vel --total-iters 300_000
```
Train Behavior Transformer:
```bash
python offline/bet_maniskill2.py --env-id StackCube-v0 --demo-path data/StackCube/trajectory.h5 --control-mode pd_ee_delta_pos --batch-size 4096 --lr 0.001
python offline/bet_maniskill2.py --env-id PegInsertionSide-v2 --demo-path data/PegInsertionSide/trajectory.h5 --n-embedding 256
python offline/bet_maniskill2.py --env-id TurnFaucet-v2 --demo-path data/TurnFaucet/trajectory.h5
python offline/bet_maniskill2.py --env-id PushChair-v2 --demo-path data/PushChair/trajectory.h5 --control-mode base_pd_joint_vel_arm_pd_joint_vel --n-clusters 16
```
If you find our work useful, please consider citing our paper as follows:
```bibtex
@misc{pi_dec,
      title={Policy Decorator: Model-Agnostic Online Refinement for Large Policy Model},
      author={Yuan, Xiu and Mu, Tongzhou and Tao, Stone and Fang, Yunhao and Zhang, Mengke and Su, Hao},
      year={2024},
      eprint={2412.13630},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2412.13630},
}
```
This codebase is built upon the following repositories: ManiSkill Baselines, CleanRL, minGPT, BeT, and Diffusion Policy.
This project is licensed under the MIT License - see the `LICENSE` file for details. Note that the repository relies on third-party components, which are subject to their respective licenses.