Official implementation of
Policy Decorator: Model-Agnostic Online Refinement for Large Policy Model by
Xiu Yuan*, Tongzhou Mu*, Stone Tao, Yunhao Fang, Mengke Zhang, Hao Su (UC San Diego)
*Equal Contribution
[Webpage] [Paper] [Video] [Slides]
Large policy models learned by imitation learning are often limited by the quantity, quality, and diversity of demonstrations. We introduce Policy Decorator, which uses a model-agnostic residual policy to refine large policy models during online interactions. By implementing controlled exploration strategies, Policy Decorator enables stable, sample-efficient online learning.
Our evaluation spans eight tasks across two benchmarks—ManiSkill and Adroit—and involves two state-of-the-art imitation learning models (Behavior Transformer and Diffusion Policy). The results show Policy Decorator effectively improves the offline-trained policies and preserves the smooth motion of imitation learning models, avoiding the erratic behaviors of pure RL policies.
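At a high level, the refined policy executes the base policy's action plus a small, bounded correction from the residual policy. The sketch below is only illustrative; `base_policy`, `residual_policy`, and `res_scale` are placeholder names and do not mirror the actual implementation in this repo.

```python
def decorated_action(obs, base_policy, residual_policy, res_scale=0.1):
    """Illustrative sketch of the residual-policy idea behind Policy Decorator.

    The frozen base policy (e.g., Behavior Transformer or Diffusion Policy)
    proposes an action, and a small learned residual policy adds a bounded
    correction on top of it, keeping the refined behavior close to the
    smooth imitation-learned motion.
    """
    a_base = base_policy(obs)          # action from the frozen, offline-trained model
    a_res = residual_policy(obs)       # residual action, assumed squashed to [-1, 1]
    return a_base + res_scale * a_res  # bounded correction controlled by res_scale
```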
- Install all dependencies via `mamba` or `conda` by running the following commands:

  ```bash
  mamba env create -f environment.yml
  mamba activate pi-dec
  ```
  Note: `mamba` is a drop-in replacement for `conda`. Feel free to use `conda` if you prefer it.
- Download and link the necessary assets for ManiSkill:

  ```bash
  python -m mani_skill2.utils.download_asset partnet_mobility_faucet
  python -m mani_skill2.utils.download_asset partnet_mobility_chair
  ```

  This downloads assets to `./data`. You may move these assets to any location. Then, add the following line to your `~/.bashrc` or `~/.zshrc`:

  ```bash
  export MS2_ASSET_DIR=<path>/<to>/<data>
  ```

  and restart your terminal.
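Optionally, you can sanity-check the setup from Python before training. This is only a quick sketch; the exact folder names created by the download script may differ from what you expect, so inspect the output rather than relying on it.

```python
import os
from pathlib import Path

# Optional check: confirm MS2_ASSET_DIR is visible to Python and list the
# asset folders it contains (the folder names depend on what the download
# script actually created under ./data).
asset_dir = os.environ.get("MS2_ASSET_DIR")
assert asset_dir is not None, "MS2_ASSET_DIR is not set; restart your terminal after editing ~/.bashrc"
print("MS2_ASSET_DIR =", asset_dir)
print("Contents:", sorted(p.name for p in Path(asset_dir).iterdir()))
```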
Policy Decorator improves an offline-trained base policy via online interactions. Below, we provide examples of how to improve a base policy checkpoint online, as well as how to train your own base policies.
You can skip the base policy training and directly use the pre-trained base policy checkpoints (Behavior Transformer and Diffusion Policy) we provide. The base policy checkpoints can be downloaded here and should be put under `./checkpoints`. See the examples below on how to improve pre-trained base policy checkpoints with Policy Decorator.
The following commands should be run under the repo root dir.
Use Diffusion Policy as the base policy:
```bash
python online/pi_dec_diffusion_maniskill2.py --env-id PegInsertionSide-v2 --base-policy-ckpt checkpoints/diffusion_PegInsertionSide/checkpoints/best.pt --res-scale 0.1 --prog-explore 30_000
python online/pi_dec_diffusion_maniskill2.py --env-id TurnFaucet-v2 --base-policy-ckpt checkpoints/diffusion_TurnFaucet/checkpoints/best.pt --res-scale 0.1 --prog-explore 100_000 --total-timesteps 2_000_000
python online/pi_dec_diffusion_maniskill2.py --env-id PushChair-v2 --base-policy-ckpt checkpoints/diffusion_PushChair/checkpoints/best.pt --res-scale 0.2 --prog-explore 300_000 --gamma 0.9 --total-timesteps 2_000_000
```
Use Behavior Transformer as the base policy:
```bash
python online/pi_dec_bet_maniskill2.py --env-id StackCube-v0 --base-policy-ckpt checkpoints/bet_StackCube/checkpoints/best.pt --res-scale 0.03 --prog-explore 1_000_000
python online/pi_dec_bet_maniskill2.py --env-id PegInsertionSide-v2 --base-policy-ckpt checkpoints/bet_PegInsertionSide/checkpoints/best.pt --res-scale 0.3 --prog-explore 8_000_000 --policy-lr 3e-4 --q-lr 3e-4 --total-timesteps 10_000_000
python online/pi_dec_bet_maniskill2.py --env-id TurnFaucet-v2 --base-policy-ckpt checkpoints/bet_TurnFaucet/checkpoints/best.pt --res-scale 0.2 --prog-explore 500_000 --policy-lr 3e-4 --q-lr 3e-4
python online/pi_dec_bet_maniskill2.py --env-id PushChair-v2 --base-policy-ckpt checkpoints/bet_PushChair/checkpoints/best.pt --res-scale 0.2 --prog-explore 4_000_000 --total-timesteps 6_000_000
```
Note:
- If you want to use Weights and Biases (`wandb`) to track learning progress, please add `--track` to your commands.
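In these commands, `--res-scale` bounds the magnitude of the residual action and `--prog-explore` sets the length of the progressive exploration schedule. The snippet below sketches one plausible way such a schedule could combine the two; it is illustrative only, and the exact rule used here lives in `online/pi_dec_*.py` and the paper.

```python
import numpy as np

def maybe_apply_residual(a_base, a_res, step, prog_explore, res_scale, rng=np.random):
    """Hedged sketch of a progressive-exploration schedule.

    The probability of adding the learned residual grows linearly from 0 to 1
    over the first `prog_explore` environment steps, so early rollouts stay
    close to the base policy while later ones increasingly explore the
    refined behavior. This mirrors the --prog-explore flag only at a high
    level; consult the code for the exact schedule.
    """
    p_residual = min(step / prog_explore, 1.0)
    if rng.random() < p_residual:
        return a_base + res_scale * a_res  # bounded residual correction
    return a_base                          # fall back to the base policy's action
```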
Instead of using our pre-trained base policy checkpoints, you can also train the base policies yourself. The demonstration datasets can be downloaded here and should be put under `./data`.
The following commands should be run under the repo root dir.
Train Diffusion Policy:
```bash
python offline/diffusion_policy_unet_maniskill2.py --env-id PegInsertionSide-v2 --demo-path data/PegInsertionSide/trajectory.h5
python offline/diffusion_policy_unet_maniskill2.py --env-id TurnFaucet-v2 --demo-path data/TurnFaucet/trajectory.h5
python offline/diffusion_policy_unet_maniskill2.py --env-id PushChair-v2 --demo-path data/PushChair/trajectory.h5 --control-mode base_pd_joint_vel_arm_pd_joint_vel --total-iters 300_000
```
Train Behavior Transformer:
```bash
python offline/bet_maniskill2.py --env-id StackCube-v0 --demo-path data/StackCube/trajectory.h5 --control-mode pd_ee_delta_pos --batch-size 4096 --lr 0.001
python offline/bet_maniskill2.py --env-id PegInsertionSide-v2 --demo-path data/PegInsertionSide/trajectory.h5 --n-embedding 256
python offline/bet_maniskill2.py --env-id TurnFaucet-v2 --demo-path data/TurnFaucet/trajectory.h5
python offline/bet_maniskill2.py --env-id PushChair-v2 --demo-path data/PushChair/trajectory.h5 --control-mode base_pd_joint_vel_arm_pd_joint_vel --n-clusters 16
```
If you find our work useful, please consider citing our paper as follows:
```bibtex
@misc{pi_dec,
      title={Policy Decorator: Model-Agnostic Online Refinement for Large Policy Model},
      author={Yuan, Xiu and Mu, Tongzhou and Tao, Stone and Fang, Yunhao and Zhang, Mengke and Su, Hao},
      year={2024},
      eprint={2412.13630},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2412.13630},
}
```
This codebase is built upon the following repositories: ManiSkill Baselines, CleanRL, minGPT, BeT, and Diffusion Policy.
This project is licensed under the MIT License - see the `LICENSE` file for details. Note that the repository relies on third-party components, which are subject to their respective licenses.