This repository provides the official PyTorch implementation of the following paper:
GRAPE: Generalizing Robot Policy via Preference Alignment
Zijian Zhang1,*, Kaiyuan Zheng2,*, Zhaorun Chen3,*, Joel Jang2, Yi Li2, Chaoqi Wang3, Mingyu Ding1, Dieter Fox2, Huaxiu Yao1
1UNC Chapel-Hill, 2University of Washington, 3University of Chicago
* Equal contribution
Installation | Training VLA model via TPO-LoRA | Datasets | Evaluating GRAPE | Project Website
We release the codebase for the GRAPE framework, which includes the following components:
- Customized Cost Generation: GRAPE decomposes complex manipulation tasks into multiple independent stages and leverages Vision-Language Models (VLMs) to generate relevant constraints for each stage.
- Iterative Trajectory-wise Preference Optimization (TPO): Our iterative TPO framework enables refinement and improvement of VLA models over multiple training cycles (see the schematic loss sketch below this list).
- Model Evaluation: Comprehensive evaluation of the GRAPE framework is supported on two benchmarks, Simpler-Env and LIBERO, providing rigorous testing of generalizability and performance.
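For intuition, the sketch below shows a minimal trajectory-wise, DPO-style preference loss of the kind TPO builds on. It is illustrative only: the function name, argument names, and default beta are placeholders, and the actual objective implemented in finetune.py may differ (e.g., in how trajectory log-probabilities are computed and scaled).

```python
import torch
import torch.nn.functional as F

def tpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Trajectory-wise, DPO-style preference loss (illustrative sketch only).

    Each *_logps tensor holds the summed log-probability of all actions in a
    trajectory under the trainable policy / the frozen reference model.
    """
    chosen_margin = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_margin = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the policy to assign relatively higher likelihood to chosen trajectories.
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()
```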
Built on top of OpenVLA (https://github.com/openvla/openvla).
Use the setup commands below to get started:
# Create and activate conda environment
conda create -n GRAPE python=3.10 -y
conda activate GRAPE
conda install pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch -c nvidia -y
# Install the modified openvla repo
cd TPO-Train
pip install -e .
# Install Flash Attention 2 for training (https://github.com/Dao-AILab/flash-attention)
# =>> If you run into difficulty, try `pip cache remove flash_attn` first
pip install packaging ninja
pip install "flash-attn==2.5.5" --no-build-isolation
If you run into any problems during the installation process, please file a GitHub Issue.
In this section, we discuss training the OpenVLA model via Trajectory-wise Preference Optimization (TPO). The main script for LoRA training is finetune.py.
Below we show an example of how you can train the main OpenVLA-SFT checkpoint (openvla-7b) via TPO-LoRA. Here we use a single A100 GPU with 80 GB VRAM. (Note: we currently support only batch_size=1 and single-GPU training, which means each batch contains one pair of trajectories. We will support more settings and full fine-tuning in the future.)
Now, launch the TPO-LoRA script, as shown below.
torchrun --standalone --nnodes=1 --nproc-per-node 1 finetune.py \
--vla_path <PATH TO REFERENCE MODEL> \
--dataset_name "rlds_np_rollout" \
--chosen_traj_dir <PATH TO CHOSEN TRAJECTORY DATASET> \
--rejected_traj_dir <PATH TO REJECTED TRAJECTORY DATASET> \
--run_root_dir <PATH TO LOG/CHECKPOINT DIR> \
--adapter_tmp_dir <PATH TO TEMPORARY DIR TO SAVE ADAPTER WEIGHTS> \
--lora_rank 32 \
--batch_size 1 \
--grad_accumulation_steps 1 \
--learning_rate 2e-5 \
--image_aug False \
--wandb_project <YOUR PROJECT NAME> \
--wandb_entity <YOUR ENTITY> \
--save_steps 1000
For details about chosen_traj and rejected_traj, you can refer to the Datasets section.
Our datasets follow the RLDS format. Specifically, we ensured the trajectories were paired one-to-one when building the chosen_traj and rejected_traj datasets: the nth trajectory in chosen_traj corresponds to the nth trajectory in rejected_traj, and the two come from the same task with the same initial state.
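As an illustration of this one-to-one pairing (not the actual data-loading code in finetune.py), the two RLDS datasets can be zipped so that the nth chosen and nth rejected trajectories always arrive together; the paths below are placeholders:

```python
import tensorflow as tf
import tensorflow_datasets as tfds

# Placeholder paths -- point these at your converted chosen / rejected RLDS directories.
chosen = tfds.builder_from_directory("<PATH TO CHOSEN TRAJECTORY DATASET>").as_dataset(split="train")
rejected = tfds.builder_from_directory("<PATH TO REJECTED TRAJECTORY DATASET>").as_dataset(split="train")

# Because the datasets are paired one-to-one, zipping yields (chosen_n, rejected_n)
# trajectories that come from the same task and the same initial state.
for chosen_traj, rejected_traj in tf.data.Dataset.zip((chosen, rejected)):
    pass  # e.g., build one TPO preference pair from the two trajectories
```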
Note that our Guided-cost Preference Generation is available in Simpler-Env. Support for the Real-World environment will be provided later.
Our data collection in Simpler-Env is embedded in Simpler-Env evaluation. You can collect trajectory data by modifying Simpler-Env files.
What you need to do for data collection is:
- Overwrite ./Simpler-env/simpler_env/evaluation/maniskill2_evaluator.py with ./Data Collection/maniskill2_evaluator.py.
- Overwrite the modeling_prismatic.py in your TPO model's folder with ./Data Collection/modeling_prismatic.py.
- Then refer to Simpler-Env for running the evaluation.
For preference generation, the relevant code can be found in /Data Collection/maniskill2_evaluator.py. The final GCPG reward of each trajectory is recorded in its filename, so you can rank the trajectories of a task by their GCPG reward, as in the helper sketch below.
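If you want to rank collected trajectories programmatically, a small helper like the hypothetical one below can parse the reward out of the filenames. The filename pattern is a guess here (we simply take the last float in each name), so adjust the regex to whatever maniskill2_evaluator.py actually writes:

```python
import re
from pathlib import Path

def rank_by_gcpg_reward(traj_dir: str):
    """Rank saved trajectories by the GCPG reward embedded in their filenames.

    Assumes the reward appears as the last float in each filename; adjust the
    regex to match the exact naming used by maniskill2_evaluator.py.
    """
    scored = []
    for path in Path(traj_dir).iterdir():
        floats = re.findall(r"-?\d+\.\d+", path.name)
        if floats:
            scored.append((float(floats[-1]), path))
    # Highest GCPG reward first.
    return sorted(scored, key=lambda item: item[0], reverse=True)
```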
It is highly recommended that you adjust beta and threshold for your experiments, since how well the GCPG reward works depends on these settings.
Our data collection in LIBERO is embedded in LIBERO evaluation, too. You can collect trajectory data by modifying LIBERO files.
What you need to do is:
- Overwrite experiments/robot/libero/run_libero_eval.py with ./Data Collection/libero_data_collect.py.
- Overwrite the modeling_prismatic.py in your TPO model's folder with ./Data Collection/modeling_prismatic.py.
- Then refer to LIBERO for running the evaluation.
Our RLDS Dataset Conversion is based on the repo from OpenVLA team: https://github.com/kpertsch/rlds_dataset_builder
To convert the .npy files into RLDS datasets that can be used for TPO:
- Follow the instructions in https://github.com/kpertsch/rlds_dataset_builder to complete the relevant installation and modify the file names.
- Replace the code in example_dataset/example_dataset_dataset_builder.py with the code from our rlds_convert.py.
- You will then obtain the RLDS dataset you need.
Note that the chosen dataset and the rejected dataset should be converted separately.
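Concretely, following the rlds_dataset_builder instructions, the conversion is run once per preference side, roughly as follows (the directory and dataset names are illustrative; see that repo's README for the exact workflow):

```bash
# Run the conversion once per preference side.
cd rlds_dataset_builder/example_dataset
# 1) Point example_dataset_dataset_builder.py at the chosen .npy rollouts, then build:
tfds build
# 2) Point it at the rejected .npy rollouts (under a different dataset name) and build again:
tfds build
```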
We support two evaluation benchmarks in simulation environments: Simpler-Env and LIBERO.
Note: We use Colab for our evaluation experiments. Settings on Colab and on a local GPU may differ. Please feel free to file a GitHub issue if you run into any problems here.
Use the setup commands below to get started:
# Install vulkan for rendering
apt-get install -yqq --no-install-recommends libvulkan-dev vulkan-tools
# below fixes some bugs introduced by some recent Colab changes
mkdir -p /usr/share/vulkan/icd.d
wget -q -P /usr/share/vulkan/icd.d https://raw.githubusercontent.com/haosulab/ManiSkill/main/docker/nvidia_icd.json
wget -q -O /usr/share/glvnd/egl_vendor.d/10_nvidia.json https://raw.githubusercontent.com/haosulab/ManiSkill/main/docker/10_nvidia.json
# Install Real2Sim
pip install numpy==1.24.4
pip install orbax-checkpoint==0.4.4
pip install scipy==1.12.0
pip install keras==2.15.0
pip install tensorflow==2.15.1
# Install OpenVLA dependency
pip install torch==2.3.1 torchvision==0.18.1 timm==0.9.10 tokenizers==0.15.2 accelerate==0.32.1
pip install flash-attn==2.6.1 --no-build-isolation
pip install --quiet tf_agents
pip install --quiet mediapy
pip install peft
# Install Simpler-Env
cd Simpler-Env/ManiSkill2_real2sim
pip install -e .
cd ..
pip install -e .
Note that if you want to evaluate the SFT model we released, the unnorm_key in Simpler-env/simpler_env/policies/openvla/openvla_model.py should be set to "bridge_orig", which aligns with OpenVLA's setting.
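For reference, the unnorm_key is ultimately passed to OpenVLA's action prediction call. The snippet below is only an illustration based on OpenVLA's documented Hugging Face API; the model path, prompt, and placeholder image are assumptions, and the evaluation script handles all of this internally (see openvla_model.py for the exact code):

```python
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

# Illustrative only -- the evaluation script handles this internally.
model_path = "/path/to/tpo_model"
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda")

image = Image.new("RGB", (224, 224))  # placeholder frame; use the real camera observation
prompt = "In: What action should the robot take to put the carrot on the plate?\nOut:"
inputs = processor(prompt, image).to("cuda", dtype=torch.bfloat16)

# unnorm_key selects the action de-normalization statistics; the released SFT model
# uses "bridge_orig", matching OpenVLA's Bridge setup.
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
```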
python simpler_env/main_inference.py --policy-model openvla --ckpt-path "/path/to/tpo_model" \
--robot widowx --policy-setup widowx_bridge \
--control-freq 5 --sim-freq 500 --max-episode-steps 100 \
--env-name PutCarrotOnPlateInScene-v0 --scene-name bridge_table_1_v1 \
--rgb-overlay-path ./ManiSkill2_real2sim/data/real_inpainting/bridge_real_eval_1.png \
--robot-init-x 0.147 0.147 1 --robot-init-y 0.028 0.028 1 --obj-variation-mode episode --obj-episode-range 0 50 \
--robot-init-rot-quat-center 0 0 0 1 --robot-init-rot-rpy-range 0 0 1 0 0 1 0 0 1
python simpler_env/main_inference.py --policy-model openvla --ckpt-path "/path/to/tpo_model" \
--robot widowx --policy-setup widowx_bridge \
--control-freq 5 --sim-freq 500 --max-episode-steps 100 \
--env-name StackGreenCubeOnYellowCubeBakedTexInScene-v0 --scene-name bridge_table_1_v1 \
--rgb-overlay-path ./ManiSkill2_real2sim/data/real_inpainting/bridge_real_eval_1.png \
--robot-init-x 0.147 0.147 1 --robot-init-y 0.028 0.028 1 --obj-variation-mode episode --obj-episode-range 0 20 \
--robot-init-rot-quat-center 0 0 0 1 --robot-init-rot-rpy-range 0 0 1 0 0 1 0 0 1
python simpler_env/main_inference.py --policy-model openvla --ckpt-path "/path/to/tpo_model" \
--robot widowx --policy-setup widowx_bridge \
--control-freq 5 --sim-freq 500 --max-episode-steps 100 \
--env-name PutSpoonOnTableClothInScene-v0 --scene-name bridge_table_1_v1 \
--rgb-overlay-path ./ManiSkill2_real2sim/data/real_inpainting/bridge_real_eval_1.png \
--robot-init-x 0.147 0.147 1 --robot-init-y 0.028 0.028 1 --obj-variation-mode episode --obj-episode-range 0 20 \
--robot-init-rot-quat-center 0 0 0 1 --robot-init-rot-rpy-range 0 0 1 0 0 1 0 0 1
python simpler_env/main_inference.py --policy-model openvla --ckpt-path "/path/to/tpo_model" \
--robot widowx_sink_camera_setup --policy-setup widowx_bridge \
--control-freq 5 --sim-freq 500 --max-episode-steps 100 \
--env-name PutEggplantInBasketScene-v0 --scene-name bridge_table_1_v2 \
--rgb-overlay-path ./ManiSkill2_real2sim/data/real_inpainting/bridge_sink.png \
--robot-init-x 0.127 0.127 1 --robot-init-y 0.06 0.06 1 --obj-variation-mode episode --obj-episode-range 0 20 \
--robot-init-rot-quat-center 0 0 0 1 --robot-init-rot-rpy-range 0 0 1 0 0 1 0 0 1
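If you prefer to run the three bridge-table tasks back to back, a small wrapper such as the following (purely a convenience sketch that reproduces the commands above, with their per-task episode counts) can be used; the eggplant task uses a different robot, scene, and overlay, so it is left out:

```bash
CKPT="/path/to/tpo_model"
for TASK in "PutCarrotOnPlateInScene-v0 50" \
            "StackGreenCubeOnYellowCubeBakedTexInScene-v0 20" \
            "PutSpoonOnTableClothInScene-v0 20"; do
  set -- $TASK  # split into: env name, episode count
  python simpler_env/main_inference.py --policy-model openvla --ckpt-path "$CKPT" \
    --robot widowx --policy-setup widowx_bridge \
    --control-freq 5 --sim-freq 500 --max-episode-steps 100 \
    --env-name "$1" --scene-name bridge_table_1_v1 \
    --rgb-overlay-path ./ManiSkill2_real2sim/data/real_inpainting/bridge_real_eval_1.png \
    --robot-init-x 0.147 0.147 1 --robot-init-y 0.028 0.028 1 \
    --obj-variation-mode episode --obj-episode-range 0 "$2" \
    --robot-init-rot-quat-center 0 0 0 1 --robot-init-rot-rpy-range 0 0 1 0 0 1 0 0 1
done
```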
Clone and install the LIBERO repo:
git clone https://github.com/Lifelong-Robot-Learning/LIBERO.git
cd LIBERO
pip install -e .
Additionally, install other required packages:
cd openvla
pip install -r experiments/robot/libero/libero_requirements.txt
To start evaluation, run one of the commands below. Each will automatically download the appropriate checkpoint listed above.
# Launch LIBERO-Spatial evals
python experiments/robot/libero/run_libero_eval.py \
--model_family openvla \
--pretrained_checkpoint <PATH TO YOUR TPO MODEL> \
--task_suite_name libero_spatial \
--center_crop True
# Launch LIBERO-Object evals
python experiments/robot/libero/run_libero_eval.py \
--model_family openvla \
--pretrained_checkpoint <PATH TO YOUR TPO MODEL> \
--task_suite_name libero_object \
--center_crop True
# Launch LIBERO-Goal evals
python experiments/robot/libero/run_libero_eval.py \
--model_family openvla \
--pretrained_checkpoint <PATH TO YOUR TPO MODEL> \
--task_suite_name libero_goal \
--center_crop True
# Launch LIBERO-10 (LIBERO-Long) evals
python experiments/robot/libero/run_libero_eval.py \
--model_family openvla \
--pretrained_checkpoint <PATH TO YOUR TPO MODEL> \
--task_suite_name libero_10 \
--center_crop True
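To sweep all four suites in one go, a simple loop over the suite names works as well (just a convenience wrapper around the commands above):

```bash
for SUITE in libero_spatial libero_object libero_goal libero_10; do
  python experiments/robot/libero/run_libero_eval.py \
    --model_family openvla \
    --pretrained_checkpoint <PATH TO YOUR TPO MODEL> \
    --task_suite_name "$SUITE" \
    --center_crop True
done
```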
If you find our code or models useful in your work, please cite our paper:
@misc{zhang2024grape,
title={GRAPE: Generalizing Robot Policy via Preference Alignment},
author={Zijian Zhang and Kaiyuan Zheng and Zhaorun Chen and Joel Jang and Yi Li and Chaoqi Wang and Mingyu Ding and Dieter Fox and Huaxiu Yao},
year={2024},
eprint={2411.19309},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2411.19309},
}