This codebase implements learning algorithms and experiments from Behavior Regularized Offline Reinforcement Learning.
If you use this codebase for your research, please cite the paper:
title={Behavior Regularized Offline Reinforcement Learning},
author={Wu, Yifan and Tucker, George and Nachum, Ofir},
journal={arXiv preprint arXiv:1911.11361},
- Behavior regularized actor critic framework for continuous control environments.
- Obtain partially trained policies by training online.
- Customizable data collection.
- Customizable offline algorithmic components: different divergences for regularization, value penalty or policy regularization, Q-ensemble.
- Pretrained behavior and cloned policies used in the paper.
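The choice between value penalty and policy regularization in the list above can be illustrated with a minimal sketch. It follows one reading of the paper's objectives: with value penalty, the divergence estimate is subtracted inside the critic's bootstrap target as well as in the actor objective; with policy regularization, it appears only in the actor objective. The function and argument names below are hypothetical and are not taken from this codebase.

```python
def critic_target(reward, discount, next_q, divergence, alpha, value_penalty):
    """Bootstrap target for the Q-function (scalar sketch).

    With value_penalty=True the divergence estimate at the next state is
    subtracted from the next-state value; with value_penalty=False
    (policy regularization) the target is the usual one.
    """
    penalty = alpha * divergence if value_penalty else 0.0
    return reward + discount * (next_q - penalty)


def actor_objective(q_value, divergence, alpha):
    """Quantity the policy maximizes in both variants (scalar sketch)."""
    return q_value - alpha * divergence


# Toy numbers only; they do not come from any experiment.
vp_target = critic_target(1.0, 0.5, 10.0, 2.0, 0.1, value_penalty=True)
pr_target = critic_target(1.0, 0.5, 10.0, 2.0, 0.1, value_penalty=False)
actor_value = actor_objective(10.0, 2.0, 0.1)
```

In this toy case the value-penalty target is slightly lower than the policy-regularization target, because the penalty is propagated through the bootstrap.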
The recommended way to set up these experiments is via a virtualenv.
sudo apt-get install python-pip
python -m pip install --user virtualenv
python -m virtualenv ~/env
source ~/env/bin/activate
Then install the project dependencies in that virtualenv (you will need to separately download appropriate MuJoCo files):
pip install -r requirements.txt
Augment your python path:
Go to behavior_regularized_offline_rl/brac as your working directory.
The first step is to get partially trained policies using Soft Actor Critic (SAC).
python -m train_online \
--sub_dir=0 \
--env_name=HalfCheetah-v2 \
--eval_target=4000 \
--agent_name=sac \
--total_train_steps=500000 \
--gin_bindings="train_eval_online.model_params=(((300, 300), (200, 200),), 2)" \
--gin_bindings="train_eval_online.batch_size=256" \
--gin_bindings="train_eval_online.optimizers=(('adam', 0.0005),)"
Here, eval_target is the performance threshold for saving the partially trained policy.
This will save a partially trained policy at
You may also specify --root_dir to replace '$HOME/tmp/offlinerl/policies' with other directories.
You can view training curves by launching TensorBoard on $HOME/tmp/offlinerl/policies/{ENV_NAME}/sac/{SUB_DIR} or any parent directory.
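The --gin_bindings flags in the training command above are name=value strings whose values are written as Python literals. When launching many runs from a script, it can help to build these strings from Python values instead of typing them by hand. The helper below is a hypothetical convenience and is not part of this codebase; the binding names are copied from the command above.

```python
def gin_binding(name, value):
    """Format one --gin_bindings flag, rendering value as a Python literal."""
    return "--gin_bindings={}={!r}".format(name, value)


# Same settings as in the train_online command above.
flags = [
    gin_binding("train_eval_online.model_params", (((300, 300), (200, 200)), 2)),
    gin_binding("train_eval_online.batch_size", 256),
    gin_binding("train_eval_online.optimizers", (("adam", 0.0005),)),
]

for flag in flags:
    print(flag)
```

If these flags are passed as list elements to subprocess.run, no extra shell quoting is needed, since each element is handed to the program as a single argument.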
The next step is to collect data using the partially trained policy. First, a data config file is needed to specify which policies to use; see configs/ as an example, and see for more information. You can customize data collection by writing your own dcfg_{DATA_NAME}.py.
To collect data, run
python -m collect_data \
--sub_dir=0 \
--env_name=HalfCheetah-v2 \
--data_name=example \
--config_file=dcfg_example \
where 'example' can be replaced by any DATA_NAME.
This will save the collected data at
Then you can train agents on these collected datasets by running
Supported agents are bc (behavior cloning), bcq, brac_primal, brac_dual.
For brac_primal, behavior cloning needs to be run first. You then need to specify the saved behavior policy checkpoint file when training brac_primal agents. See the following example:
python -m train_offline \
--sub_dir=$B_SUB_DIR \
--env_name=$ENV_NAME \
--agent_name=bc \
--data_name=$DATA_NAME \
--total_train_steps=300000 \
--n_train=$N_TRAIN \
--seed=$SEED \
--gin_bindings="train_eval_offline.model_params=((200, 200),)" \
--gin_bindings="train_eval_offline.batch_size=256" \
--gin_bindings="train_eval_offline.optimizers=(('adam', 5e-4),)"
This will save a behavior policy at
Then train a brac_primal agent with the pretrained behavior policy:
VALUE_PENALTY=True # False for policy regularization
python -m train_offline \
--sub_dir=$SUB_DIR \
--env_name=$ENV_NAME \
--agent_name=brac_primal \
--data_name=$DATA_NAME \
--total_train_steps=500000 \
--gin_bindings="brac_primal_agent.Agent.behavior_ckpt_file='$B_CKPT'" \
--gin_bindings="brac_primal_agent.Agent.alpha=$ALPHA" \
--gin_bindings="brac_primal_agent.Agent.value_penalty=$VALUE_PENALTY" \
--gin_bindings="train_eval_offline.model_params=(((300, 300), (200, 200),), 2)" \
--gin_bindings="train_eval_offline.batch_size=256" \
--gin_bindings="train_eval_offline.optimizers=(('adam', 1e-3), ('adam', $PLR), ('adam', 1e-3))"
This will save training and testing logs, checkpoints, etc., at
You may also specify --root_dir to replace '$HOME/tmp/offlinerl/learn' with other directories.
You can view training curves by launching TensorBoard on $HOME/tmp/offlinerl/learn/{ENV_NAME}/{DATA_NAME}/n{N_TRAIN}/{AGENT_NAME}/{SUB_DIR}/{SEED}
or any parent directory.
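For scripting, the log directory pattern above can be assembled as in the following sketch. The function name and its arguments are hypothetical; only the directory layout is taken from the path pattern given above.

```python
import os


def learn_dir(env_name, data_name, n_train, agent_name, sub_dir, seed,
              root_dir=None):
    """Build {root}/{ENV_NAME}/{DATA_NAME}/n{N_TRAIN}/{AGENT_NAME}/{SUB_DIR}/{SEED}."""
    if root_dir is None:
        # Default root as stated above: $HOME/tmp/offlinerl/learn
        root_dir = os.path.join(os.path.expanduser("~"), "tmp", "offlinerl", "learn")
    return os.path.join(
        root_dir,
        env_name,
        data_name,
        "n{}".format(n_train),
        agent_name,
        str(sub_dir),
        str(seed),
    )


# Example values only; root_dir is overridden to keep the output predictable.
run_dir = learn_dir("HalfCheetah-v2", "example", 1000000, "brac_primal", 0, 0,
                    root_dir="/tmp/offlinerl/learn")
print(run_dir)
```

The resulting path, or any of its parents, can be passed to tensorboard --logdir. Pointing at a parent directory shows several seeds or agents side by side.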
For brac_primal and brac_dual, two policies are evaluated (both can be seen in TensorBoard): 'main' simply takes the output of the policy network, while 'max_q' samples multiple actions and takes the one with the highest value according to the learned Q-function.
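The difference between the two evaluated policies can be sketched on a toy problem with one-dimensional actions. All names here (policy_mean, sample_action, q_fn, n_samples) are hypothetical stand-ins for the learned networks, and the number of sampled actions is an arbitrary choice for illustration.

```python
import random


def policy_mean(state):
    """Stand-in for the policy network's direct output."""
    return 0.0


def sample_action(state, rng):
    """Stand-in for sampling an action from the policy at this state."""
    return rng.gauss(policy_mean(state), 1.0)


def q_fn(state, action):
    """Stand-in for the learned Q-function; peaks at action = 1."""
    return -(action - 1.0) ** 2


def main_action(state):
    """'main': use the policy output as is."""
    return policy_mean(state)


def max_q_action(state, n_samples=10, rng=None):
    """'max_q': sample several actions and keep the one with the highest Q-value."""
    rng = rng or random.Random(0)
    candidates = [sample_action(state, rng) for _ in range(n_samples)]
    return max(candidates, key=lambda a: q_fn(state, a))


state = 0.0
print("main :", main_action(state))
print("max_q:", max_q_action(state, rng=random.Random(0)))
```

The 'max_q' variant costs n_samples extra Q-function evaluations per step and relies on the learned Q-function ranking the sampled actions sensibly.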
This repository includes a number of saved policies in the trained_policies
directory. For each environment, there is a SAC-trained policy used to collect
the data, named agent_partial_target*. To generate the data used in the paper,
simply use the script with the appropriate DATA_NAME, one of
pure, eps1, eps3, gaussian1, gaussian3.
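To go through all five dataset names in one pass, the commands can be generated programmatically. This sketch makes two assumptions: that collection is done with the collect_data module shown earlier, and that a config file named dcfg_{DATA_NAME} exists for each name. It only builds the argument lists and does not run anything.

```python
DATA_NAMES = ["pure", "eps1", "eps3", "gaussian1", "gaussian3"]


def collect_data_command(env_name, data_name, sub_dir=0):
    """Build the argument list for one data collection run."""
    return [
        "python", "-m", "collect_data",
        "--sub_dir={}".format(sub_dir),
        "--env_name={}".format(env_name),
        "--data_name={}".format(data_name),
        # Assumed naming convention: one dcfg_{DATA_NAME} config per dataset.
        "--config_file=dcfg_{}".format(data_name),
    ]


commands = [collect_data_command("HalfCheetah-v2", name) for name in DATA_NAMES]

for cmd in commands:
    print(" ".join(cmd))
```

Each list can be passed to subprocess.run(cmd, check=True) from the working directory described above once the corresponding config files are in place.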