A long-term collection of RL practice built on Stable Baselines 3 (SB3). Each example covers different processing details, so no two are the same (except for the "hello world" environments CartPole and LunarLander).
- Latest version of SB3: Installation.
- Experimental features based on SB3: SB3 Contrib.
- Anything about Gym environments: Gym Documentation.
CartPole is a classical Gym environment (for details, see here).
See the code here.
Episode length | Episode reward |
---|---|
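For reference, a minimal sketch of the usual SB3 workflow on CartPole with default hyperparameters; the timestep budget here is illustrative, not necessarily what produced the curves above:

```python
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

# Train PPO with default hyperparameters on CartPole-v1
model = PPO("MlpPolicy", "CartPole-v1", verbose=1)
model.learn(total_timesteps=50_000)

# Evaluate the trained agent on the training environment
mean_reward, std_reward = evaluate_policy(model, model.get_env(), n_eval_episodes=10)
print(f"mean reward: {mean_reward:.1f} +/- {std_reward:.1f}")
```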
LunarLander is a classical Gym environment for rocket trajectory optimization (for details, see here).
See the code here.
Episode length | Episode reward |
---|---|
Gridworld is modified from a custom Gym environment (for details, see here), where an episode ends when the agent reaches the destination.
See the code here.
Random step | PPO agent |
---|---|
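A minimal skeleton of a gridworld with this termination rule might look like the sketch below (classic Gym API assumed; the grid size, rewards and layout are illustrative, not this repo's actual settings):

```python
import gym
import numpy as np
from gym import spaces


class GridWorldEnv(gym.Env):
    """Minimal gridworld sketch: the episode ends when the agent reaches the goal."""

    def __init__(self, size: int = 5):
        super().__init__()
        self.size = size
        self.action_space = spaces.Discrete(4)  # up, down, left, right
        self.observation_space = spaces.Box(0, size - 1, shape=(2,), dtype=np.float32)
        self.goal = np.array([size - 1, size - 1], dtype=np.float32)

    def reset(self):
        self.pos = np.zeros(2, dtype=np.float32)
        return self.pos.copy()

    def step(self, action):
        moves = np.array([(-1, 0), (1, 0), (0, -1), (0, 1)], dtype=np.float32)
        self.pos = np.clip(self.pos + moves[action], 0, self.size - 1)
        done = bool((self.pos == self.goal).all())   # terminate at the destination
        reward = 1.0 if done else -0.01              # small step penalty (illustrative)
        return self.pos.copy(), reward, done, {}
```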
Maze is a 2D gridworld-like environment.
- Masking invalid actions greatly speeds up the training of the neural network (see the sketch after this list).
- A 1D observation is used even though an image-like observation would be more natural: SB3's `CnnPolicy` requires image inputs to be at least 36x36, but luckily a flattened observation still works well.
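A minimal sketch of this setup, using `MaskablePPO` from SB3 Contrib (the `MazeEnv` class and its `valid_action_mask()` helper are placeholders for this repo's actual maze code):

```python
import numpy as np
from sb3_contrib import MaskablePPO
from sb3_contrib.common.wrappers import ActionMasker


def mask_fn(env) -> np.ndarray:
    # Boolean array of shape (n_actions,): True means the action is currently valid.
    return env.valid_action_mask()  # hypothetical helper on the maze environment


env = MazeEnv()                   # placeholder for the actual maze environment
env = ActionMasker(env, mask_fn)  # exposes the mask to MaskablePPO

# A flattened 1D observation works with MlpPolicy, sidestepping the 36x36
# minimum image size of the default CnnPolicy.
model = MaskablePPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100_000)
```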
See the code here.
Episode length | Episode reward |
---|---|
Random step | PPO agent |
---|---|
Can the agent make correct decisions when the evaluation environment differs from the training environment? Applying data augmentation during training may help. In this case, the evaluation environment is a symmetry-equivalent transformation of the training environment, so an easy idea is to generate equivalent batches from the original batch data.
To achieve this, I build a custom callback and manipulate the rollout buffer data directly (see how to customize a callback here). Nine kinds of data need to be handled (a sketch follows this list):
- Expand `observations`, `actions` and `action_masks`.
- Share `episode_starts` and `rewards`.
- Recompute `values` and `log_probs`.
- Recompute `returns` and `advantages`.
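A hedged sketch of such a callback for `MaskablePPO` is below. It doubles the buffer along the environment axis in `_on_rollout_end`, then lets SB3's own GAE routine recompute returns and advantages over the doubled data. The `mirror_*` functions are placeholders for the actual symmetry transform, a flat `Box` observation and a `Discrete` action space are assumed, and it relies on `self.locals` exposing `new_obs`/`dones` at the end of rollout collection; the implementation in the linked code may differ.

```python
import numpy as np
import torch as th
from stable_baselines3.common.callbacks import BaseCallback
from stable_baselines3.common.utils import obs_as_tensor


# Placeholder symmetry transforms -- replace with the actual mirroring logic.
def mirror_obs(obs: np.ndarray) -> np.ndarray:
    return obs  # e.g. horizontally flip the flattened maze layout


def mirror_actions(actions: np.ndarray) -> np.ndarray:
    return actions  # e.g. swap the "left"/"right" action indices


def mirror_masks(masks: np.ndarray) -> np.ndarray:
    return masks  # permute mask entries consistently with mirror_actions


class SymmetryAugmentCallback(BaseCallback):
    """Double the rollout buffer with symmetry-transformed transitions (MaskablePPO)."""

    def _on_step(self) -> bool:
        return True

    def _on_rollout_start(self) -> None:
        # Undo the doubling from the previous rollout before new data is collected.
        buf = self.model.rollout_buffer
        if buf.n_envs != self.training_env.num_envs:
            buf.n_envs = self.training_env.num_envs
            buf.reset()

    def _on_rollout_end(self) -> None:
        buf = self.model.rollout_buffer
        policy, device = self.model.policy, self.model.device
        n_steps, n_envs = buf.buffer_size, buf.n_envs

        # 1) Expand observations, actions and action_masks with mirrored copies.
        m_obs = mirror_obs(buf.observations)    # (n_steps, n_envs, obs_dim)
        m_act = mirror_actions(buf.actions)     # (n_steps, n_envs, 1), Discrete
        m_msk = mirror_masks(buf.action_masks)  # (n_steps, n_envs, n_actions)

        # 2) Recompute values and log_probs of the mirrored transitions
        #    under the current policy.
        with th.no_grad():
            m_values, m_log_probs, _ = policy.evaluate_actions(
                obs_as_tensor(m_obs.reshape(-1, *m_obs.shape[2:]), device),
                th.as_tensor(m_act.reshape(-1), device=device).long(),
                action_masks=th.as_tensor(m_msk.reshape(-1, m_msk.shape[-1]), device=device),
            )

        # 3) Concatenate along the env axis; episode_starts and rewards are shared.
        buf.observations = np.concatenate([buf.observations, m_obs], axis=1)
        buf.actions = np.concatenate([buf.actions, m_act], axis=1)
        buf.action_masks = np.concatenate([buf.action_masks, m_msk], axis=1)
        buf.episode_starts = np.tile(buf.episode_starts, (1, 2))
        buf.rewards = np.tile(buf.rewards, (1, 2))
        buf.values = np.concatenate(
            [buf.values, m_values.cpu().numpy().reshape(n_steps, n_envs)], axis=1)
        buf.log_probs = np.concatenate(
            [buf.log_probs, m_log_probs.cpu().numpy().reshape(n_steps, n_envs)], axis=1)
        buf.n_envs *= 2

        # 4) Recompute returns and advantages (GAE) over the doubled buffer,
        #    bootstrapping from the original and mirrored terminal observations.
        new_obs, dones = self.locals["new_obs"], self.locals["dones"]
        with th.no_grad():
            last_values = policy.predict_values(
                obs_as_tensor(np.concatenate([new_obs, mirror_obs(new_obs)]), device))
        buf.advantages = np.zeros_like(buf.rewards)  # resize before the in-place GAE pass
        buf.compute_returns_and_advantage(last_values, np.concatenate([dones, dones]))
```

Passing an instance via `model.learn(callback=SymmetryAugmentCallback())` is enough: SB3 calls `_on_rollout_end` after each rollout is collected and before the PPO update.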
See the code here.
Training environment | Evaluation environment |
---|---|
Interestingly, while the agent performed perfectly in the training environment, it got stuck in the evaluation environment.
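A quick way to quantify this gap is to evaluate the same model in both environments with the masking-aware `evaluate_policy` from SB3 Contrib (`model`, `train_env` and `eval_env` stand in for the trained agent and the two maze variants used here):

```python
from sb3_contrib.common.maskable.evaluation import evaluate_policy

# Compare mean episode reward on the training and evaluation mazes.
for name, env in [("training", train_env), ("evaluation", eval_env)]:
    mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=20, deterministic=True)
    print(f"{name} environment: {mean_reward:.2f} +/- {std_reward:.2f}")
```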
The KL divergence helps explain the gap: a large value of `approx_kl` seems to indicate that the network is not yet fully fitted:
total_steps = 8e4 | total_steps = 15e4 |
---|---|
approx_kl = 2.1942 | approx_kl = 0.1253 |