Highly modularized implementation of popular deep RL algorithms in PyTorch. My principle here is to reuse as many components as possible across different algorithms, use as few tricks as possible, and switch easily between classical control tasks like CartPole and Atari games with raw pixel inputs.
Implemented algorithms:
- Deep Q-Learning (DQN)
- Double DQN
- Dueling DQN
- Async Advantage Actor Critic (A3C)
- Async One-Step Q-Learning
- Async One-Step Sarsa
- Async N-Step Q-Learning
- Continuous A3C
- Distributed Deep Deterministic Policy Gradient (Distributed DDPG, aka D3PG)
- Hybrid Reward Architecture (HRA)
- Parallelized Proximal Policy Optimization (P3O, similar to DPPO)
- Action Conditional Video Prediction
Curves for CartPole are trivial, so I did not include them here. No fixed random seed was used.
The network and parameters here are exactly the same as in the DeepMind Nature paper. The training curve is smoothed with a window of size 100. All models were trained on a server with a Xeon E5-2620 v3 and a Titan X. For Breakout, a test is triggered every 1000 episodes with 50 repetitions. In total, 16M frames took about 4 days and 10 hours. For Pong, a test is triggered every 10 episodes with no repetition. In total, 4M frames took about 18 hours.
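The smoothing of the training curve can be sketched as a trailing moving average. This is only a minimal sketch: the function name `smooth` and the choice of a trailing window (rather than a centered one) are assumptions, not necessarily what produced the figures.

```python
def smooth(values, window=100):
    """Trailing moving average over at most `window` points.

    Early points are averaged over however many values are available so far.
    """
    smoothed = []
    running_sum = 0.0
    for i, v in enumerate(values):
        running_sum += v
        if i >= window:
            # Drop the value that just left the window.
            running_sum -= values[i - window]
        smoothed.append(running_sum / min(i + 1, window))
    return smoothed
```

For example, `smooth(episode_returns, window=100)` would return a list of the same length as `episode_returns`, where `episode_returns` is a hypothetical list of per-episode scores.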
I referred to this repo.
The network I used here is a smaller one with only a 42 * 42 input. Although the network for DQN also works here, it is quite slow.
Training A3C took about 2 hours (16 processes) on a server with two Xeon E5-2620 v3 CPUs, while the other async methods took about 1 day. The value-based async methods do work, but I don't know how to make them stable. This is the test curve. A test is triggered in a separate deterministic test process every 50K frames.
I referred to this repo for the parallelization.
For continuous A3C and DPPO, I use a fixed unit variance rather than a separate head, so the entropy weight is simply set to 0. Of course, you can also use another head to output the variance. In that case, a good practice is to bound the mean while leaving the variance unbounded, which is also included in the implementation.
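To illustrate why the entropy weight can be 0 with a fixed unit variance, here is a minimal pure-Python sketch of such a Gaussian policy. The function names are hypothetical, and squashing the mean with `tanh` into (-1, 1) is an assumption about how the mean might be bounded; this is not the repository's code.

```python
import math
import random

LOG_2PI = math.log(2.0 * math.pi)

def bounded_mean(raw_mean):
    """Squash each raw network output into (-1, 1) with tanh (assumed bound)."""
    return [math.tanh(m) for m in raw_mean]

def sample_action(mean, rng=random):
    """Draw one action from N(mean, I), i.e. unit variance in every dimension."""
    return [rng.gauss(m, 1.0) for m in mean]

def log_prob(action, mean):
    """Log-density of a unit-variance diagonal Gaussian, summed over dimensions."""
    return sum(-0.5 * (a - m) ** 2 - 0.5 * LOG_2PI for a, m in zip(action, mean))

def entropy(dim):
    """Entropy of N(mean, I). It does not depend on the mean."""
    return 0.5 * dim * (1.0 + LOG_2PI)
```

Because `entropy` depends only on the action dimension, its gradient with respect to the policy parameters is zero, so an entropy bonus would have no effect on training in this setting.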
Extra caution is necessary when computing gradients. The repo I referred to for DDPG computes the deterministic gradients incorrectly, at least at this commit. Theoretically, I believe that implementation should work, but in practice it does not. Even though this is PyTorch, you need to handle the gradients manually in this case. DDPG is not very stable.
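As a rough illustration of what handling the deterministic gradient manually means, the toy sketch below composes the gradient of the critic with respect to the action and the gradient of the action with respect to the actor parameter by hand. Everything here is hypothetical: the linear actor `a = w * s`, the critic `Q(s, a) = -(a - 2*s)**2`, and all function names are made up for illustration and are not taken from the repository.

```python
def actor(w, s):
    """Hypothetical deterministic policy: a = w * s."""
    return w * s

def dq_da(s, a):
    """dQ/da for the hypothetical critic Q(s, a) = -(a - 2*s)**2."""
    return -2.0 * (a - 2.0 * s)

def actor_gradient(w, states):
    """Batch average of dQ/da * da/dw, where da/dw = s for this actor."""
    total = 0.0
    for s in states:
        a = actor(w, s)
        total += dq_da(s, a) * s
    return total / len(states)

def train_actor(w=0.0, states=(0.5, 1.0, 1.5), lr=0.1, steps=200):
    """Gradient ascent on Q using the manually composed gradient.

    The critic is held fixed; only the actor parameter w is updated.
    """
    for _ in range(steps):
        w += lr * actor_gradient(w, states)
    return w
```

In this toy setup the critic is maximized at `a = 2 * s`, so `train_actor()` should drive `w` toward 2.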
Setting the number of workers to 1 reduces the implementation to exact DDPG. I had to adopt the most straightforward distribution method, as P3O- and A3C-style distribution does not work for DDPG. The figures were produced with 6 workers.
The differences between my implementation and DeepMind's DPPO are:
- The two use different algorithms under the name PPO.
- I use a much simpler A3C-like synchronization protocol.
The body of PPO is based on this repo. However, that implementation has two critical bugs, at least at this commit. Its computation of the clipped loss is correct for one-dimensional actions by accident but wrong for high-dimensional actions, and its computation of entropy is wrong in every case.
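For reference, one standard way to compute the clipped objective and the entropy for a diagonal Gaussian policy is sketched below in pure Python. This is not the repository's code and does not claim to show what the bugs in the referenced repo were; the function names and the default `clip=0.2` are assumptions. The point it illustrates is that the probability ratio is a single scalar per sample, built from the joint log-probability summed over all action dimensions.

```python
import math

def gaussian_log_prob(action, mean, std):
    """Joint log-density of a diagonal Gaussian: sum over action dimensions."""
    return sum(
        -0.5 * ((a - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2.0 * math.pi)
        for a, m, s in zip(action, mean, std)
    )

def gaussian_entropy(std):
    """Entropy of a diagonal Gaussian: sum of 0.5 * log(2*pi*e*sigma^2)."""
    return sum(0.5 * math.log(2.0 * math.pi * math.e * s * s) for s in std)

def clipped_surrogate(log_prob_new, log_prob_old, advantage, clip=0.2):
    """PPO clipped objective for one sample (a quantity to maximize)."""
    # One scalar ratio per sample, from the joint log-probabilities.
    ratio = math.exp(log_prob_new - log_prob_old)
    clipped_ratio = max(1.0 - clip, min(1.0 + clip, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)
```

With a one-dimensional action the sum over dimensions has a single term, so code that skips the sum can still give the right answer there while failing for higher-dimensional actions.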
I use 8 threads and a network with two tanh hidden layers of 64 units each.
Left: one-step prediction. Right: ground truth.
The prediction is sampled after 110K iterations, and I only implemented one-step training.
Tested on macOS 10.12 and CentOS 6.8
- OpenAI Gym
- Roboschool (Optional)
- PyTorch v0.3.0
- Python 2.7 or Python 3.6
- TensorboardX
- dataset.py: generate dataset for action conditional video prediction
- main.py: all other algorithms
- Human Level Control through Deep Reinforcement Learning
- Asynchronous Methods for Deep Reinforcement Learning
- Deep Reinforcement Learning with Double Q-learning
- Dueling Network Architectures for Deep Reinforcement Learning
- Playing Atari with Deep Reinforcement Learning
- HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent
- Deterministic Policy Gradient Algorithms
- Continuous control with deep reinforcement learning
- High-Dimensional Continuous Control Using Generalized Advantage Estimation
- Hybrid Reward Architecture for Reinforcement Learning
- Trust Region Policy Optimization
- Proximal Policy Optimization Algorithms
- Emergence of Locomotion Behaviours in Rich Environments
- Action-Conditional Video Prediction using Deep Networks in Atari Games