Updated README:
* Details on how to clone just the contrastive_rl subdirectory.
* Installation instructions for GPU support.
* Fixed image link so that it renders using GitHub's Markdown.

Fixed a bug so that offline RL experiments terminate after the specified number of gradient steps. (Thanks to Chongyi Zheng for catching this!)

Fixed a bug where offline RL experiments didn't use the behavioral cloning loss. (Thanks to Chongyi Zheng for catching this!)

PiperOrigin-RevId: 462194179
ben-eysenbach authored and copybara-github committed Jul 20, 2022
1 parent 654e36a commit 7836464
Showing 4 changed files with 27 additions and 11 deletions.
16 changes: 8 additions & 8 deletions contrastive_rl/README.md
@@ -1,10 +1,10 @@
-# Contrastive Learning as Goal-Conditioned Reinforcement Learning
+# [Contrastive Learning as Goal-Conditioned Reinforcement Learning](https://arxiv.org/pdf/2206.07568.pdf)
+<p align="center"><img src="contrastive_rl.png" width=60%></p>

<p align="center"> Benjamin Eysenbach, &nbsp; Tianjun Zhang, &nbsp; Ruslan Salakhutdinov &nbsp; Sergey Levine</p>
<p align="center">
-<a href="https://arxiv.org/abs/2206.07568">paper</a>
+Paper: <a href="https://arxiv.org/pdf/2206.07568.pdf">https://arxiv.org/pdf/2206.07568.pdf</a>
</p>
-![diagram of contrastive RL](contrastive_rl.png)

*Abstract*: In reinforcement learning (RL), it is easier to solve a task if given a good representation. While _deep_ RL should automatically acquire such good representations, prior work often finds that learning representations in an end-to-end fashion is unstable and instead equip RL algorithms with additional representation learning parts (e.g., auxiliary losses, data augmentation). How can we design RL algorithms that directly acquire good representations? In this paper, instead of adding representation learning parts to an existing RL algorithm, we show (contrastive) representation learning methods can be cast as RL algorithms in their own right. To do this, we build upon prior work and apply contrastive representation learning to action-labeled trajectories, in such a way that the (inner product of) learned representations exactly corresponds to a goal-conditioned value function. We use this idea to reinterpret a prior RL method as performing contrastive learning, and then use the idea to propose a much simpler method that achieves similar performance. Across a range of goal-conditioned RL tasks, we demonstrate that contrastive RL methods achieve higher success rates than prior non-contrastive methods. We also show that contrastive RL outperforms prior methods on image-based tasks, without using data augmentation or auxiliary objectives.
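As a rough illustration of the central idea in the abstract, with the inner product of learned representations acting as a goal-conditioned value function, the following JAX sketch shows one way a contrastive critic loss could be written. The encoder callables, parameter arguments, and batch shapes are assumptions made for this example; this is not the learning code in this repository.

```python
# Illustrative sketch only; not the learning code in this repository.
import jax.numpy as jnp
import optax


def contrastive_critic_loss(sa_params, g_params, sa_encoder, g_encoder,
                            states, actions, goals):
  """Binary NCE over a batch: diagonal pairs are positives.

  sa_encoder(sa_params, states, actions) -> [B, d] state-action features.
  g_encoder(g_params, goals) -> [B, d] goal features.
  The inner product of the two feature vectors plays the role of a
  goal-conditioned Q-value.
  """
  phi = sa_encoder(sa_params, states, actions)   # [B, d]
  psi = g_encoder(g_params, goals)               # [B, d]
  logits = phi @ psi.T                           # [B, B] pairwise inner products
  labels = jnp.eye(logits.shape[0])              # goal i was reached from (s_i, a_i)
  return jnp.mean(optax.sigmoid_binary_cross_entropy(logits, labels))
```

Under a loss of this form, larger inner products mark state-action pairs that are more likely to reach the paired goal, which is what lets the learned representations double as a goal-conditioned value function.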

@@ -21,11 +21,11 @@ This repository contains the new algorithms, some of the baselines, and the asso

### Installation

-1. Create an Anaconda environment: `conda create -n contrastive_rl python=3.9 -y`
-2. Activate the environment: `conda activate contrastive_rl`
-3. Install the dependencies: `pip install -r requirements.txt --no-deps`
-4. Check that the installation worked: `./run.sh`
+1. Clone the `contrastive_rl` repository: `svn export https://github.com/google-research/google-research/trunk/contrastive_rl; cd contrastive_rl`
+2. Create an Anaconda environment: `conda create -n contrastive_rl python=3.9 -y`
+3. Activate the environment: `conda activate contrastive_rl`
+4. Install the dependencies: `pip install -r requirements.txt --no-deps`
+5. Check that the installation worked: `chmod +x run.sh; ./run.sh`

### Running the experiments

6 changes: 5 additions & 1 deletion contrastive_rl/contrastive/distributed_layout.py
@@ -284,7 +284,11 @@ def actor(self, random_key, replay,
logger, observers=self._observers)

  def coordinator(self, counter, max_actor_steps):
-    return lp_utils.StepsLimiter(counter, max_actor_steps)
+    if self._builder._config.env_name.startswith('offline_ant'):  # pytype: disable=attribute-error, pylint: disable=protected-access
+      steps_key = 'learner_steps'
+    else:
+      steps_key = 'actor_steps'
+    return lp_utils.StepsLimiter(counter, max_actor_steps, steps_key=steps_key)

  def build(self, name='agent', program = None):
    """Build the distributed agent topology."""
10 changes: 10 additions & 0 deletions contrastive_rl/contrastive/learning.py
@@ -255,6 +255,16 @@ def actor_loss(policy_params,
      q_action = jnp.min(q_action, axis=-1)
      actor_loss = alpha * log_prob - jnp.diag(q_action)

+      assert 0.0 <= config.bc_coef <= 1.0
+      if config.bc_coef > 0:
+        orig_action = transitions.action
+        if config.random_goals == 0.5:
+          orig_action = jnp.concatenate([orig_action, orig_action], axis=0)
+
+        bc_loss = -1.0 * networks.log_prob(dist_params, orig_action)
+        actor_loss = (config.bc_coef * bc_loss
+                      + (1 - config.bc_coef) * actor_loss)
+
      return jnp.mean(actor_loss)

alpha_grad = jax.value_and_grad(alpha_loss)
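The added lines blend a behavioral cloning term into the actor loss with a single coefficient. A minimal standalone sketch of that blending, with illustrative names rather than the repository's actual actor_loss, looks like this:

```python
# Illustrative sketch of blending a BC term into an actor loss; not repo code.
import jax.numpy as jnp


def blended_actor_loss(rl_actor_loss, dataset_action_log_prob, bc_coef):
  """Mixes an RL actor loss with a behavioral-cloning (max-likelihood) loss.

  rl_actor_loss:            per-example RL actor loss, shape [B]
  dataset_action_log_prob:  log pi(a_dataset | s) under the current policy, [B]
  bc_coef:                  float in [0, 1]; 0 disables BC, 1 is pure cloning
  """
  assert 0.0 <= bc_coef <= 1.0
  bc_loss = -dataset_action_log_prob  # maximize likelihood of dataset actions
  return jnp.mean(bc_coef * bc_loss + (1.0 - bc_coef) * rl_actor_loss)
```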
6 changes: 4 additions & 2 deletions contrastive_rl/lp_contrastive.py
@@ -115,6 +115,9 @@ def main(_):
      'use_random_actor': True,
      'entropy_coefficient': None if 'image' in env_name else 0.0,
      'env_name': env_name,
+      # For online RL experiments, max_number_of_steps is the number of
+      # environment steps. For offline RL experiments, this is the number of
+      # gradient steps.
      'max_number_of_steps': 1_000_000,
      'use_image_obs': 'image' in env_name,
  }
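To make the new comment concrete, here is a small hedged sketch of how the same key is budgeted in the two regimes; the environment names below are assumed placeholders, not values taken from this commit.

```python
# Hypothetical configs; only the two keys from the dict above are shown,
# and the environment names are assumed placeholders.
offline_params = {
    'env_name': 'offline_ant_umaze',    # offline task: the budget below is
    'max_number_of_steps': 500_000,     # interpreted as gradient steps
}
online_params = {
    'env_name': 'sawyer_push',          # online task: the budget below is
    'max_number_of_steps': 1_000_000,   # interpreted as environment steps
}
```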
@@ -162,14 +165,13 @@ def main(_):
  # use this mainly for debugging.
  if FLAGS.debug:
    params.update({
-        'min_replay_size': 10_000,
+        'min_replay_size': 2_000,
        'local': True,
        'num_sgd_steps_per_step': 1,
        'prefetch_size': 1,
        'num_actors': 1,
        'batch_size': 32,
        'max_number_of_steps': 10_000,
-        'samples_per_insert_tolerance_rate': 1.0,
        'hidden_layer_sizes': (32, 32),
    })

