This repository contains the Python code for reproducing our experiments conducted as part of the paper *Improving Nonlinear Projection Heads using Pretrained Autoencoder Embeddings* by Andreas Schliebitz, Heiko Tapken and Martin Atzmueller.
This project uses Poetry for managing Python dependencies (see `pyproject.toml`). Follow the steps below to install the code from this repository as a standalone `simclr_ae` Python package:
- Install Poetry:

  ```bash
  curl -sSL https://install.python-poetry.org | python3 -
  ```

- Create and activate a virtual environment:

  ```bash
  poetry shell
  ```

- Install the requirements:

  ```bash
  poetry lock
  poetry install
  ```
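As a quick sanity check, you can verify that the package is importable from within the Poetry environment (e.g. via `poetry run python`); this assumes the package is exposed under the `simclr_ae` module name shown above:

```python
# Quick sanity check (run inside the Poetry environment, e.g. `poetry run python`):
# the standalone package should be importable under the name `simclr_ae`.
import simclr_ae  # noqa: F401

print("simclr_ae installed at:", simclr_ae.__file__)
```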
This project is mainly subdivided into two experiments. In the first experiment, we train the autoencoder embeddings used to replace the input layer of SimCLR's default projection head. In the second experiment, we train and evaluate our modified projectors as part of the SimCLR framework following standard protocol.
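To make the idea concrete, here is a minimal PyTorch sketch of a SimCLR-style projection head whose input (first linear) layer is replaced by a pretrained autoencoder encoder. The class name, dimensions and the decision to freeze the embedding are illustrative assumptions, not the repository's actual implementation:

```python
import torch
from torch import nn

# Illustrative sketch only (not this repository's implementation): a SimCLR-style
# projection head whose first linear layer is replaced by a pretrained autoencoder
# encoder. Whether the embedding is frozen or fine-tuned is a design choice; it is
# frozen here purely for illustration.
class ProjectionHeadWithAEEmbedding(nn.Module):
    def __init__(self, pretrained_encoder: nn.Module, latent_dim: int, out_dim: int):
        super().__init__()
        self.embedding = pretrained_encoder
        for param in self.embedding.parameters():
            param.requires_grad = False  # keep the pretrained embedding fixed
        self.head = nn.Sequential(nn.ReLU(), nn.Linear(latent_dim, out_dim))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: backbone features that would normally enter the default projector
        return self.head(self.embedding(h))

# Dummy shapes: 512-dim backbone features -> 128-dim AE latent -> 64-dim projection
encoder = nn.Linear(512, 128)  # stand-in for a pretrained autoencoder encoder
projector = ProjectionHeadWithAEEmbedding(encoder, latent_dim=128, out_dim=64)
print(projector(torch.randn(8, 512)).shape)  # torch.Size([8, 64])
```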
In order to reproduce our results, you will have to prepare the following five image classification datasets in a way that Torchvision's `datasets` module can load them:
Note: We advise you to download and extract these datasets into this project's `datasets` directory. We also recommend using MLflow Tracking to record all training and evaluation runs. Alternatively, tracking via Lightning's `CSVLogger` is implemented and enabled by default.
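To illustrate what "loadable by Torchvision" means, the following sketch reads an extracted dataset from the `datasets` directory via `torchvision.datasets.ImageFolder`; the concrete path, split name and transform are assumptions (Torchvision's built-in dataset classes such as `CIFAR10` work analogously):

```python
from torchvision import datasets, transforms

# Illustrative example: load an extracted dataset from this project's `datasets`
# directory. The path and transform are assumptions, not fixed by the repository.
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

dataset = datasets.ImageFolder("datasets/my-dataset/train", transform=transform)
print(len(dataset), dataset.classes[:5])
```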
After that, clone this repository to a location of your choice:

```bash
git clone https://github.com/andreas-schliebitz/simclr-ae.git \
  && cd simclr-ae
```
First, train the 15 autoencoder embeddings using varying latent dimensions (128, 256, 512). We'll use these embeddings in the next section to perform our SimCLR training and evaluation runs.
- Navigate into the directory of the `ae` experiment:

  ```bash
  cd simclr_ae/experiments/ae
  ```

- Optional: If applicable, provide your MLflow Tracking credentials in the `ae` experiment's `.env` file. If you've placed the datasets into a different directory, change `DATASET_DIR` to that path.

- Execute the experiment's `run_experiments.sh` helper script. If you have multiple GPUs at your disposal, specify the GPU's ID as the first parameter; otherwise use `0` as the ID of your single GPU. The second parameter can be either a comma-separated list of latent dimensions or a single latent dimension. By default, each GPU trains the autoencoder with the specified latent dimensions on all datasets:

  ```bash
  # Train autoencoder on GPU 0 with all three latent dimensions
  ./run_experiments.sh 0 128,256,512

  # Train autoencoder on three GPUs, parallelizing over latent dimensions
  ./run_experiments.sh 0 128
  ./run_experiments.sh 1 256
  ./run_experiments.sh 2 512
  ```

- Verify that all model checkpoints, hyperparameters and metrics are written to the `logs` directory.
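With the default CSV logging, one way to perform this check is to scan the `logs` directory for `metrics.csv` files; the glob pattern below assumes Lightning's usual `CSVLogger` layout and may need adapting to the paths actually produced by the experiment:

```python
from pathlib import Path

import pandas as pd

# Illustrative check of the CSVLogger output; the directory layout is an assumption
# based on Lightning's default `logs/<name>/version_<n>/metrics.csv` convention.
for metrics_file in sorted(Path("logs").glob("**/metrics.csv")):
    df = pd.read_csv(metrics_file)
    print(metrics_file, "->", ", ".join(df.columns))
```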
Next, train and evaluate our modified projection heads as part of the SimCLR framework:

- Navigate into the directory of the `simclr_ae` experiment:

  ```bash
  cd simclr_ae/experiments/simclr_ae
  ```

- Optional: If applicable, provide your MLflow Tracking credentials in the `simclr_ae` experiment's `.env` file. If you've placed the datasets into a different directory, change `DATASET_DIR` to that path.

- Because the run IDs of the pretrained autoencoder embeddings are randomly generated, you'll have to adapt the IDs in the `run_experiments.sh` helper script of the `simclr_ae` experiment for each dataset (variables `AE_WEIGHTS_128_PATH`, `AE_WEIGHTS_256_PATH` and `AE_WEIGHTS_512_PATH`). The script will throw an error if no pretrained autoencoder checkpoint with matching latent dimensions is found for a given dataset. One way to list the available checkpoints is shown in the sketch after this list.

- Execute the experiment's `run_experiments.sh` helper script. As the first argument, provide your GPU's ID, followed by the latent dimension of SimCLR's projection space (32, 64, 128) as the second argument:

  ```bash
  # Train on a single GPU with a single latent dimension
  ./run_experiments.sh 0 32

  # Train on three GPUs with different latent dimensions
  ./run_experiments.sh 0 32
  ./run_experiments.sh 1 64
  ./run_experiments.sh 2 128
  ```

- Once again, verify that all model checkpoints, hyperparameters and metrics are written to the `logs` directory.
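As mentioned above, one way to find the checkpoint paths for `AE_WEIGHTS_128_PATH`, `AE_WEIGHTS_256_PATH` and `AE_WEIGHTS_512_PATH` is to list the checkpoints written by the `ae` experiment; the path (relative to the repository root) and the `.ckpt` suffix in this sketch are assumptions about the log layout:

```python
from pathlib import Path

# Illustrative helper for locating pretrained autoencoder checkpoints before
# editing run_experiments.sh. The `logs` layout and `.ckpt` suffix are assumptions;
# adapt the pattern to the actual output of the `ae` experiment.
for ckpt in sorted(Path("simclr_ae/experiments/ae/logs").glob("**/*.ckpt")):
    print(ckpt)
```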
You can now verify our results either by inspecting the CSV files in the `logs` directories of the `ae` and `simclr_ae` experiments or by visiting the web interface of your MLflow Tracking instance. As a basis for comparison, we provide our MLflow runs as CSV exports in the `results` directory of each experiment.
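For a quick look at these exports, the CSV files can be loaded with pandas; the glob pattern below is an assumption about where the `results` directories live relative to the repository root:

```python
from pathlib import Path

import pandas as pd

# Illustrative inspection of the exported MLflow runs; the glob pattern is an
# assumption about the location of each experiment's `results` directory.
for results_csv in sorted(Path("simclr_ae/experiments").glob("*/results/*.csv")):
    df = pd.read_csv(results_csv)
    print(results_csv, df.shape)
```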