PyTorch reimplementation of Pagliarini et al. (2021) "What does the Canary Say? Low-Dimensional GAN Applied to Birdsong"
First, clone the repository on your computer.
We recommend using a virtual environment when using this tool. You may use virtualenv or pyenv, for instance. If using conda, pay attention when performing the next steps, as some package requirements may differ.
The code should run with Python >=3.9, <=3.11.
After cloning the repository and creating a virtual environment, open a terminal and place yourself at the repository root. Activate your virtual environment (this step may differ from one virtual environment manager to another).
Now, run:
pip install -e .
This will install canarygan along with its dependencies, and add canarygan to your PATH. You will then be able to use the canarygan command line interface.
In some cases, you might want to install requirements manually. This is required if you need a specific version of PyTorch to run on your machine.
Package requirements may be found in the requirements.txt and pyproject.toml files.
You can install requirements by running the following command within the repository and a virtual environment:
pip install -r requirements.txt
Modify this file, or use pip
or conda
if you wish to install packages differently.
You may still try to run pip install -e . after this step to add canarygan to your PATH. If it does not work, replace all following invocations of the canarygan command line interface with python -m canarygan.
We deliberately keep the torch package requirement loose, but cannot guarantee that this tool will work on every machine and operating system.
This tool was developed using PyTorch 2.0.3 and runs on different Linux operating systems equipped with different hardware. It has worked with Nvidia GPUs (Quadro 4000TX, P100, A100) using CUDA 11.8.
canarygan
provides a CLI to perform major operations, such as training the GAN and generating sounds.
You can display a short description of the interface by running:
canarygan --help
You should get the following output:
❯ canarygan --help
Usage: canarygan [OPTIONS] COMMAND [ARGS]...
Options:
--help Show this message and exit.
Commands:
build-decoder-dataset Preprocess dataset for decoder training.
generate Generate canary syllables using a trained GAN...
inception Compute inception score.
sample Randomly sample GAN latent space and save...
train-decoders ESN, kNN and SVM decoders training.
train-gan Train a CanaryGAN instance.
train-inception Distributed canaryGAN inception scorer training...
umap Make many plots displaying UMAP projections of...
Note: if you installed canarygan
manually, you may have to type python -m canarygan --help
instead.
Two datasets are required: one to train the GAN and the other to train the decoders.
Both datasets must consist of WAV files containing 1 second of audio, sampled at 16000 Hz or higher. If the sampling rate is higher, audio will be downsampled to 16000 Hz automatically. Each 1-second clip must contain a single birdsong syllable rendition. The original results were obtained on a dataset of 16 different syllable types, sampled from a single canary individual.
Audio files must be organized in folders named after the canary syllable labels. We recommend that each folder contain the same number of audio samples. In the original paper, 1000 samples per type of syllable were used to train the GAN and the decoders.
The dataset structure hence resembles this:
data_dir/
|
|- label_1/
| |- ...
|- label_2/
| |- ...
...
|- label_n/
|- audio_1.wav
|- audio_2.wav
|- audio_3.wav
...
|- audio_m.wav
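As an illustration, here is a minimal sketch (not part of canarygan) that checks every WAV file in such a dataset is roughly 1 second long and sampled at 16000 Hz or higher, using only the Python standard library. The dataset root name is an assumption:

```python
import wave
from pathlib import Path

DATA_DIR = Path("data_dir")  # hypothetical dataset root, structured as above

for wav_path in sorted(DATA_DIR.glob("*/*.wav")):
    with wave.open(str(wav_path), "rb") as wav:
        sr = wav.getframerate()
        duration = wav.getnframes() / sr
    if sr < 16000 or abs(duration - 1.0) > 0.05:
        print(f"{wav_path}: {sr} Hz, {duration:.3f} s -- check this file")
```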
GAN dataset: the GAN training dataset must contain real samples only.
Decoder dataset: the decoder training dataset must include the GAN training dataset, plus GAN-generated sounds. In the original work by Pagliarini et al., 5 classes of audio were added: audio samples generated by the GAN at training epochs 15, 30, 45, and the last epoch, plus white noise samples. These samples were labeled "EARLY15", "EARLY30", "EARLY45", "OT" (Over Training), and "WN" (White Noise). Taken together, they may also be referred to as the "X" class.
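For the white noise class, here is a minimal sketch (not part of canarygan) showing one way to create 1-second, 16 kHz white noise WAV files, using numpy and the standard wave module. The output directory name is an assumption:

```python
import wave
from pathlib import Path

import numpy as np

SR = 16000                               # sampling rate used throughout this project
OUT_DIR = Path("decoder_data/WN")        # hypothetical folder for the "WN" label
OUT_DIR.mkdir(parents=True, exist_ok=True)

rng = np.random.default_rng(seed=0)
for i in range(1000):                    # 1000 samples per class, as in the paper
    noise = rng.uniform(-1.0, 1.0, SR)   # 1 second of uniform white noise
    pcm = (noise * 32767).astype(np.int16)
    with wave.open(str(OUT_DIR / f"noise_{i:04d}.wav"), "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)              # 16-bit PCM
        wav.setframerate(SR)
        wav.writeframes(pcm.tobytes())
```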
The GAN training loop is implemented using Lightning. Lightning enables distributed training strategies using multiple GPUs on multiple compute nodes. However, the training loop should also work locally on a modern, powerful computer.
From your terminal, run canarygan train-gan --help to display all GAN training options.
To train the GAN on a single machine equipped with a single GPU or CPU, you may simply launch:
canarygan train-gan -d data_dir/ -s save_dir/
The -d
option is used to specify the dataset root directory, which must be structured as previously explained. The -s
option specifies the save directory,
where all model checkpoints and training logs will be saved during training. If this directory does not exist, it will be created at
runtime.
When training in a distributed setup, several options may be used to allocate compute resources to canarygan.
canarygan train-gan -N 1 -G 2 -c 12 -d data_dir/ -s save_dir/
The -N option defines the number of compute nodes allocated to this GAN training process. This is only useful when training on a cluster. If using a single machine such as your personal computer, keep this value at 1.
The -G option defines the number of GPU devices that may be used for training, per node. Here, if we consider training on a machine equipped with 2 GPUs, we set -G to 2.
The -c option sets the number of CPU processes attached to the training loop. These processes are mainly used to speed up data loading to and from the GPUs. Here, we launch 12 processes per node.
Distributed training may dramatically speed up training. Using 4 Nvidia P100 GPUs on 2 compute nodes, 1000 epochs of training with a 16000-sample dataset take approximately 30 hours.
By default, logs are written every 100 training steps. Logs may be displayed using Tensorboard:
tensorboard --logdir save_dir/logs/tensorboard
Tensorboard is listed in the canarygan requirements and will be installed automatically alongside canarygan.
Logs are also saved as CSV files in save_dir/logs/csv.
You may change the logging frequency using the --log-every-n-steps
option.
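As an example, the CSV logs can be inspected with pandas. This is a minimal sketch; the exact file layout and column names depend on the metrics logged by the training loop:

```python
from pathlib import Path

import pandas as pd

csv_dir = Path("save_dir/logs/csv")
# Lightning's CSV logger typically writes one metrics file per training run.
for csv_file in sorted(csv_dir.rglob("*.csv")):
    df = pd.read_csv(csv_file)
    print(csv_file, "->", list(df.columns))
    print(df.tail())
```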
By default, model checkpoints are saved to disk every 15 epochs. You may change the checkpointing frequency using the
--save-every-n-epochs
option.
Two different sorts of checkpoints are produced: save_dir/checkpoints/all contains all training checkpoints saved every N epochs, while save_dir/checkpoints/last holds a copy of the most recent checkpoint.
The last checkpoint saved may be used to resume training after an interruption, using the --resume
flag:
canarygan train-gan -d data_dir/ -s save_dir/ --resume
When training several instances and saving them under the same save_dir
directory, each instance will be
automatically identified by an integer ID, or an ID provided by the user using the --version
option.
By default, --version=infer
, meaning that instances will be identified by an integer ID that will be
automatically incremented when launching a new training process, unless using --resume
, which will resume
training the last trained instance.
Once a trained GAN instance is available, syllables can be generated by providing latent vectors or randomly sampling the GAN latent space.
We recommend saving the sampled GAN latent space vectors to disk to improve reproducibility. These vectors must be stored in a Numpy array file (.npy).
To generate these samples, you may use:
canarygan sample -s save_dir/ -n 10000 -d 3
This will create a .npy
file in save_dir/
containing 10000 3-dimensional vectors. By default, the vector values
are uniformly distributed between -1 and 1.
You may change the distribution parameters using the --dist
and --dist-params
options. Run canarygan sample --help
to access documentation.
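Alternatively, you can build latent vectors yourself. A minimal sketch, assuming the generate command expects an array of shape (n_vectors, latent_dim) saved as .npy (check the output of canarygan sample to confirm the exact layout); the output file name is hypothetical:

```python
import numpy as np

n_vectors, latent_dim = 10000, 3
rng = np.random.default_rng(seed=42)

# Uniform latent vectors in [-1, 1], matching the default of `canarygan sample`.
z = rng.uniform(-1.0, 1.0, size=(n_vectors, latent_dim)).astype(np.float32)
np.save("save_dir/latent_vectors.npy", z)  # hypothetical file name
```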
To generate canary syllable samples, run:
canarygan generate -x path/to/gan.ckpt -n 10000 -s save_dir/
The -x
option is required and must point to a GAN checkpoint file obtained through training.
The -s option is also required and provides an output directory for the generated audio. Generated sounds are stored as compressed Numpy archives (.npz files) in this directory. Each archive contains the audio signal (in the subfile x.npy) and other metadata such as the corresponding latent vector (z.npy). An archive may be loaded using d = numpy.load(archive_path), and its subfiles accessed using d["x"] or d["z"].
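For instance, here is a minimal sketch loading one of these archives and writing the audio back to a WAV file. The archive file name is hypothetical, and we assume the signal is stored as floats in [-1, 1]; the 16 kHz rate follows the dataset description above:

```python
import wave

import numpy as np

d = np.load("save_dir/generation_0001.npz")  # hypothetical archive name
x, z = np.squeeze(d["x"]), d["z"]            # audio signal and latent vector
print("audio shape:", x.shape, "latent vector:", z)

# Assuming the signal is stored as floats in [-1, 1], write it to a 16 kHz WAV file.
pcm = (np.clip(x, -1.0, 1.0) * 32767).astype(np.int16)
with wave.open("generated.wav", "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)       # 16-bit PCM
    wav.setframerate(16000)   # sampling rate used throughout this project
    wav.writeframes(pcm.tobytes())
```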
The -n option is necessary if you do not wish to provide pre-computed latent vectors to the script. In that case, this option specifies the number of latent vectors to randomly sample, and thus the number of generated audio samples.
If you used canarygan sample
and wish to generate sounds from precomputed latent vectors, use:
canarygan generate -x path/to/gan.ckpt -z path/to/vectors.npy -s save_dir/
The -z
flag must point towards the Numpy archive storing the latent vectors on disk.
A decoder is used to infer the syllable type of a given audio sample. It is mainly used to classify the productions of the GAN and assess its ability to produce realistic bird sounds.
Three lightweight decoder classes are provided in canarygan: Echo State Networks (ESN), k-Nearest Neighbors (kNN), and Support Vector Machines (SVM). These classifiers perform well at sorting single syllables from the GAN training dataset while remaining simple, fast, and easy to train. They operate on preprocessed representations of the audio signals. Several preprocessing methods are available, all based on extracting spectral features from the sound. We recommend the deltas method, which computes the first and second derivatives of the audio MFCCs, a compressed time-frequency representation of the sound.
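As an illustration, here is a minimal sketch of this kind of preprocessing using librosa. The exact parameters used by canarygan's deltas method may differ (see canarygan build-decoder-dataset --help), and the file path is hypothetical:

```python
import librosa
import numpy as np

# Load a 1-second syllable at 16 kHz (path follows the dataset layout above).
y, sr = librosa.load("data_dir/label_1/audio_1.wav", sr=16000)

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # MFCCs
d1 = librosa.feature.delta(mfcc, order=1)           # first derivative
d2 = librosa.feature.delta(mfcc, order=2)           # second derivative

features = np.concatenate([mfcc, d1, d2], axis=0)   # shape: (3 * n_mfcc, n_frames)
print(features.shape)
```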
The decoder training dataset usually contains generated samples from the GAN's early training steps. These samples act as a "garbage class" into which we expect all poorly realistic sounds to be sorted. Which epochs to use may depend on your GAN's performance; the notebooks provided in this repository can be used to graphically assess GAN quality.
We recommend using samples from epochs 15, 30, and 45 as garbage examples, as we can safely assume that the GAN has not reached convergence within its first 50 epochs of training. In addition, we also added white noise samples so that overly noisy or high-entropy sounds can be discarded.
As preprocessing might take time, we also recommend performing data transformations once before training, using:
canarygan build-decoder-dataset -d data_dir/ -s save_dir/
This command will take the dataset in data_dir/ and output its preprocessed representation in save_dir/.
Running canarygan build-decoder-dataset --help will display all available preprocessing options. Default options are set to the ones that gave the best results in our setup.
The preprocessed dataset will be saved as a Numpy archive file storing training and test data (the dataset split occurs at this step and may be modified using the --split option), alongside a YAML file where all preprocessing parameters are saved.
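If you want to check which preprocessing parameters were recorded, the YAML file can be read directly. A minimal sketch; the file name is an assumption, use whatever YAML file was written next to the preprocessed archive:

```python
import yaml

# Hypothetical file name; point this at the YAML file produced by build-decoder-dataset.
with open("preprocessed_dir/preprocessing.yml") as f:
    params = yaml.safe_load(f)
print(params)
```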
Once the dataset has been preprocessed, you may run the decoders training loop using:
canarygan train-decoders -s save_dir/ -p preprocessed_dir/
where save_dir/
will hold trained model checkpoints, saved as a pickled file, alongside training and testing metrics, and preprocessed_dir/
holds the preprocessed dataset (Numpy archive and YAML file).
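If you want to inspect a trained decoder outside of the CLI, it can be loaded like any pickled object. A minimal sketch; the checkpoint file name and the decoder's exact interface are assumptions:

```python
import pickle

# Hypothetical checkpoint name; decoders are saved as pickled files in save_dir/.
with open("save_dir/decoder_esn.pkl", "rb") as f:
    decoder = pickle.load(f)
print(type(decoder))
```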
By default, all available decoders will be trained. If you wish to train only a subset of decoders,
you can do so by using the -m
flag:
canarygan train-decoders -s save_dir/ -p preprocessed_dir/ -m esn -m knn
This will only train an ESN and a KNN decoder.
If you have not preprocessed data beforehand, you may also point the -d
flag towards the
raw audio dataset, and use all other options to change the preprocessing parameters. This, however,
is not recommended.
Once trained decoders are available, GAN-generated sounds can be labeled.
You can also generate and decode sounds at the same time by giving decoder checkpoints as
input to canarygan generate
using the -y
option:
canarygan generate -x path/to/gan.ckpt -z path/to/vectors.npy -y path/to/decoder1 -y path/to/decoder2 -s save_dir/
This will add y entries to the generated sound Numpy archives. As preprocessing happens at the same time as decoding, you may provide the decoders' preprocessing parameters using the YAML file produced by canarygan build-decoder-dataset, through the -p option:
canarygan generate -x path/to/gan.ckpt -z path/to/vectors.npy \
-y path/to/decoder1 \
-y path/to/decoder2 \
-s save_dir/ \
-p path/to/preprocessing.yml
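After decoding, the decoder outputs can be read back from the same archives. A minimal sketch; the archive name is hypothetical, and the exact key names of the added entries should be checked via the archive's files attribute:

```python
import numpy as np

d = np.load("save_dir/generation_0001.npz")  # hypothetical archive name
print(d.files)  # lists every entry, including those added by the decoders
```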
For another perspective on GAN generation quality, UMAP projection can be applied to generated sounds to obtain an unsupervised glimpse of sound plausibility.
canarygan umap -d data_dir/ -g generated_audio_dir/ --epoch x --version y -s save_dir/
This will produce various plots displaying UMAP projections of real and generated canary syllables, colored by inferred class. Syllable classes will also be computed using HDBSCAN clustering on the UMAP projections, and saved into the generated sound Numpy archives.
Here, the -d
option may be used to point toward the GAN training dataset (containing ground
truth samples of syllables), and the -g
option must point towards the directory holding all
generated sounds (the save_dir
of the canarygan generate
script). This directory may hold
many different generations from different GAN instances at different training epochs. You can
choose which version and epoch to plot using the --epoch
and --version
parameters, where
x
and y
are integers for training epoch and version ID. All plots will then be saved in
the save_dir/
directory.
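To give an idea of what this command does under the hood, here is a minimal, stand-alone sketch of UMAP projection followed by HDBSCAN clustering on a feature matrix, using the umap-learn and hdbscan packages. The feature file and parameters are assumptions; the actual features and settings used by canarygan may differ:

```python
import hdbscan
import numpy as np
import umap

# X: one row of spectral features per sound (real and generated), e.g. flattened MFCCs.
X = np.load("features.npy")  # hypothetical precomputed feature matrix

reducer = umap.UMAP(n_components=2, random_state=0)
embedding = reducer.fit_transform(X)          # 2D projection of every sound

clusterer = hdbscan.HDBSCAN(min_cluster_size=50)
labels = clusterer.fit_predict(embedding)     # -1 marks points treated as noise
print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
```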