Merge pull request #1 from gperdrizet/CAA
CAA
gperdrizet authored Aug 22, 2024
2 parents 6c197ec + 695063c commit aac82ac
Showing 2,024 changed files with 648 additions and 28,004 deletions.
15 changes: 15 additions & 0 deletions CAA_artifacts/README.md
@@ -0,0 +1,15 @@
# Skylines CAA artifacts

This project was originally conceived in ~2020 as a method to gain intuition about convolution, GANs and generated imagery via experimentation. The neural networks were designed to be trainable on easily accessible hardware (e.g. an NVIDIA K80) with small datasets, and to illustrate what sort of 'imagination' a machine is capable of. The original generation targets included flowers, city skylines, nebulae, hands, fruit and birds.

The project was 'completed' and shared via Instagram and Twitter in 2020 and 2021 under the name random_praxis_memory (user: @floraxx on both platforms).

In 2024, the project was revived in collaboration with Laura Perdrizet, Assistant Professor of Fine Art at the University of Mount St. Vincent. Professor Perdrizet used the GANN's output to spark critical dialogue about 'AI' and to incorporate machine learning, both as a creative tool and as a source of art material, in her studio art courses.

The resulting artworks and analysis are impressive and aspirational, illustrating the potential for advanced machine learning techniques in human creative endeavors. The collaboration is ongoing; preliminary results were presented by Professor Perdrizet at the 2024 College Art Association conference in Chicago, IL.

This branch of the skylines repository serves as an archive of artifacts created for and relevant to the collaboration.

## Links

1. [2024 CAA presentation](http://www.lauraelaynemiller.com/research)
18 changes: 18 additions & 0 deletions CAA_artifacts/generation_strategy.md
@@ -0,0 +1,18 @@
# Generation strategy

Skylines uses an in-house, *de novo* trained convolutional neural network. Both the dataset and model architecture were custom designed and built to illustrate mechanical 'imagination'.

## 1. Data source

Images of city skylines were manually curated from Google image search. The resulting set of images was then scaled to 1024 x 1024 pixels and mirrored to increase dataset diversity.
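
As a rough illustration of that preprocessing step, here is a minimal sketch using Pillow. The directory names are placeholders, not the project's actual paths; the real pipeline lives in skylines/generate_data.py.

```python
# Minimal preprocessing sketch (assumes Pillow; paths are illustrative)
from pathlib import Path
from PIL import Image, ImageOps

RAW_DIR = Path('raw_images')
OUT_DIR = Path('training_images')
OUT_DIR.mkdir(parents=True, exist_ok=True)

for i, path in enumerate(sorted(RAW_DIR.glob('*.jpg'))):
    image = Image.open(path).convert('RGB').resize((1024, 1024))
    image.save(OUT_DIR / f'{i}.jpg')                     # scaled original
    ImageOps.mirror(image).save(OUT_DIR / f'{i}_m.jpg')  # horizontal mirror
```

Mirroring doubles the effective dataset size without introducing new content, which matters for a small, manually curated image set.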

## 2. Training

Training was conducted via a generative adversarial strategy. Two models were constructed: one for generation and one for discrimination. The discriminator takes an image as input and returns a 'yes' or 'no' answer to the question, 'Did this image come from the dataset of real images?'. The generator takes a list of 100 random numbers as input and uses convolution to convert them into a 1024 x 1024 pixel RGB image. The training loop is as follows (a minimal code sketch appears after the summary below):

1. The generator is fed a set of randomly generated input points and creates an output image from them.
2. A generated image or a real skyline image is given to the discriminator model which scores how likely the image is to have come from the 'real' city skylines image set.
3. The discriminator's neural net is updated to give better answers - i.e. it is penalized for being wrong and rewarded for being right.
4. The generator's neural net is updated to make better fakes - i.e. if the discriminator was 'fooled' by the generated image the generator is rewarded and if not, it is penalized.

The result is a very large and complex equation that takes a list of 100 numbers and, through a long chain of calculations, produces three new 1024 x 1024 arrays of numbers. These arrays resemble a city skyline when formatted and displayed as an RGB image.
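
For readers who want to map the loop above onto code, here is a minimal single-batch training step in TensorFlow. It is a sketch, not the project's implementation: `generator`, `discriminator` and the optimizers are assumed stand-ins, and the real training code adds checkpointing, logging and multi-GPU handling.

```python
# Sketch of one GAN training step (TensorFlow); models/optimizers assumed
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()

@tf.function
def train_step(generator, discriminator, g_opt, d_opt, real_images,
               latent_dim=100):
    # Step 1: generate fakes from random latent points
    noise = tf.random.normal([tf.shape(real_images)[0], latent_dim])
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fakes = generator(noise, training=True)

        # Step 2: discriminator scores real and generated images
        real_scores = discriminator(real_images, training=True)
        fake_scores = discriminator(fakes, training=True)

        # Step 3: discriminator is rewarded for telling real from fake
        d_loss = (bce(tf.ones_like(real_scores), real_scores) +
                  bce(tf.zeros_like(fake_scores), fake_scores))

        # Step 4: generator is rewarded when the discriminator is fooled
        g_loss = bce(tf.ones_like(fake_scores), fake_scores)

    d_opt.apply_gradients(zip(
        d_tape.gradient(d_loss, discriminator.trainable_variables),
        discriminator.trainable_variables))
    g_opt.apply_gradients(zip(
        g_tape.gradient(g_loss, generator.trainable_variables),
        generator.trainable_variables))
    return d_loss, g_loss
```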
98 changes: 98 additions & 0 deletions CAA_artifacts/notes.md
@@ -0,0 +1,98 @@
# Notes

## 1. Thoughts

### 1.1

This type of project is hard to experiment with - each new tweak requires a large amount of both disk space and time:

```text
$ du -sh data/*
14G ./gan_output
400M ./image_datasets
1.8G ./specimens
5.0T ./training_checkpoints
```

That's from just one short-ish run with a single set of configuration variables, generated over the course of ~4 days.

### 1.2

Each trained model is a more or less 'good' method for finding city-ness in an n=100 vector of random numbers.

### 1.3

The generated cities have a concept of nearness or similarity which is not physical or aesthetic, e.g. city [1.1, 1.2,...] is closer to city [1.2, 1.2,...] than it is to city [1.8, 1.2,...].
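
In other words, nearness is just distance in the 100-dimensional latent space. A toy NumPy illustration using hypothetical two-dimensional slices of those vectors:

```python
import numpy as np

a = np.array([1.1, 1.2])  # first two dimensions of three hypothetical cities
b = np.array([1.2, 1.2])
c = np.array([1.8, 1.2])

print(np.linalg.norm(a - b))  # ~0.1: a and b should render similar skylines
print(np.linalg.norm(a - c))  # ~0.7: a and c are farther apart
```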

### 1.4

The model translates 100 random numbers into roughly 3.1 million numbers (1024 x 1024 pixels x 3 color channels = 3,145,728 values), which resemble a city when formatted as a JPEG image.

## 2. Training scratch

### 2.1. 2024-02-11 run

The learning rates were initially set at 0.000025, manually lowered to 0.00001 at 7000 batches, then raised to 0.0001 at around 7800 batches because the models had failed to progress visually. Prior archived runs used 0.0001 with good results.

The models were trained to ~16k batches with hardly any progress. Switching to a fast learning rate after the models had already started to converge does not seem to have helped. In the future, try starting with a large learning rate and decreasing it as training progresses.
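
One way to act on that idea without editing rates mid-run is a decaying schedule. Below is a sketch using Keras' built-in exponential decay; the starting rate echoes the 0.0001 that worked in archived runs, but the decay interval is an untested assumption:

```python
import tensorflow as tf

# Start fast and halve the learning rate every 5000 batches (assumed interval)
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.0001,
    decay_steps=5000,
    decay_rate=0.5)

optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)
```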

Generated the training video out to about 9700 frames for archival purposes, but didn't keep any other artifacts.

### 2.2. 2024-02-17 run

By about batch 18000 it was apparent that the model was flopping around: it was still generating some interesting results, but not really making progress, getting better and then worse again on a scale of about 100 batches. So at 19000 batches training was stopped, the learning rates were adjusted from 0.00005 to 0.000025, and training was restarted.

After training to just over 20000 batches it became apparent that halving the learning rate did not help significantly: the models made no visual progress, the GAN loss skyrocketed and the d2 loss went to zero.

Another issue is size on disk: model checkpoints are being saved after every batch, and the accumulated checkpoints occupy almost 10 TB. To train further, additional disk space is needed. The plan is to stop training temporarily and generate frames for training videos for a number of interesting latent points up to 19000 batches. Then, a few earlier model checkpoints can be manually curated for the archive and the rest deleted. This will free up space to train for significantly longer.

#### Training frame sequences

Latent points for training sequences were chosen based on the following specimens:

1. 16500.28 - complete to 19000 frames, video finished
2. 18218.29 - complete to 19000 frames, video finished
3. 18218.3 - complete to 19000 frames, video finished
4. 16500.21 - complete to 19000 frames, video finished
5. 18218.11 - complete to 19000 frames, video finished
6. 18218.6 - complete to 19000 frames, video finished

OK, now that we have good documentation and training videos for the current state of the model, let's get rid of some of the earlier checkpoints to free up disk space.

The plan is to delete checkpoints such that we keep only one every 10 batches instead of every batch, giving a 10-fold reduction in size on disk:

```text
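# Delete checkpoint directories whose batch number ends in 1-9,
# keeping only those ending in 0 (i.e. every 10th checkpoint):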
rm -r discriminator_model_f00*1
rm -r discriminator_model_f00*2
rm -r discriminator_model_f00*3
rm -r discriminator_model_f00*4
rm -r discriminator_model_f00*5
rm -r discriminator_model_f00*6
rm -r discriminator_model_f00*7
rm -r discriminator_model_f00*8
rm -r discriminator_model_f00*9
```

This leaves any discriminator model checkpoints whose batch number ends in '0'; the same is done for the generator models. It gives us some leeway to restart training at earlier timesteps and preserves some of the partially trained models.

This reduced the checkpoints from over 9 TB to around 900 GB. If needed, we can apply the same approach again, keeping only every hundredth or thousandth checkpoint; a Python sketch of that follows.
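
A sketch of that more aggressive cleanup in Python; the checkpoint directory layout and name format are assumptions inferred from the glob patterns above:

```python
import re
import shutil
from pathlib import Path

KEEP_EVERY = 100  # or 1000 for an even smaller archive
CHECKPOINT_DIR = Path('data/training_checkpoints/2024-02-17')  # assumed path

for path in sorted(CHECKPOINT_DIR.glob('*_model_f*')):
    match = re.search(r'_f(\d+)$', path.name)
    if match and int(match.group(1)) % KEEP_EVERY != 0:
        shutil.rmtree(path)  # drop everything but every KEEP_EVERY-th one
```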

Today is 2024-03-13; let's resume training on our GTX 1070. Here are the run settings from config.py:

```text
GPUS=[
'/job:localhost/replica:0/task:0/device:GPU:0'
]
GPU_PARALLELISM=None
LATENT_DIM=100
DISCRIMINATOR_LEARNING_RATE=0.000025
GENERATOR_LEARNING_RATE=0.000025
GANN_LEARNING_RATE=0.000025
BATCH_SIZE=3
EPOCHS=100000
CHECKPOINT_SAVE_FREQUENCY=1
```

Trained to ~24,000 batches; the model made no visual progress. It locked up with a large d2 loss and zero g loss after the first few batches. The learning rate was either too large or too small.
2 changes: 2 additions & 0 deletions README.md
@@ -1 +1,3 @@
# skylines

![Skylines banner](https://github.com/gperdrizet/skylines/blob/CAA/CAA_artifacts/2022-03-23_specimens/skylines_banner.jpg)
2 changes: 1 addition & 1 deletion generate_data.sh
@@ -3,4 +3,4 @@
# Convenience script to generate augmented data
# from images in raw_images dir.

-python ./skylines/data_aug.py
+python ./skylines/generate_data.py
10 changes: 5 additions & 5 deletions generate_GANN_images.sh → generate_specimens.sh
@@ -3,9 +3,9 @@
# Convenience script to generate images from a trained model

# Which model to use/how many images to make
-RUN_DATE='2022-03-23'
-MODEL_CHECKPOINT=20500
-NUM_IMAGES=10
+RUN_DATE='2024-02-17'
+MODEL_CHECKPOINT=18218
+NUM_IMAGES=30

# Set LD_LIBRARY_PATH
export LD_LIBRARY_PATH=`pwd`/.venv/lib/
@@ -22,10 +22,10 @@ export TCMALLOC_LARGE_ALLOC_REPORT_THRESHOLD=38654705664
export TF_CPP_MIN_LOG_LEVEL=2

# Set visible GPUs
-export CUDA_VISIBLE_DEVICES=0
+export CUDA_VISIBLE_DEVICES=1

# Setting TF_FORCE_GPU_ALLOW_GROWTH to 'true' prevents TensorFlow from
# mapping all GPU memory up front; 'false' maps it all at startup
export TF_FORCE_GPU_ALLOW_GROWTH=false

# Make images
-python ./skylines/generate.py $RUN_DATE $MODEL_CHECKPOINT $NUM_IMAGES
+python ./skylines/generate_specimens.py $RUN_DATE $MODEL_CHECKPOINT $NUM_IMAGES
40 changes: 40 additions & 0 deletions make_training_frames.sh
@@ -0,0 +1,40 @@
#!/bin/bash

# Convenience script to generate video of training
# from model checkpoints and latent point

# Which run date and specimen latent point to use
RUN_DATE='2024-02-17'
SPECIMEN_LATEN_POINT='18218.6'

# Resume or add to a previous frame generation run
RESUME='False'

# Frame number to resume from. Used as the index of the model in the
# model paths list and as the number for frame output. This allows the
# generation of sequentially numbered frames from non-sequential
# model snapshots
RESUME_FRAME='0'

# Set LD_LIBRARY_PATH
export LD_LIBRARY_PATH=`pwd`/.venv/lib/
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:`pwd`/.venv/lib/python3.8/site-packages/tensorrt/

# Increase tcmalloc report threshold to 36 GB
export TCMALLOC_LARGE_ALLOC_REPORT_THRESHOLD=38654705664

# Set Tensorflow log level if desired:
# 0 - all logs shown
# 1 - filter out INFO logs
# 2 - also filter out WARNING logs
# 3 - also filter out ERROR logs
export TF_CPP_MIN_LOG_LEVEL=3

# Set visible GPUs
export CUDA_VISIBLE_DEVICES=1

# Setting TF_FORCE_GPU_ALLOW_GROWTH to 'true' prevents TensorFlow from
# mapping all GPU memory up front; 'false' maps it all at startup
export TF_FORCE_GPU_ALLOW_GROWTH=false

# Make images
python ./skylines/make_training_frames.py $RUN_DATE $SPECIMEN_LATEN_POINT $RESUME $RESUME_FRAME
11 changes: 11 additions & 0 deletions make_training_video.sh
@@ -0,0 +1,11 @@
#!/bin/bash

# Generate video from sequence of training stills
RUN_DATE='2024-02-17'
SPECIMEN_LATEN_POINT='18218.6'

# Makes video with frame number annotation
ffmpeg -r 60 -i ./skylines/data/${RUN_DATE}/specimens/${SPECIMEN_LATEN_POINT}_training_sequence/%d.jpg -pix_fmt yuv420p -c:v libx265 -vf "fps=60, drawtext=fontfile=/usr/share/fonts/truetype/dejavu/DejaVuSansMono-Bold.ttf: text='%{frame_num}': fontcolor=white: fontsize=60" ./CAA_artifacts/${RUN_DATE}_${SPECIMEN_LATEN_POINT}_training_frame_number.mp4

# Makes video without frame number annotation
ffmpeg -r 60 -i ./skylines/data/${RUN_DATE}/specimens/${SPECIMEN_LATEN_POINT}_training_sequence/%d.jpg -c:v libx265 -vf fps=60 -pix_fmt yuv420p ./CAA_artifacts/${RUN_DATE}_${SPECIMEN_LATEN_POINT}_training_no_frame_number.mp4
6 changes: 6 additions & 0 deletions setup_notes.md
@@ -144,3 +144,9 @@ This one is pretty specific to our hardware configuration, but might be useful t
# Fast scratch bind mount for skylines project
/mnt/fast_scratch/skylines /mnt/arkk/rpm/skylines/skylines/skylines/data none x-systemd.requires=/mnt/fast_scratch,x-systemd.requires=/mnt/arkk,x-systemd.automount,bind 0 0
```

## 6. Removing large files from git tracking

```text
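# Warning: this rewrites history for the whole branch; a force push is
# required afterwards, and collaborators must re-clone or reset their copies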
git filter-branch --tree-filter 'rm -f path/to/big/file' HEAD
```
15 changes: 15 additions & 0 deletions skylines/benchmarking/training_benchmark_data.csv
@@ -0,0 +1,15 @@
GPU parallelism,GPUs,Batch size,Training time (sec.)
central storage,2,6,74.6561291217804
central storage,2,6,6.395625829696655
central storage,2,6,6.365304231643677
central storage,2,6,6.44519829750061
central storage,2,6,6.371362924575806
central storage,2,6,6.436369895935059
central storage,2,6,6.361017227172852
central storage,2,6,6.358325719833374
central storage,2,6,6.504878282546997
central storage,2,6,6.458540201187134
central storage,2,6,6.538267612457275
central storage,2,6,6.4554502964019775
central storage,2,6,6.476968050003052
central storage,2,6,6.528106689453125
70 changes: 31 additions & 39 deletions skylines/config.py
@@ -1,65 +1,57 @@
import os
from datetime import datetime

-# Project name and run date
-PROJECT_NAME = 'skylines'
-CURRENT_DATE = datetime.today().strftime('%Y-%m-%d')
-
-########################################################################
-# Option to resume a training run ######################################
-########################################################################
-
-RESUME = False
-RESUME_RUN_DATE = '2024-02-08'
+# Project name
+PROJECT_NAME='skylines'
+CURRENT_DATE=datetime.today().strftime('%Y-%m-%d')

########################################################################
# Paths and directories ################################################
########################################################################

# Get path to this config file, we will use this
# to define other paths to data files etc.
-path = os.path.dirname(os.path.realpath(__file__))
+PATH=os.path.dirname(os.path.realpath(__file__))

-# Use current date or resume data in file paths as needed
-
-if RESUME == True:
-    path_date = RESUME_RUN_DATE
-
-elif RESUME == False:
-    path_date = CURRENT_DATE
-
-IMAGE_DIR = f'{path}/data/image_datasets'
-RAW_IMAGE_DIR = f'{IMAGE_DIR}/raw_images'
-PROCESSED_IMAGE_DIR = f'{IMAGE_DIR}/training_images'
-TRAINING_IMAGE_DIR = PROCESSED_IMAGE_DIR
-MODEL_CHECKPOINT_DIR = f'{path}/data/training_checkpoints/{path_date}'
-SPECIMEN_DIR = f'{path}/data/specimens/{path_date}'
-IMAGE_OUTPUT_DIR = f'{path}/data/gan_output/{path_date}'
+IMAGE_DIR=f'{PATH}/data/image_datasets'
+RAW_IMAGE_DIR=f'{IMAGE_DIR}/raw_images'
+PROCESSED_IMAGE_DIR=f'{IMAGE_DIR}/training_images'
+TRAINING_IMAGE_DIR=PROCESSED_IMAGE_DIR
+# MODEL_CHECKPOINT_DIR=f'{path}/data/{path_date}/training_checkpoints'
+# SPECIMEN_DIR=f'{path}/data/{path_date}/specimens'
+# IMAGE_OUTPUT_DIR=f'{path}/data/{path_date}/gan_output'
+BENCHMARK_DATA_DIR=f'{PATH}/benchmarking'

########################################################################
# Data related parameters ##############################################
########################################################################

-MAX_CONCURRENCY = 2
-IMAGE_DIM = 1024
-SHUFFLE_BUFFER = 50
+MAX_CONCURRENCY=2
+IMAGE_DIM=1024
+SHUFFLE_BUFFER=50

########################################################################
# dc-gann parameters ###################################################
########################################################################

-GPUS = [
+GPUS=[
    '/job:localhost/replica:0/task:0/device:GPU:0',
    '/job:localhost/replica:0/task:0/device:GPU:1'
]

-GPU_PARALLELISM = 'central storage'
-LATENT_DIM = 100
-DISCRIMINATOR_LEARNING_RATE = 0.0001 #0.00005
-GENERATOR_LEARNING_RATE = 0.0001 #0.00005
-GANN_LEARNING_RATE = 0.0001 #0.00005
-BATCH_SIZE = int(2 * len(GPUS))
-EPOCHS = 100000
-
-CHECKPOINT_SAVE_FREQUENCY = 5
+# Note: skylines run 1 and 2 used 4 GPUs so actual batch size was
+# 4x3 = 12 rather than 2*3 = 6. This seems to be the only major
+# difference between the original runs and now.
+#
+# Learning rate: skylines.1 = 0.00005, skylines.2 = 0.0001
+
+GPU_PARALLELISM='central storage'
+LATENT_DIM=100
+DISCRIMINATOR_LEARNING_RATE=0.000025
+GENERATOR_LEARNING_RATE=0.000025
+GANN_LEARNING_RATE=0.000025
+BATCH_SIZE=int(3 * len(GPUS))
+EPOCHS=100000
+
+CHECKPOINT_SAVE_FREQUENCY=1