WNet encoder graph
dorianb committed Dec 17, 2018 · 0 parents · commit 9d3d909
Showing 63 changed files with 6,422 additions and 0 deletions.
15 changes: 15 additions & 0 deletions .gitignore
@@ -0,0 +1,15 @@

build
venv
*.log
*.json
.idea

data/*
!data/.gitkeep

metadata/summaries/*
!metadata/summaries/.gitkeep

metadata/checkpoints/*
!metadata/checkpoints/.gitkeep
165 changes: 165 additions & 0 deletions README.md
@@ -0,0 +1,165 @@
# Machine learning toolkit

This package is a personal project for research purposes. It covers machine learning model implementations and
processing functions. The models are implemented and deployed mainly with Google technologies such as TensorFlow and
Google Cloud Platform utilities.

Python 2.7 has been chosen to ensure compatibility between components. The project is structured with Distutils as the
build automation tool. Tests are implemented for each module using unittest.
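
For instance, the test suite can be run from the project root (assuming the tests follow unittest's default discovery layout, which is not shown here):
```
$ python -m unittest discover
```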

## Installation procedure

Installation of third-party libraries:
```
$ pip install -r requirements.txt
```

Installation of the modules:
```
$ python setup.py install
```

## Packages

### cloud_tools

The cloud_tools package allows deploying complex machine learning pipelines to the cloud.

#### /gcp

Utilities for machine learning on Google Cloud Platform. Processing is performed with Google Dataflow (Apache Beam), and
model training and prediction are performed with Google ML Engine.

##### /example

Transfer learning with the flowers dataset: https://cloud.google.com/ml-engine/docs/tensorflow/flowers-tutorial

The example shows how to use Dataflow and ML Engine to preprocess image data and then apply a classification model. The
purpose is to classify images using transfer learning.

First, let us define the environment variables:
```
$ cd src/cloud_tools/gcp
$ . example/env_variable.sh
```
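
The commands below rely on the variables exported by this script ($DICT_FILE, $GCS_PATH, $BUCKET_NAME, $REGION, $JOB_NAME, $MODEL_NAME and $VERSION_NAME). As a rough sketch only, following the layout of the original flowers tutorial (the bucket, job and model names here are assumptions, not the repository's actual values), the script presumably defines something along these lines:
```
# Hypothetical sketch of example/env_variable.sh -- values are assumptions
declare -r PROJECT=$(gcloud config list project --format "value(core.project)")
declare -r BUCKET_NAME="gs://${PROJECT}-ml"
declare -r REGION="us-central1"
declare -r JOB_NAME="flowers_${USER}_$(date +%Y%m%d_%H%M%S)"
declare -r GCS_PATH="${BUCKET_NAME}/${USER}/${JOB_NAME}"
declare -r DICT_FILE="gs://cloud-ml-data/img/flower_photos/dict.txt"
declare -r MODEL_NAME="flowers"
declare -r VERSION_NAME="v1"
```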

To run the preprocessing pipeline on the evaluation data, execute the following command:
```
$ python example/image_preprocess.py --input_dict "$DICT_FILE" --input_path "gs://cloud-ml-data/img/flower_photos/eval_set.csv" --output_path "${GCS_PATH}/preproc/eval" --cloud
```

For the training data, use the following command:
```
$ python example/image_preprocess.py --input_dict "$DICT_FILE" --input_path "gs://cloud-ml-data/img/flower_photos/train_set.csv" --output_path "${GCS_PATH}/preproc/train" --cloud
```

Now that the embeddings of the training and evaluation sets are stored on Cloud Storage, we can train our model
(use admin privileges if necessary):
```
$ gcloud ml-engine jobs submit training "$JOB_NAME" \
--stream-logs \
--module-name example.image_classification_task \
--package-path example \
--staging-bucket "$BUCKET_NAME" \
--region "$REGION" \
--runtime-version=1.4 \
-- \
--output_path "${GCS_PATH}/training" \
--eval_data_paths "${GCS_PATH}/preproc/eval*" \
--train_data_paths "${GCS_PATH}/preproc/train*"
```

You can follow the training progress with TensorBoard:
```
$ tensorboard --logdir=$GCS_PATH/training
```

Create the model on ML Engine:
```
$ gcloud ml-engine models create "$MODEL_NAME" \
--regions "$REGION"
```

Deploy the model for prediction:
```
$ gcloud ml-engine versions create "$VERSION_NAME" \
--model "$MODEL_NAME" \
--origin "${GCS_PATH}/training/model" \
--runtime-version=1.4
```

Make a prediction from an image:
```
$ cd data/flowers
$ python -c 'import base64, sys, json; img = base64.b64encode(open(sys.argv[1], "rb").read()); print json.dumps({"key":"0", "image_bytes": {"b64": img}})' daisy.jpg &> request.json
$ gcloud ml-engine predict --model ${MODEL_NAME} --json-instances request.json
```
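
If everything is set up correctly, the response should resemble something like the following (this output format is an assumption based on the original flowers tutorial; the exact scores will vary):
```
KEY  PREDICTION  SCORES
0    1           [0.9980, 0.0006, 0.0001, 0.0011, 0.0002]
```
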
##### /image_classification

Train a VGG image classification model on the preprocessed natural images dataset:
```
$ gcloud ml-engine jobs submit training "$JOB_NAME" \
--stream-logs --module-name image_classification.image_classification_task \
--package-path image_classification \
--staging-bucket "$BUCKET_NAME" --region "$REGION" \
--runtime-version=1.4 \
-- \
--output_path "${BUCKET_NAME}/model/vgg_16/natural_images/training" \
--eval_data_paths "${GCS_PATH}/validation*" \
--train_data_paths "${GCS_PATH}/training*"
```


### model

#### /computer_vision

Computer vision models.

#### /ner

Named Entity Recognition models.

#### /rnn

Recurrent Neural Networks.

### processing

Processing components.

#### /dataset_utils

The dataset_utils module aims at normalizing access to datasets. From loading a dataset in memory to train-test
splitting, the module exposes all the utilities needed by several kinds of machine learning models.
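
As a minimal usage sketch, based on the constructor calls found in the exec scripts (the dataset path below is a placeholder):
```
from dataset_utils.classImageClassificationDataset import ImageClassificationDataset

# Split a directory of labelled images into training/validation/test sets
dataset = ImageClassificationDataset(
    "data/natural_images",  # placeholder path
    train_size=0.7, val_size=0.2, test_size=0.1)

labels = dataset.labels                   # mapping from class id to class name
training_examples = dataset.training_set
validation_examples = dataset.validation_set
```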

#### /variable_selection

Variable selection, i.e. the process of reducing the number of variables used by a model, is often a good way to
improve a model's stability and performance. This module implements several such methods so that they can be used
with all kinds of models.
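
The module's own API is not documented in this README; purely as a concept-level illustration (the function below is hypothetical and not part of this module), a simple correlation-based filter could look like:
```
import pandas as pd

# Hypothetical illustration of variable selection -- not this module's API:
# drop one feature from each pair whose absolute correlation exceeds a threshold.
def select_by_correlation(X, threshold=0.9):
    corr = X.corr().abs()
    cols = corr.columns
    to_drop = set()
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold:
                to_drop.add(cols[j])
    return X.drop(sorted(to_drop), axis=1)
```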

## Data

### Caltech-256

The Caltech-256 dataset contains around 30,000 images spanning 256 object categories. Images are in JPEG format, and
each object category contains at least 80 images.

For further information, follow this link: www.vision.caltech.edu/Image_Datasets/Caltech256/

### Natural images

This dataset contains 6,899 images from 8 distinct classes
compiled from various sources. The classes include airplane,
car, cat, dog, flower, fruit, motorbike and person.

For further information, follow this link: https://www.kaggle.com/prasunroy/natural-images

## Metadata

## Documentation

## Executable

Empty file added data/.gitkeep
Empty file.
Binary file added doc/Convolution_network.pdf
Binary file not shown.
Binary file added doc/LSTM.pdf
Binary file not shown.
Binary file added doc/Unsupervised image segmentation.pdf
Binary file not shown.
Binary file added doc/VGG16-architecture-16.ppm
Binary file not shown.
Binary file added doc/Variable selection for regression.pdf
Binary file not shown.
Binary file not shown.
62 changes: 62 additions & 0 deletions exec/dataset_building_exec.py
@@ -0,0 +1,62 @@
import argparse
import logging
import os
import traceback
import pandas as pd

from dataset_utils.classImageClassificationDataset import ImageClassificationDataset

parser = argparse.ArgumentParser(description='Dataset building program')
parser.add_argument('--dataset-path', type=str, help='Path to the dataset', default=".")
parser.add_argument('--train-size', type=float, help='Training set size', default=0.7)
parser.add_argument('--validation-size', type=float, help='Validation set size', default=0.2)
parser.add_argument('--test-size', type=float, help='Test set size', default=0.1)
parser.add_argument('--name', type=str, help='The unique name of the program', default="")
parser.add_argument('--debug', type=int, help='Debug mode', default=0)
args = parser.parse_args()

logger = logging.Logger("dataset_building_exec",
level=logging.DEBUG if args.debug else logging.INFO)

consoleHandler = logging.StreamHandler()
consoleHandler.setLevel(logging.DEBUG if args.debug else logging.INFO)
logger.addHandler(consoleHandler)

fileHandler = logging.FileHandler("dataset_building_exec.log")
fileHandler.setLevel(logging.DEBUG)
logger.addHandler(fileHandler)

assert os.path.isdir(args.dataset_path), "{0} is not a valid directory".format(args.dataset_path)

try:

    classes_path = os.path.join(args.dataset_path, "classes.csv")
    training_set_path = os.path.join(args.dataset_path, "training_set.csv")
    validation_set_path = os.path.join(args.dataset_path, "validation_set.csv")
    test_set_path = os.path.join(args.dataset_path, "test_set.csv")

    # Remove any previously generated CSV files (they may not exist on a first run)
    for previous_file in (classes_path, training_set_path,
                          validation_set_path, test_set_path):
        if os.path.isfile(previous_file):
            os.remove(previous_file)

    dataset_1 = ImageClassificationDataset(
        args.dataset_path, train_size=args.train_size,
        val_size=args.validation_size, test_size=args.test_size,
        absolute_path=False)

    # Class 0 is reserved for examples that cannot be labelled unambiguously
    classes = dataset_1.labels
    classes.update({0: 'ambiguous'})

    df_classes = pd.DataFrame.from_dict(classes, orient='index')
    df_training = pd.DataFrame(dataset_1.training_set)
    df_val = pd.DataFrame(dataset_1.validation_set)
    df_test = pd.DataFrame(dataset_1.test_set)

    df_classes.to_csv(classes_path, header=False, index=True)
    df_training.to_csv(training_set_path, header=False, index=False)
    df_val.to_csv(validation_set_path, header=False, index=False)
    df_test.to_csv(test_set_path, header=False, index=False)

except Exception:
    logger.error(traceback.format_exc())
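
# A typical invocation might look like this (a sketch; the dataset path is a placeholder):
#   $ python exec/dataset_building_exec.py --dataset-path data/natural_images \
#       --train-size 0.7 --validation-size 0.2 --test-size 0.1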
56 changes: 56 additions & 0 deletions exec/image_classification_exec.py
@@ -0,0 +1,56 @@
import argparse
import logging
import os
import traceback

from computer_vision.classVgg import Vgg
from dataset_utils.classImageClassificationDataset import ImageClassificationDataset

parser = argparse.ArgumentParser(description='Image classification program')
parser.add_argument('--dataset-path', type=str, help='Path to the dataset', default=".")
parser.add_argument('--metadata-path', type=str, help='Path to the metadata', default=".")
parser.add_argument('--batch-size', type=int, help='Batch size', default=1)
parser.add_argument('--train-size', type=float, help='Training set size', default=0.7)
parser.add_argument('--validation-size', type=float, help='Validation set size', default=0.2)
parser.add_argument('--test-size', type=float, help='Test set size', default=0.1)
parser.add_argument('--train', type=int, help='Training mode', default=1)
parser.add_argument('--optimizer', type=str, help='Optimizer', default='adam')
parser.add_argument('--learning-rate', type=float, help='Learning rate', default=0.01)
parser.add_argument('--from-pretrained', type=int, help='Transfer learning mode', default=0)
parser.add_argument('--name', type=str, help='The unique name of the program', default="vgg")
parser.add_argument('--debug', type=int, help='Debug mode', default=0)
args = parser.parse_args()

logger = logging.Logger("image_classification_exec",
level=logging.DEBUG if args.debug else logging.INFO)

consoleHandler = logging.StreamHandler()
consoleHandler.setLevel(logging.DEBUG if args.debug else logging.INFO)
logger.addHandler(consoleHandler)

fileHandler = logging.FileHandler("image_classification_exec.log")
fileHandler.setLevel(logging.DEBUG)
logger.addHandler(fileHandler)

assert os.path.isdir(args.dataset_path), "{0} is not a valid directory".format(args.dataset_path)

try:

    cd_1 = ImageClassificationDataset(
        args.dataset_path, train_size=args.train_size,
        val_size=args.validation_size, test_size=args.test_size)

    # Class 0 is reserved for examples that cannot be labelled unambiguously
    classes = cd_1.labels
    classes.update({0: 'ambiguous'})

    vgg_1 = Vgg(classes, batch_size=args.batch_size, height=224, width=224,
                dim_out=len(classes), grayscale=True, binarize=False, normalize=False,
                learning_rate=args.learning_rate, n_epochs=1, validation_step=10,
                checkpoint_step=100, is_encoder=False, validation_size=10,
                optimizer=args.optimizer, metadata_path=args.metadata_path,
                name=args.name, from_pretrained=args.from_pretrained,
                logger=logger, debug=args.debug)

    vgg_1.fit(cd_1.training_set, cd_1.validation_set)

except Exception:
    logger.error(traceback.format_exc())
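
# A typical invocation might look like this (a sketch; paths are placeholders):
#   $ python exec/image_classification_exec.py --dataset-path data/natural_images \
#       --metadata-path metadata --batch-size 32 --learning-rate 0.001 --name vgg_1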