Commit 9d3d909 (initial commit, 0 parents): 63 changed files with 6,422 additions and 0 deletions.
@@ -0,0 +1,15 @@

build
venv
*.log
*.json
.idea

data/*
!data/.gitkeep

metadata/summaries/*
!metadata/summaries/.gitkeep

metadata/checkpoints/*
!metadata/checkpoints/.gitkeep
@@ -0,0 +1,165 @@
# Machine learning toolkit

This package is a personal project for research purposes. It covers machine learning model implementations and
processing functions. The models are implemented and deployed mainly with Google technologies such as TensorFlow and
Google Cloud Platform utilities.

Python 2.7 was chosen to ensure compatibility between components. The project is structured with Distutils as the
build automation tool. Tests are implemented for each module using unittest.
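
For reference, here is a minimal sketch of the kind of Distutils `setup.py` this implies (the package list and metadata are illustrative, not the project's actual manifest):
```
# setup.py -- minimal Distutils sketch; names are illustrative
from distutils.core import setup

setup(
    name='ml-toolkit',
    version='0.1',
    description='Personal machine learning toolkit',
    package_dir={'': 'src'},  # sources live under src/, as shown below
    packages=['cloud_tools', 'model', 'processing'],
)
```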

## Installation procedure

Install the third-party libraries:
```
$ pip install -r requirements.txt
```

Install the modules:
```
$ python setup.py install
```

## Packages

### cloud_tools

The cloud_tools package allows complex machine learning pipelines to be deployed in the cloud.

#### /gcp

Utilities for machine learning on Google Cloud Platform. Processing is done with Google Dataflow (Apache Beam), and
model training and prediction are done with Google ML Engine.

##### /example

Transfer learning with the flowers dataset: https://cloud.google.com/ml-engine/docs/tensorflow/flowers-tutorial

The example shows how to use Dataflow and ML Engine to preprocess image data and then apply a classifier model. The
purpose is to classify images using transfer learning.
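
The core idea, stripped of the pipeline plumbing: preprocessing turns each image into a fixed-size embedding vector, and only a small classifier is trained on top of the frozen network. A minimal sketch of such a head in TensorFlow 1.x (dimensions and names are illustrative, not the tutorial's actual code):
```
import tensorflow as tf

EMBEDDING_DIM = 2048  # bottleneck size of the pretrained network (illustrative)
NUM_CLASSES = 5       # daisy, dandelion, roses, sunflowers, tulips

# Embeddings are precomputed by the Dataflow preprocessing step.
embeddings = tf.placeholder(tf.float32, [None, EMBEDDING_DIM])
labels = tf.placeholder(tf.int64, [None])

# Only this dense layer is trained; the convolutional weights stay frozen.
logits = tf.layers.dense(embeddings, NUM_CLASSES)
loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=labels, logits=logits))
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)
```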

First, define the environment variables:
```
$ cd src/cloud_tools/gcp
$ . example/env_variable.sh
```

To run the preprocessing pipeline on the evaluation data, execute the following command:
```
$ python example/image_preprocess.py --input_dict "$DICT_FILE" --input_path "gs://cloud-ml-data/img/flower_photos/eval_set.csv" --output_path "${GCS_PATH}/preproc/eval" --cloud
```

For the training data, use the following command:
```
$ python example/image_preprocess.py --input_dict "$DICT_FILE" --input_path "gs://cloud-ml-data/img/flower_photos/train_set.csv" --output_path "${GCS_PATH}/preproc/train" --cloud
```

Now that the embeddings of the training and evaluation sets are stored on Cloud Storage, we can train the model
(use admin privileges if necessary):
```
$ gcloud ml-engine jobs submit training "$JOB_NAME" \
    --stream-logs \
    --module-name example.image_classification_task \
    --package-path example \
    --staging-bucket "$BUCKET_NAME" \
    --region "$REGION" \
    --runtime-version=1.4 \
    -- \
    --output_path "${GCS_PATH}/training" \
    --eval_data_paths "${GCS_PATH}/preproc/eval*" \
    --train_data_paths "${GCS_PATH}/preproc/train*"
```

You can follow the training steps with TensorBoard:
```
$ tensorboard --logdir=$GCS_PATH/training
```

Create the model resource:
```
$ gcloud ml-engine models create "$MODEL_NAME" \
    --regions "$REGION"
```

Deploy the model for prediction:
```
$ gcloud ml-engine versions create "$VERSION_NAME" \
    --model "$MODEL_NAME" \
    --origin "${GCS_PATH}/training/model" \
    --runtime-version=1.4
```

Make a prediction from an image:
```
$ cd data/flowers
$ python -c 'import base64, sys, json; img = base64.b64encode(open(sys.argv[1], "rb").read()); print json.dumps({"key":"0", "image_bytes": {"b64": img}})' daisy.jpg > request.json
$ gcloud ml-engine predict --model ${MODEL_NAME} --json-instances request.json
```
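
The Python one-liner above, expanded into an equivalent readable script (same logic, written for Python 2.7 like the rest of the project; the file name is arbitrary):
```
# build_request.py -- equivalent to the one-liner above (Python 2)
import base64
import json
import sys

# Online prediction expects one JSON instance per line; binary image
# payloads are wrapped as {"b64": <base64 string>}.
with open(sys.argv[1], 'rb') as f:
    img = base64.b64encode(f.read())

print json.dumps({"key": "0", "image_bytes": {"b64": img}})
```
Usage: `python build_request.py daisy.jpg > request.json`.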

##### /image_classification

Train a VGG image classification model on the preprocessed natural images dataset:
```
$ gcloud ml-engine jobs submit training "$JOB_NAME" \
    --stream-logs --module-name image_classification.image_classification_task \
    --package-path image_classification \
    --staging-bucket "$BUCKET_NAME" --region "$REGION" \
    --runtime-version=1.4 \
    -- \
    --output_path "${BUCKET_NAME}/model/vgg_16/natural_images/training" \
    --eval_data_paths "${GCS_PATH}/validation*" \
    --train_data_paths "${GCS_PATH}/training*"
```

### model

#### /computer_vision

Computer vision models.

#### /ner

Named Entity Recognition models.

#### /rnn

Recurrent Neural Networks.

### processing

Processing components.

#### /dataset_utils

The dataset_utils module aims at normalizing access to the dataset.
From loading the dataset in memory to train-test splitting, the module exposes all
the utilities needed for several kinds of machine learning models.
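
As used by the executable scripts in this commit, a typical interaction with the module looks like this (the dataset path is illustrative):
```
from dataset_utils.classImageClassificationDataset import ImageClassificationDataset

# Load the dataset and split it 70/20/10 in one step
dataset = ImageClassificationDataset(
    "data/natural_images", train_size=0.7,
    val_size=0.2, test_size=0.1)

labels = dataset.labels          # mapping of class index to class name
training = dataset.training_set  # splits ready to feed to a model's fit()
validation = dataset.validation_set
```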

#### /variable_selection

Variable selection, the process of reducing the number of variables used by a model, is
often a good way to improve a model's stability and performance. This module implements several
methods so that they can be used by all kinds of models.
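
As an illustration of this kind of method, here is a generic filter-style selector that drops near-constant variables (a sketch of the approach, not this module's actual API):
```
import numpy as np

def select_by_variance(X, threshold=1e-3):
    """Keep only the columns of X whose variance exceeds the threshold.

    Near-constant variables carry little information and can hurt a
    model's stability; dropping them is the simplest filter method.
    """
    variances = np.var(X, axis=0)
    mask = variances > threshold
    return X[:, mask], mask

# Usage: X_reduced, mask = select_by_variance(X_train)
```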

## Data

### Caltech-256

The Caltech-256 dataset contains about 30,000 images spanning 256 object categories, in JPEG format. Each
category contains at least 80 images.

For further information, follow this link: www.vision.caltech.edu/Image_Datasets/Caltech256/

### Natural images

This dataset contains 6,899 images from 8 distinct classes
compiled from various sources. The classes are airplane,
car, cat, dog, flower, fruit, motorbike and person.

For further information, follow this link: https://www.kaggle.com/prasunroy/natural-images
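
Once the dataset is downloaded, the class distribution can be sanity-checked quickly (assuming the Kaggle layout of one sub-directory per class; the local path is illustrative):
```
import os

data_dir = "data/natural_images"  # adjust to the download location
for class_name in sorted(os.listdir(data_dir)):
    class_dir = os.path.join(data_dir, class_name)
    if os.path.isdir(class_dir):
        # Count the images in each class sub-directory
        print class_name, len(os.listdir(class_dir))
```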

## MetaData

## Documentation

## Executable
(One empty file and six binary files are not shown.)
@@ -0,0 +1,62 @@
import argparse
import logging
import os
import traceback

import pandas as pd

from dataset_utils.classImageClassificationDataset import ImageClassificationDataset

parser = argparse.ArgumentParser(description='Dataset building program')
parser.add_argument('--dataset-path', type=str, help='Path to the dataset', default=".")
parser.add_argument('--train-size', type=float, help='Training set size', default=0.7)
parser.add_argument('--validation-size', type=float, help='Validation set size', default=0.2)
parser.add_argument('--test-size', type=float, help='Test set size', default=0.1)
parser.add_argument('--name', type=str, help='The unique name of the program', default="")
parser.add_argument('--debug', type=int, help='Debug mode', default=0)
args = parser.parse_args()

logger = logging.Logger("dataset_building_exec",
                        level=logging.DEBUG if args.debug else logging.INFO)

consoleHandler = logging.StreamHandler()
consoleHandler.setLevel(logging.DEBUG if args.debug else logging.INFO)
logger.addHandler(consoleHandler)

fileHandler = logging.FileHandler("dataset_building_exec.log")
fileHandler.setLevel(logging.DEBUG)
logger.addHandler(fileHandler)

assert os.path.isdir(args.dataset_path), "{0} is not a valid directory".format(args.dataset_path)

try:
    classes_path = os.path.join(args.dataset_path, "classes.csv")
    training_set_path = os.path.join(args.dataset_path, "training_set.csv")
    validation_set_path = os.path.join(args.dataset_path, "validation_set.csv")
    test_set_path = os.path.join(args.dataset_path, "test_set.csv")

    # Remove any output files left over from a previous run
    for path in (classes_path, training_set_path, validation_set_path, test_set_path):
        if os.path.exists(path):
            os.remove(path)

    dataset_1 = ImageClassificationDataset(
        args.dataset_path, train_size=args.train_size,
        val_size=args.validation_size, test_size=args.test_size,
        absolute_path=False)

    # Reserve index 0 for images that cannot be labelled
    classes = dataset_1.labels
    classes.update({0: 'ambiguous'})

    df_classes = pd.DataFrame.from_dict(classes, orient='index')
    df_training = pd.DataFrame(dataset_1.training_set)
    df_val = pd.DataFrame(dataset_1.validation_set)
    df_test = pd.DataFrame(dataset_1.test_set)

    df_classes.to_csv(classes_path, header=False, index=True)
    df_training.to_csv(training_set_path, header=False, index=False)
    df_val.to_csv(validation_set_path, header=False, index=False)
    df_test.to_csv(test_set_path, header=False, index=False)

except Exception:
    logger.error(traceback.format_exc())
@@ -0,0 +1,56 @@
import argparse
import logging
import os
import traceback

from computer_vision.classVgg import Vgg
from dataset_utils.classImageClassificationDataset import ImageClassificationDataset

parser = argparse.ArgumentParser(description='Image classification program')
parser.add_argument('--dataset-path', type=str, help='Path to the dataset', default=".")
parser.add_argument('--metadata-path', type=str, help='Path to the metadata', default=".")
parser.add_argument('--batch-size', type=int, help='Batch size', default=1)
parser.add_argument('--train-size', type=float, help='Training set size', default=0.7)
parser.add_argument('--validation-size', type=float, help='Validation set size', default=0.2)
parser.add_argument('--test-size', type=float, help='Test set size', default=0.1)
parser.add_argument('--train', type=int, help='Training mode', default=1)
parser.add_argument('--optimizer', type=str, help='Optimizer', default='adam')
parser.add_argument('--learning-rate', type=float, help='Learning rate', default=0.01)
parser.add_argument('--from-pretrained', type=int, help='Transfer learning mode', default=0)
parser.add_argument('--name', type=str, help='The unique name of the program', default="vgg")
parser.add_argument('--debug', type=int, help='Debug mode', default=0)
args = parser.parse_args()

logger = logging.Logger("image_classification_exec",
                        level=logging.DEBUG if args.debug else logging.INFO)

consoleHandler = logging.StreamHandler()
consoleHandler.setLevel(logging.DEBUG if args.debug else logging.INFO)
logger.addHandler(consoleHandler)

fileHandler = logging.FileHandler("image_classification_exec.log")
fileHandler.setLevel(logging.DEBUG)
logger.addHandler(fileHandler)

assert os.path.isdir(args.dataset_path), "{0} is not a valid directory".format(args.dataset_path)

try:
    # Load the dataset and split it into training / validation / test sets
    cd_1 = ImageClassificationDataset(
        args.dataset_path, train_size=args.train_size,
        val_size=args.validation_size, test_size=args.test_size)
    classes = cd_1.labels
    classes.update({0: 'ambiguous'})

    # Build the VGG classifier; inputs are resized to 224x224
    vgg_1 = Vgg(classes, batch_size=args.batch_size, height=224, width=224,
                dim_out=len(classes), grayscale=True, binarize=False, normalize=False,
                learning_rate=args.learning_rate, n_epochs=1, validation_step=10,
                checkpoint_step=100, is_encoder=False, validation_size=10,
                optimizer=args.optimizer, metadata_path=args.metadata_path,
                name=args.name, from_pretrained=args.from_pretrained,
                logger=logger, debug=args.debug)

    vgg_1.fit(cd_1.training_set, cd_1.validation_set)

except Exception:
    logger.error(traceback.format_exc())