Commit 9d3d909 (initial commit, 0 parents): 63 changed files with 6,422 additions and 0 deletions.
@@ -0,0 +1,15 @@

build
venv
*.log
*.json
.idea

data/*
!data/.gitkeep

metadata/summaries/*
!metadata/summaries/.gitkeep

metadata/checkpoints/*
!metadata/checkpoints/.gitkeep
@@ -0,0 +1,165 @@
# Machine learning toolkit

This package is a personal project for research purposes. It covers machine learning model implementations and
processing functions. The models are implemented and deployed mainly with Google technologies such as TensorFlow and
Google Cloud Platform utilities.

Python 2.7 was chosen to ensure compatibility between components. The project is structured with Distutils as the
build automation tool. Tests are implemented for each module using unittest.
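
For reference, here is a minimal sketch of the kind of Distutils `setup.py` this implies (the package list and metadata are illustrative, not the project's actual manifest):
```
# setup.py -- minimal Distutils sketch; names are illustrative
from distutils.core import setup

setup(
    name='ml-toolkit',
    version='0.1',
    description='Personal machine learning toolkit',
    package_dir={'': 'src'},  # sources live under src/, as shown below
    packages=['cloud_tools', 'model', 'processing'],
)
```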

## Installation procedure

Install the third-party libraries:
```
$ pip install -r requirements.txt
```

Install the modules:
```
$ python setup.py install
```

## Packages

### cloud_tools

The cloud_tools package allows complex machine learning pipelines to be deployed in the cloud.

#### /gcp

Utilities for machine learning on Google Cloud Platform. Processing is done with Google Dataflow (Apache Beam), and
model training and prediction are done with Google ML Engine.

##### /example

Transfer learning with the flowers dataset: https://cloud.google.com/ml-engine/docs/tensorflow/flowers-tutorial

The example shows how to use Dataflow and ML Engine to preprocess image data and then apply a classifier model. The
purpose is to classify images using transfer learning.
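
The core idea, stripped of the pipeline plumbing: preprocessing turns each image into a fixed-size embedding vector, and only a small classifier is trained on top of the frozen network. A minimal sketch of such a head in TensorFlow 1.x (dimensions and names are illustrative, not the tutorial's actual code):
```
import tensorflow as tf

EMBEDDING_DIM = 2048  # bottleneck size of the pretrained network (illustrative)
NUM_CLASSES = 5       # daisy, dandelion, roses, sunflowers, tulips

# Embeddings are precomputed by the Dataflow preprocessing step.
embeddings = tf.placeholder(tf.float32, [None, EMBEDDING_DIM])
labels = tf.placeholder(tf.int64, [None])

# Only this dense layer is trained; the convolutional weights stay frozen.
logits = tf.layers.dense(embeddings, NUM_CLASSES)
loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=labels, logits=logits))
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)
```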

First, define the environment variables:
```
$ cd src/cloud_tools/gcp
$ . example/env_variable.sh
```

To run the preprocessing pipeline on the evaluation data, execute the following command:
```
$ python example/image_preprocess.py --input_dict "$DICT_FILE" --input_path "gs://cloud-ml-data/img/flower_photos/eval_set.csv" --output_path "${GCS_PATH}/preproc/eval" --cloud
```

For the training data, use the following command:
```
$ python example/image_preprocess.py --input_dict "$DICT_FILE" --input_path "gs://cloud-ml-data/img/flower_photos/train_set.csv" --output_path "${GCS_PATH}/preproc/train" --cloud
```

Now that the embeddings of the training and evaluation sets are stored on Cloud Storage, we can train the model
(use admin privileges if necessary):
```
$ gcloud ml-engine jobs submit training "$JOB_NAME" \
    --stream-logs \
    --module-name example.image_classification_task \
    --package-path example \
    --staging-bucket "$BUCKET_NAME" \
    --region "$REGION" \
    --runtime-version=1.4 \
    -- \
    --output_path "${GCS_PATH}/training" \
    --eval_data_paths "${GCS_PATH}/preproc/eval*" \
    --train_data_paths "${GCS_PATH}/preproc/train*"
```

You can follow the training steps with TensorBoard:
```
$ tensorboard --logdir=$GCS_PATH/training
```

Create the model resource:
```
$ gcloud ml-engine models create "$MODEL_NAME" \
    --regions "$REGION"
```

Deploy the model for prediction:
```
$ gcloud ml-engine versions create "$VERSION_NAME" \
    --model "$MODEL_NAME" \
    --origin "${GCS_PATH}/training/model" \
    --runtime-version=1.4
```

Make a prediction from an image:
```
$ cd data/flowers
$ python -c 'import base64, sys, json; img = base64.b64encode(open(sys.argv[1], "rb").read()); print json.dumps({"key":"0", "image_bytes": {"b64": img}})' daisy.jpg > request.json
$ gcloud ml-engine predict --model ${MODEL_NAME} --json-instances request.json
```
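
The Python one-liner above, expanded into an equivalent readable script (same logic, written for Python 2.7 like the rest of the project; the file name is arbitrary):
```
# build_request.py -- equivalent to the one-liner above (Python 2)
import base64
import json
import sys

# Online prediction expects one JSON instance per line; binary image
# payloads are wrapped as {"b64": <base64 string>}.
with open(sys.argv[1], 'rb') as f:
    img = base64.b64encode(f.read())

print json.dumps({"key": "0", "image_bytes": {"b64": img}})
```
Usage: `python build_request.py daisy.jpg > request.json`.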

##### /image_classification

Train a VGG image classification model on the preprocessed natural images dataset:
```
$ gcloud ml-engine jobs submit training "$JOB_NAME" \
    --stream-logs --module-name image_classification.image_classification_task \
    --package-path image_classification \
    --staging-bucket "$BUCKET_NAME" --region "$REGION" \
    --runtime-version=1.4 \
    -- \
    --output_path "${BUCKET_NAME}/model/vgg_16/natural_images/training" \
    --eval_data_paths "${GCS_PATH}/validation*" \
    --train_data_paths "${GCS_PATH}/training*"
```

### model

#### /computer_vision

Computer vision models.

#### /ner

Named Entity Recognition models.

#### /rnn

Recurrent Neural Networks.

### processing

Processing components.

#### /dataset_utils

The dataset_utils module aims at normalizing access to the dataset.
From loading the dataset in memory to train-test splitting, the module exposes all
the utilities needed for several kinds of machine learning models.
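
As used by the executable scripts in this commit, a typical interaction with the module looks like this (the dataset path is illustrative):
```
from dataset_utils.classImageClassificationDataset import ImageClassificationDataset

# Load the dataset and split it 70/20/10 in one step
dataset = ImageClassificationDataset(
    "data/natural_images", train_size=0.7,
    val_size=0.2, test_size=0.1)

labels = dataset.labels          # mapping of class index to class name
training = dataset.training_set  # splits ready to feed to a model's fit()
validation = dataset.validation_set
```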

#### /variable_selection

Variable selection, the process of reducing the number of variables used by a model, is
often a good way to improve a model's stability and performance. This module implements several
methods so that they can be used by all kinds of models.
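
As an illustration of this kind of method, here is a generic filter-style selector that drops near-constant variables (a sketch of the approach, not this module's actual API):
```
import numpy as np

def select_by_variance(X, threshold=1e-3):
    """Keep only the columns of X whose variance exceeds the threshold.

    Near-constant variables carry little information and can hurt a
    model's stability; dropping them is the simplest filter method.
    """
    variances = np.var(X, axis=0)
    mask = variances > threshold
    return X[:, mask], mask

# Usage: X_reduced, mask = select_by_variance(X_train)
```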

## Data

### Caltech-256

The Caltech-256 dataset contains about 30,000 images spanning 256 object categories, in JPEG format. Each
category contains at least 80 images.

For further information, follow this link: www.vision.caltech.edu/Image_Datasets/Caltech256/

### Natural images

This dataset contains 6,899 images from 8 distinct classes
compiled from various sources. The classes are airplane,
car, cat, dog, flower, fruit, motorbike and person.

For further information, follow this link: https://www.kaggle.com/prasunroy/natural-images
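
Once the dataset is downloaded, the class distribution can be sanity-checked quickly (assuming the Kaggle layout of one sub-directory per class; the local path is illustrative):
```
import os

data_dir = "data/natural_images"  # adjust to the download location
for class_name in sorted(os.listdir(data_dir)):
    class_dir = os.path.join(data_dir, class_name)
    if os.path.isdir(class_dir):
        # Count the images in each class sub-directory
        print class_name, len(os.listdir(class_dir))
```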

## MetaData

## Documentation

## Executable
(One empty file and six binary files are not shown.)
@@ -0,0 +1,62 @@
import argparse
import logging
import os
import traceback

import pandas as pd

from dataset_utils.classImageClassificationDataset import ImageClassificationDataset

parser = argparse.ArgumentParser(description='Dataset building program')
parser.add_argument('--dataset-path', type=str, help='Path to the dataset', default=".")
parser.add_argument('--train-size', type=float, help='Training set size', default=0.7)
parser.add_argument('--validation-size', type=float, help='Validation set size', default=0.2)
parser.add_argument('--test-size', type=float, help='Test set size', default=0.1)
parser.add_argument('--name', type=str, help='The unique name of the program', default="")
parser.add_argument('--debug', type=int, help='Debug mode', default=0)
args = parser.parse_args()

logger = logging.Logger("dataset_building_exec",
                        level=logging.DEBUG if args.debug else logging.INFO)

consoleHandler = logging.StreamHandler()
consoleHandler.setLevel(logging.DEBUG if args.debug else logging.INFO)
logger.addHandler(consoleHandler)

fileHandler = logging.FileHandler("dataset_building_exec.log")
fileHandler.setLevel(logging.DEBUG)
logger.addHandler(fileHandler)

assert os.path.isdir(args.dataset_path), "{0} is not a valid directory".format(args.dataset_path)

try:
    classes_path = os.path.join(args.dataset_path, "classes.csv")
    training_set_path = os.path.join(args.dataset_path, "training_set.csv")
    validation_set_path = os.path.join(args.dataset_path, "validation_set.csv")
    test_set_path = os.path.join(args.dataset_path, "test_set.csv")

    # Remove any output files left over from a previous run
    for path in (classes_path, training_set_path, validation_set_path, test_set_path):
        if os.path.exists(path):
            os.remove(path)

    dataset_1 = ImageClassificationDataset(
        args.dataset_path, train_size=args.train_size,
        val_size=args.validation_size, test_size=args.test_size,
        absolute_path=False)

    # Reserve index 0 for images that cannot be labelled
    classes = dataset_1.labels
    classes.update({0: 'ambiguous'})

    df_classes = pd.DataFrame.from_dict(classes, orient='index')
    df_training = pd.DataFrame(dataset_1.training_set)
    df_val = pd.DataFrame(dataset_1.validation_set)
    df_test = pd.DataFrame(dataset_1.test_set)

    df_classes.to_csv(classes_path, header=False, index=True)
    df_training.to_csv(training_set_path, header=False, index=False)
    df_val.to_csv(validation_set_path, header=False, index=False)
    df_test.to_csv(test_set_path, header=False, index=False)

except Exception:
    logger.error(traceback.format_exc())
@@ -0,0 +1,56 @@
import argparse
import logging
import os
import traceback

from computer_vision.classVgg import Vgg
from dataset_utils.classImageClassificationDataset import ImageClassificationDataset

parser = argparse.ArgumentParser(description='Image classification program')
parser.add_argument('--dataset-path', type=str, help='Path to the dataset', default=".")
parser.add_argument('--metadata-path', type=str, help='Path to the metadata', default=".")
parser.add_argument('--batch-size', type=int, help='Batch size', default=1)
parser.add_argument('--train-size', type=float, help='Training set size', default=0.7)
parser.add_argument('--validation-size', type=float, help='Validation set size', default=0.2)
parser.add_argument('--test-size', type=float, help='Test set size', default=0.1)
parser.add_argument('--train', type=int, help='Training mode', default=1)
parser.add_argument('--optimizer', type=str, help='Optimizer', default='adam')
parser.add_argument('--learning-rate', type=float, help='Learning rate', default=0.01)
parser.add_argument('--from-pretrained', type=int, help='Transfer learning mode', default=0)
parser.add_argument('--name', type=str, help='The unique name of the program', default="vgg")
parser.add_argument('--debug', type=int, help='Debug mode', default=0)
args = parser.parse_args()

logger = logging.Logger("image_classification_exec",
                        level=logging.DEBUG if args.debug else logging.INFO)

consoleHandler = logging.StreamHandler()
consoleHandler.setLevel(logging.DEBUG if args.debug else logging.INFO)
logger.addHandler(consoleHandler)

fileHandler = logging.FileHandler("image_classification_exec.log")
fileHandler.setLevel(logging.DEBUG)
logger.addHandler(fileHandler)

assert os.path.isdir(args.dataset_path), "{0} is not a valid directory".format(args.dataset_path)

try:
    # Load the dataset and split it into training / validation / test sets
    cd_1 = ImageClassificationDataset(
        args.dataset_path, train_size=args.train_size,
        val_size=args.validation_size, test_size=args.test_size)
    classes = cd_1.labels
    classes.update({0: 'ambiguous'})

    # Build the VGG classifier; inputs are resized to 224x224
    vgg_1 = Vgg(classes, batch_size=args.batch_size, height=224, width=224,
                dim_out=len(classes), grayscale=True, binarize=False, normalize=False,
                learning_rate=args.learning_rate, n_epochs=1, validation_step=10,
                checkpoint_step=100, is_encoder=False, validation_size=10,
                optimizer=args.optimizer, metadata_path=args.metadata_path,
                name=args.name, from_pretrained=args.from_pretrained,
                logger=logger, debug=args.debug)

    vgg_1.fit(cd_1.training_set, cd_1.validation_set)

except Exception:
    logger.error(traceback.format_exc())