The Chicago Taxi example demonstrates the end-to-end workflow for analyzing, validating, and transforming data, training a model, analyzing the trained model, and serving it. It uses:
- TensorFlow Data Validation for analyzing and validating data,
- TensorFlow Transform for feature preprocessing,
- TensorFlow Estimators for training,
- TensorFlow Model Analysis and Jupyter for evaluation, and
- TensorFlow Serving for serving.
The example shows two modes of deployment:
- Local mode with all necessary dependencies and components deployed locally.
- Cloud mode where all components are deployed on Google Cloud.
This example uses the Taxi Trips dataset released by the City of Chicago.
Note: This site provides applications using data that has been modified for use from its original source, www.cityofchicago.org, the official website of the City of Chicago. The City of Chicago makes no claims as to the content, accuracy, timeliness, or completeness of any of the data provided at this site. The data provided at this site is subject to change at any time. It is understood that the data provided at this site is being used at one’s own risk.
Read more about the dataset in Google BigQuery. Explore the full dataset in the BigQuery UI.
Apache Beam is used for distributed processing. This example requires Python 2.7 because the Apache Beam SDK (BEAM-1373) is not yet available for Python 3.
Development for this example will be isolated in a Python virtual environment. This allows us to experiment with different versions of dependencies.
There are many ways to install `virtualenv`; see the TensorFlow install guides for different platforms, but here are a couple:
- For Linux:
sudo apt-get install python-pip python-virtualenv python-dev build-essential
- For Mac:
sudo easy_install pip
pip install --upgrade virtualenv
Create a Python 2.7 virtual environment for this example and activate the `virtualenv`:
virtualenv -p python2.7 taxi
source ./taxi/bin/activate
Next, install the dependencies required by the Chicago Taxi example:
pip install -r requirements.txt
Register the TensorFlow Model Analysis rendering components with Jupyter Notebook:
jupyter nbextension install --py --symlink --sys-prefix tensorflow_model_analysis
jupyter nbextension enable --py --sys-prefix tensorflow_model_analysis
The benefit of the local example is that you can edit any part of the pipeline and experiment very quickly with various components. The example comes with a small subset of the Taxi Trips dataset as CSV files.
`tf.DataValidation` (`tfdv_analyze_and_validate.py`) allows you to analyze and validate the input dataset. To run this step locally:
bash ./tfdv_analyze_and_validate_local.sh
We first compute descriptive statistics over the training data. The statistics provide a quick overview of the data in terms of the features that are present and the shapes of their value distributions. `tf.DataValidation` uses Apache Beam's data-parallel processing framework to scale the computation of statistics over large datasets. To compute statistics, we use the `GenerateStatistics` Beam transform, which outputs a `DatasetFeatureStatisticsList` protocol buffer.
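For illustration, the same computation can be driven from Python through TFDV's utility wrapper; a minimal sketch, assuming a local CSV file (the path is hypothetical):

```python
import tensorflow_data_validation as tfdv

# generate_statistics_from_csv wraps the GenerateStatistics Beam
# transform and runs it locally with the DirectRunner.
# The path below is hypothetical.
train_stats = tfdv.generate_statistics_from_csv(
    data_location='./data/train/data.csv')
```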
Next, we infer a schema for the data using the computed statistics. The schema describes the expected properties of the data, such as which features are expected to be present, their types, and the expected domains of features, to name a few. In short, the schema describes the expectations for "correct" data and can thus be used to detect errors in the data. To infer the schema, we use the `infer_schema` function, which takes as input the `DatasetFeatureStatisticsList` protocol buffer produced in the above step and outputs a `Schema` protocol buffer.
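Given those statistics, schema inference is a single function call; a sketch, reusing the hypothetical CSV path from the previous sketch:

```python
import tensorflow_data_validation as tfdv

# Compute statistics (as in the previous sketch), then infer a
# Schema proto from them.
train_stats = tfdv.generate_statistics_from_csv(
    data_location='./data/train/data.csv')
schema = tfdv.infer_schema(statistics=train_stats)
```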
In general, `tf.DataValidation` uses conservative heuristics to infer stable data properties from the statistics in order to avoid overfitting the schema to the specific dataset. It is strongly advised to review the inferred schema and refine it as needed (the schema is stored in the file `./data/local_tfdv_output/schema.pbtxt`) to capture any domain knowledge about the data that `tf.DataValidation`'s heuristics might have missed.
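As a sketch of what such a refinement might look like (the feature name and domain below are hypothetical, not part of the example):

```python
import tensorflow_data_validation as tfdv
from tensorflow_metadata.proto.v0 import schema_pb2

# Load the inferred schema, encode domain knowledge the heuristics
# cannot guess (feature name and range are hypothetical), and save
# the schema back to disk.
schema = tfdv.load_schema_text('./data/local_tfdv_output/schema.pbtxt')
tfdv.set_domain(schema, 'trip_start_hour',
                schema_pb2.IntDomain(min=0, max=23))
tfdv.write_schema_text(schema, './data/local_tfdv_output/schema.pbtxt')
```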
Notice that the `Schema` generated by `tf.DataValidation` is used by the subsequent steps (`preprocess_local.sh`, `train_local.sh`, `process_tfma_local.sh`, and `classify_local.sh`) to parse the input examples.
Finally, we check whether the eval dataset conforms to the expectations set in the schema, or whether there exist any data anomalies. `tf.DataValidation` performs this check by computing statistics over the eval dataset, matching the computed statistics against the schema, and marking any discrepancies. To validate the statistics against the schema, we use the `validate_statistics` function, which takes as input the statistics and the schema and outputs an `Anomalies` protocol buffer.
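A sketch of this check, assuming the hypothetical local paths used above:

```python
import tensorflow_data_validation as tfdv

# Compute statistics over the eval data and validate them against
# the schema; discrepancies are reported as an Anomalies proto.
schema = tfdv.load_schema_text('./data/local_tfdv_output/schema.pbtxt')
eval_stats = tfdv.generate_statistics_from_csv(
    data_location='./data/eval/data.csv')
anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)
```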
`tf.DataValidation` also provides functions to visualize the statistics, schema, and anomalies in Jupyter. The notebook `chicago_taxi_tfma_local_playground.ipynb` describes the above steps as well as the visualization functions in detail.
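In a notebook cell, the visualization functions can be called directly on the artifacts produced above; a sketch:

```python
import tensorflow_data_validation as tfdv

# Render the statistics, schema, and anomalies computed in the
# previous sketches inside Jupyter.
tfdv.visualize_statistics(train_stats)
tfdv.display_schema(schema)
tfdv.display_anomalies(anomalies)
```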
`tf.Transform` (`preprocess.py`) allows preprocessing the data with full-pass operations over the dataset. To run this step locally:
bash ./preprocess_local.sh
Have `tf.Transform` compute the global statistics (mean, standard deviation, bucket cutoffs, etc.) in the `preprocessing_fn` (a sketch of such a function follows this list):
- Take dense float features such as trip distance and time, compute the global mean and standard deviations, and scale features to their z scores. This allows the SGD optimizer to treat steps in different directions more uniformly.
- Create a finite vocabulary for string features. This allows us to treat them as categorical features in our model.
- Bucket our longitude and latitude features. This allows the model to fit multiple weights to different parts of the lat/long grid.
- Include time-of-day and day-of-week features. This allows the model to more easily pick up on seasonality.
- Our label is binary: `1` if the tip is more than 20% of the fare, `0` otherwise.
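Here is a minimal sketch of such a `preprocessing_fn`; the feature names and bucket count are hypothetical, and older `tf.Transform` releases expose the vocabulary transform as `tft.string_to_int` rather than `tft.compute_and_apply_vocabulary`:

```python
import tensorflow as tf
import tensorflow_transform as tft

def preprocessing_fn(inputs):
  """A sketch of the full-pass transforms (feature names are hypothetical)."""
  outputs = {}
  # Scale a dense float feature to its z-score using the global mean/stddev.
  outputs['trip_miles_scaled'] = tft.scale_to_z_score(inputs['trip_miles'])
  # Build a finite vocabulary for a string feature.
  outputs['company_id'] = tft.compute_and_apply_vocabulary(inputs['company'])
  # Bucketize latitude so the model can fit weights per grid cell.
  outputs['pickup_lat_bucket'] = tft.bucketize(
      inputs['pickup_latitude'], num_buckets=10)
  # Binary label: 1 if the tip exceeds 20% of the fare, 0 otherwise.
  outputs['tips_label'] = tf.cast(
      inputs['tips'] > (inputs['fare'] * 0.2), tf.int64)
  return outputs
```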
Preprocessing creates TensorFlow operations to apply the transforms and leaves TensorFlow placeholders in the graph for the mean, bucket cutoffs, etc.
Call the `Analyze` function in `tf.Transform` to build the MapReduce-style callbacks that compute the statistics across the entire dataset. Apache Beam allows applications to write such data-transforming code once and handles the job of placing the code onto workers, whether they are in the cloud, across an enterprise cluster, or on the local machine. In this part of the example, use the `DirectRunner` (see `preprocess_local.sh`) to request that our code runs on the local machine. At the end of the Analyze job, the placeholders are replaced with their respective statistics (mean, standard deviation, etc.).
Notice that `tf.Transform` is also used to shuffle the examples (`beam.transforms.Reshuffle`). This is very important for the efficiency of non-convex stochastic learners such as SGD with deep neural networks.
Finally, call `WriteTransformFn` to save the transform and `TransformDataset` to create the examples for training. There are two key outputs of the preprocessing step (see the sketch after this list):
- The `SavedModel` containing the transform graph;
- Materialized, transformed examples in compressed `TFRecord` files (these are inputs to the TensorFlow trainer).
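Putting the pieces together, here is a self-contained sketch of the analyze/transform/write flow with a tiny inline dataset (the real `preprocess.py` reads the taxi CSVs; older `tf.Transform` releases expose these transforms under `tensorflow_transform.beam.impl`, and all paths here are hypothetical):

```python
import apache_beam as beam
import tensorflow as tf
import tensorflow_transform as tft
import tensorflow_transform.beam as tft_beam
from tensorflow_transform.coders import example_proto_coder
from tensorflow_transform.tf_metadata import dataset_metadata, dataset_schema

raw_metadata = dataset_metadata.DatasetMetadata(
    dataset_schema.from_feature_spec(
        {'trip_miles': tf.FixedLenFeature([], tf.float32)}))

def preprocessing_fn(inputs):
  return {'trip_miles_scaled': tft.scale_to_z_score(inputs['trip_miles'])}

with beam.Pipeline() as pipeline:  # DirectRunner by default
  with tft_beam.Context(temp_dir='/tmp/tft_tmp'):
    raw_data = pipeline | beam.Create(
        [{'trip_miles': 1.0}, {'trip_miles': 7.5}])
    transformed_dataset, transform_fn = (
        (raw_data, raw_metadata)
        | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))
    transformed_data, transformed_metadata = transformed_dataset
    # Save the transform graph as a SavedModel ...
    _ = transform_fn | tft_beam.WriteTransformFn('./tft_output')
    # ... and materialize shuffled, transformed examples for the trainer.
    _ = (transformed_data
         | 'Shuffle' >> beam.transforms.Reshuffle()
         | 'Write' >> beam.io.WriteToTFRecord(
             './tft_output/train_transformed',
             coder=example_proto_coder.ExampleProtoCoder(
                 transformed_metadata.schema)))
```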
The next step trains the model using TensorFlow. To run this step locally, call:
bash ./train_local.sh
The model leverages TensorFlow's Estimators and is created in the `build_estimator` function in `model.py`. The trainer's input takes the materialized, transformed examples from the previous step. Notice the pattern of sharing schema information between preprocessing and training using `taxi.py` to avoid redundant code.
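A hedged sketch of what a `build_estimator`-style function can look like (the feature names and hidden-unit sizes are hypothetical; `DNNLinearCombinedClassifier` is a standard Estimator, though the example's actual model may differ):

```python
import tensorflow as tf

def build_estimator(config, hidden_units=None):
  """A sketch of a wide-and-deep estimator (feature names are hypothetical)."""
  # Dense, z-scored features feed the deep part of the model.
  real_valued_columns = [
      tf.feature_column.numeric_column('trip_miles_scaled', shape=())]
  # Vocabulary/bucket features are treated as one-hot categoricals.
  categorical_columns = [
      tf.feature_column.categorical_column_with_identity(
          'pickup_lat_bucket', num_buckets=10, default_value=0)]
  return tf.estimator.DNNLinearCombinedClassifier(
      config=config,
      linear_feature_columns=categorical_columns,
      dnn_feature_columns=real_valued_columns,
      dnn_hidden_units=hidden_units or [100, 70, 50, 25])
```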
The `input_fn` function builds a parser that takes `tf.Example` protos from the materialized `TFRecord` files emitted in the previous preprocessing step. It also applies the `tf.Transform` operations built in the preprocessing step. Also, do not forget to remove the label, since we don't want the model to treat it as a feature.
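A minimal sketch of such an `input_fn`, assuming the feature spec comes from the `tf.Transform` output metadata and a hypothetical label key:

```python
import tensorflow as tf

def input_fn(filenames, feature_spec, label_key='tips_label', batch_size=64):
  """Reads materialized TFRecords and splits off the label."""
  dataset = tf.data.TFRecordDataset(filenames, compression_type='GZIP')
  dataset = dataset.batch(batch_size)

  def parse(serialized):
    features = tf.parse_example(serialized, feature_spec)
    # Remove the label so the model cannot treat it as a feature.
    label = features.pop(label_key)
    return features, label

  return dataset.map(parse)
```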
During training, `feature_columns` also come into play and tell the model how to use features. For example, vocabulary features are fed into the model with `categorical_column_with_identity`, which tells the model to logically treat the feature as one-hot encoded.
Inside the `eval_input_receiver_fn` callback, emit a TensorFlow graph that parses raw examples, identifies the features and label, and applies the `tf.Transform` graph; this graph is used by the TensorFlow Model Analysis batch job. Finally, the model emits a graph suitable for serving.
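A sketch of such a callback, assuming the `tf.Transform` output directory from the preprocessing sketch and a hypothetical label key (older TFMA releases may expose `EvalInputReceiver` differently):

```python
import tensorflow as tf
import tensorflow_model_analysis as tfma
import tensorflow_transform as tft

def eval_input_receiver_fn():
  """Parses raw examples and applies the tf.Transform graph for TFMA."""
  tf_transform_output = tft.TFTransformOutput('./tft_output')
  serialized = tf.placeholder(dtype=tf.string, shape=[None])
  raw_features = tf.parse_example(
      serialized, tf_transform_output.raw_feature_spec())
  transformed = tf_transform_output.transform_raw_features(raw_features)
  return tfma.export.EvalInputReceiver(
      features=transformed,
      receiver_tensors={'examples': serialized},
      labels=transformed['tips_label'])
```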
The trainer runs a quick evaluation at the end of the batch job. This limited evaluation can run only on a single machine. Later in this guide, we will use TensorFlow Model Analysis—which leverages Apache Beam—for distributed evaluation.
To recap, the trainer outputs the following:
- A `SavedModel` containing the serving graph to use with TensorFlow Serving.
- A `SavedModel` containing the evaluation graph to use with TensorFlow Model Analysis.
Now run a batch job to evaluate the model against the entire data set. To run the batch evaluator:
bash ./process_tfma_local.sh
This step is executed locally with a small CSV dataset. TensorFlow Model Analysis runs a full pass over the dataset and computes the metrics. In the Cloud section below, this step is run as a distributed job over a large data set.
Like TensorFlow Transform, `tf.ModelAnalysis` uses Apache Beam to run the distributed computation. The evaluation uses raw examples as input, not transformed examples. The job reads a local CSV file, uses the `SavedModel` from the previous step to apply the `tf.Transform` graph, and then runs the model. It is possible to analyze models in terms of transformed features instead of raw features, but that process is not described in this guide.
In `process_tfma.py`, we specified a `slice_spec` that tells the `tf.ModelAnalysis` job which slices to visualize. Slices are subsets of the data based on feature values. `tf.ModelAnalysis` computes metrics for each of those slices.
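For example, a `slice_spec` that requests overall metrics plus metrics per pickup hour might look like this (the column name is hypothetical; some TFMA versions export this as `tfma.SingleSliceSpec`):

```python
import tensorflow_model_analysis as tfma

# An empty spec is the overall slice; the second spec slices the
# data by each value of trip_start_hour.
slice_spec = [
    tfma.slicer.SingleSliceSpec(),
    tfma.slicer.SingleSliceSpec(columns=['trip_start_hour']),
]
```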
This job outputs a file that can be visualized in Jupyter with the `tf.model_analysis` renderer in the next stage.
Run the Jupyter notebook locally to view the sliced results:
jupyter notebook --ip=0.0.0.0 --port=8888 --no-browser
This prints a URL like `http://0.0.0.0:8888/?token=...`. You may need to enter a username and password the first time you visit the page (these prompts accept anything, so you can use "test" for both).
From the File tab, open the `chicago_taxi_tfma.ipynb` notebook file.
Follow the instructions in the notebook until you visualize the slicing metrics browser with `tfma.view.render_slicing_metrics` and the time series graph with `tfma.view.render_time_series`.
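In a notebook cell, the rendering boils down to loading the `eval_result` and calling the view functions; a sketch (the output path and slicing column are hypothetical):

```python
import tensorflow_model_analysis as tfma

# Load the eval_result written by the batch job and render the
# sliced metrics browser inside Jupyter.
eval_result = tfma.load_eval_result(output_path='./data/local_tfma_output')
tfma.view.render_slicing_metrics(eval_result,
                                 slicing_column='trip_start_hour')
```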
Note: This notebook is self-contained and does not rely on running the prior scripts.
Now serve the created model with TensorFlow Serving. For this example, run the server in a local Docker container. Instructions for installing Docker locally are found in the Docker install documentation.
In the terminal, run the following script to start a server:
bash ./start_model_server_local.sh
This script pulls a TensorFlow Serving image and listens for gRPC requests on `localhost` port 9000. The model server loads the model exported from the trainer.
To send a request to the server for model inference, run:
bash ./classify_local.sh
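Under the hood, such a request is a gRPC `ClassificationRequest` to the server started above; a minimal sketch, assuming the model is served under the hypothetical name `chicago_taxi` and that the `prediction_service_pb2_grpc` stubs are available in your tensorflow-serving-api version:

```python
import grpc
import tensorflow as tf
from tensorflow_serving.apis import classification_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc

channel = grpc.insecure_channel('localhost:9000')
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

# Build one raw tf.Example (the feature name/value are hypothetical).
example = tf.train.Example(features=tf.train.Features(feature={
    'trip_miles': tf.train.Feature(
        float_list=tf.train.FloatList(value=[3.2])),
}))

request = classification_pb2.ClassificationRequest()
request.model_spec.name = 'chicago_taxi'  # hypothetical model name
request.input.example_list.examples.extend([example])
print(stub.Classify(request, 15.0))  # 15-second timeout
```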
For more information, see TensorFlow Serving.
The notebook `chicago_taxi_tfma_local_playground.ipynb`, which calls the same scripts as above, is also available. It has a more detailed description of the APIs, such as custom metrics and plots, and of the UI components.
Note: This notebook is self-contained and does not rely on running the prior scripts.
This section requires the local prerequisites and adds a few more for the Google Cloud Platform.
Follow the Google Cloud Machine Learning Engine setup guide and the Google Cloud Dataflow setup guide before trying the example.
The example's execution speed may be limited by the default Google Compute Engine quota. We recommend setting a sufficient quota for approximately 250 Dataflow VMs: 250 CPUs, 250 IP Addresses, and 62500 GB of Persistent Disk. For more details, please see the GCE Quota and Dataflow Quota documentation.
This example uses Google Cloud Storage Buckets to store data, and uses local environment variables to pass paths from job to job.
Authenticate and switch to your project:
gcloud auth application-default login
gcloud config set project $PROJECT_NAME
Create the `gs://` bucket:
export MYBUCKET=gs://$(gcloud config list --format 'value(core.project)')-chicago-taxi
gsutil mb $MYBUCKET
Activate the `virtualenv` (created above) to set up the shell environment:
source ./taxi/bin/activate
This Taxi Trips example is run in the cloud. Unlike the local example, the input is a much larger dataset hosted on Google BigQuery.
Use the same code from the local `tf.DataValidation` step (in `tfdv_analyze_and_validate.py`) to do the distributed analysis and validation. To start the job, run:
source ./tfdv_analyze_and_validate_dataflow.sh
You can see the status of the running job on the Google Cloud Console.
In this case, the data is read from Google BigQuery instead of a small, local CSV file. Also, unlike our local example above, we are using the `DataflowRunner` to start distributed processing over several workers in the cloud.
The outputs are the same as the local job, but stored in Google Cloud Storage:
- Training data statistics:
gsutil ls $TFDV_OUTPUT_PATH/train_stats.tfrecord
- Schema:
gsutil ls $TFDV_OUTPUT_PATH/schema.pbtxt
- Eval data statistics:
gsutil ls $TFDV_OUTPUT_PATH/eval_stats.tfrecord
- Anomalies:
gsutil ls $TFDV_OUTPUT_PATH/anomalies.pbtxt
Notice that the `Schema` generated by `tf.DataValidation` is used by the subsequent steps (`preprocess_dataflow.sh`, `train_mlengine.sh`, `process_tfma_dataflow.sh`, and `classify_mlengine.sh`) to parse the input examples.
Use the same code from the local `tf.Transform` step (in `preprocess.py`) to do the distributed transform. To start the job, run:
source ./preprocess_dataflow.sh
You can see the status of the running job on the Google Cloud Console.
Unlike our local example above, we are using the `DataflowRunner` to start distributed processing over several workers in the cloud.
The outputs are the same as the local job, but stored in Google Cloud Storage:
- `SavedModel` containing the transform graph:
gsutil ls $TFT_OUTPUT_PATH/transform_fn
- Materialized, transformed examples (train_transformed-*):
gsutil ls $TFT_OUTPUT_PATH
Run the distributed TensorFlow trainer in the cloud:
source ./train_mlengine.sh
See the status of the running job in the Google Cloud Console.
The trainer runs in the cloud using ML Engine, not Dataflow, for the distributed computation.
Again, our outputs are identical to the local run:
- `SavedModel` containing the serving graph for TensorFlow Serving:
gsutil ls $WORKING_DIR/serving_model_dir/export/chicago-taxi
- `SavedModel` containing the evaluation graph for TensorFlow Model Analysis:
gsutil ls $WORKING_DIR/eval_model_dir
Run a distributed batch job to compute sliced metrics across the large Google BigQuery dataset. In this step, `tf.ModelAnalysis` takes advantage of the `DataflowRunner` to control its workers.
source ./process_tfma_dataflow.sh
The output is the `eval_result` file rendered by the notebook in the next step:
gsutil ls -l $TFT_OUTPUT_PATH/eval_result_dir
The steps for looking at your results in a notebook are identical to the ones from running the local job. Go to the `chicago_taxi_tfma.ipynb` notebook and set the output directory to see the results.
To serve the model from the cloud, run:
bash ./start_model_server_mlengine.sh
To send a request to the cloud:
bash ./classify_mlengine.sh