Run template

main.py - Script to run an Apache Beam template on Google Cloud Dataflow.

The following examples show how to run the Word_Count template, but you can run any other template.

The Word_Count template requires an output Cloud Storage path prefix, and optionally accepts an inputFile Cloud Storage file pattern for the inputs. If inputFile is not passed, it defaults to gs://apache-beam-samples/shakespeare/kinglear.txt.
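
For example, relying on the default inputFile, a minimal run needs only the required flags (the project and bucket names below are placeholders; each flag is explained in the sections that follow):

python main.py \
  --project <your-gcp-project> \
  --job wordcount-$(date +'%Y%m%d-%H%M%S') \
  --template gs://dataflow-templates/latest/Word_Count \
  --output gs://<your-gcs-bucket>/wordcount/outputs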

Before you begin

  1. Install the Cloud SDK.

  2. Create a new project.

  3. Enable billing.

  4. Enable the APIs: Dataflow, Compute Engine, Stackdriver Logging, Cloud Storage, Cloud Storage JSON, BigQuery, Pub/Sub, Datastore, Cloud Functions, and Cloud Resource Manager. See the snippet after this list for one way to enable them from the command line.

  5. Set up the Cloud SDK for your GCP project.

    gcloud init
  6. Create a Cloud Storage bucket.

    gsutil mb gs://your-gcs-bucket
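
One way to enable the required APIs from the command line is with gcloud services enable; the service names below are the standard ones for these products, but verify them for your project:

gcloud services enable \
  dataflow.googleapis.com \
  compute.googleapis.com \
  logging.googleapis.com \
  storage-component.googleapis.com \
  storage-api.googleapis.com \
  bigquery.googleapis.com \
  pubsub.googleapis.com \
  datastore.googleapis.com \
  cloudfunctions.googleapis.com \
  cloudresourcemanager.googleapis.com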

Setup

The following instructions will help you prepare your development environment.

  1. Install Python and virtualenv.

  2. Clone the python-docs-samples repository.

    git clone https://github.com/GoogleCloudPlatform/python-docs-samples.git
  3. Navigate to the sample code directory.

    cd python-docs-samples/dataflow/run_template
  4. Create a virtual environment and activate it.

    virtualenv env
    source env/bin/activate

    Once you are done, you can deactivate the virtualenv and go back to your global Python environment by running deactivate.

  5. Install the sample requirements.

    pip install -U -r requirements.txt

Running locally

To run a Dataflow template from the command line:

NOTE: To run locally, you'll need to create a service account key as a JSON file, then export an environment variable called GOOGLE_APPLICATION_CREDENTIALS pointing to that key file.
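
For example, assuming a service account named sa-name and a key file key.json (both placeholder names), the credentials setup might look like this:

gcloud iam service-accounts keys create key.json \
  --iam-account sa-name@<your-gcp-project>.iam.gserviceaccount.com
export GOOGLE_APPLICATION_CREDENTIALS=$PWD/key.json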

python main.py \
  --project <your-gcp-project> \
  --job wordcount-$(date +'%Y%m%d-%H%M%S') \
  --template gs://dataflow-templates/latest/Word_Count \
  --inputFile gs://apache-beam-samples/shakespeare/kinglear.txt \
  --output gs://<your-gcs-bucket>/wordcount/outputs
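
After launching, you can check on the job with the regular Dataflow CLI (not part of this sample), for example:

gcloud dataflow jobs list --project <your-gcp-project>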

Running in Python

To run a Dataflow template from Python:

NOTE: To run locally, you'll need to create a service account key as a JSON file, then export an environment variable called GOOGLE_APPLICATION_CREDENTIALS pointing to that key file.

import main as run_template

run_template.run(
    project='your-gcp-project',
    job='unique-job-name',
    template='gs://dataflow-templates/latest/Word_Count',
    parameters={
        'inputFile': 'gs://apache-beam-samples/shakespeare/kinglear.txt',
        'output': 'gs://<your-gcs-bucket>/wordcount/outputs',
    }
)
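
Under the hood, launching a template boils down to a call to the Dataflow templates.launch REST API. Below is a minimal sketch of such a call using google-api-python-client; it is an illustration and not necessarily identical to what main.py does:

# Minimal sketch (not necessarily identical to main.py): launch a Dataflow
# template through the Dataflow v1b3 REST API via google-api-python-client.
from googleapiclient.discovery import build

def launch_template(project, job, template, parameters):
    # Build a Dataflow API client; credentials come from
    # GOOGLE_APPLICATION_CREDENTIALS (Application Default Credentials).
    dataflow = build('dataflow', 'v1b3')
    request = dataflow.projects().templates().launch(
        projectId=project,
        gcsPath=template,  # gs:// path of the template to launch
        body={'jobName': job, 'parameters': parameters},
    )
    return request.execute()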

Running in Cloud Functions

To deploy this into a Cloud Function and run a Dataflow template via an HTTP request as a REST API:

PROJECT=$(gcloud config get-value project)
REGION=$(gcloud config get-value functions/region)

# Deploy the Cloud Function.
gcloud functions deploy run_template \
  --runtime python37 \
  --trigger-http \
  --region $REGION

# Call the Cloud Function via an HTTP request.
curl -X POST "https://$REGION-$PROJECT.cloudfunctions.net/run_template" \
  -d project=$PROJECT \
  -d job=wordcount-$(date +'%Y%m%d-%H%M%S') \
  -d template=gs://dataflow-templates/latest/Word_Count \
  -d inputFile=gs://apache-beam-samples/shakespeare/kinglear.txt \
  -d output=gs://<your-gcs-bucket>/wordcount/outputs
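
For the deployment above to work, main.py must expose an HTTP entry point named run_template. A rough, illustrative sketch (not a copy of the sample's code) that reuses the hypothetical launch_template helper from the sketch above:

def run_template(request):
    # Cloud Functions passes a flask.Request; request.values merges the
    # query string and form data, so both GET and POST parameters work.
    params = request.values.to_dict()
    project = params.pop('project')
    job = params.pop('job')
    template = params.pop('template')
    # Everything left over is forwarded as template parameters
    # (e.g. inputFile and output for Word_Count).
    response = launch_template(project, job, template, params)
    return str(response)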