main.py - Script to run an Apache Beam template on Google Cloud Dataflow.

The following examples show how to run the Word_Count template, but you can run any other template.

For the Word_Count template, you must pass an output Cloud Storage path prefix, and you can optionally pass an inputFile Cloud Storage file pattern for the inputs. If inputFile is not passed, it defaults to gs://apache-beam-samples/shakespeare/kinglear.txt.
- Install the Cloud SDK.
- Enable the APIs: Dataflow, Compute Engine, Stackdriver Logging, Cloud Storage, Cloud Storage JSON, BigQuery, Pub/Sub, Datastore, Cloud Functions, and Cloud Resource Manager (see the gcloud command after this list).
- Set up the Cloud SDK for your GCP project.
  gcloud init
- Create a Cloud Storage bucket.
  gsutil mb gs://your-gcs-bucket
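The APIs listed above can also be enabled from the command line with gcloud. A sketch of that command follows; the service names below are the usual ones, but double-check them against your project:

gcloud services enable dataflow.googleapis.com compute.googleapis.com \
    logging.googleapis.com storage-component.googleapis.com \
    storage-api.googleapis.com bigquery.googleapis.com \
    pubsub.googleapis.com datastore.googleapis.com \
    cloudfunctions.googleapis.com cloudresourcemanager.googleapis.com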
The following instructions will help you prepare your development environment.
- Clone the python-docs-samples repository.
  git clone https://github.com/GoogleCloudPlatform/python-docs-samples.git
- Navigate to the sample code directory.
  cd python-docs-samples/dataflow/run_template
- Create a virtual environment and activate it.
  virtualenv env
  source env/bin/activate
  Once you are done, you can deactivate the virtualenv and go back to your global Python environment by running deactivate.
- Install the sample requirements.
  pip install -U -r requirements.txt
To run a Dataflow template from the command line:
NOTE: To run locally, you'll need to create a service account key as a JSON file. Then export an environment variable called GOOGLE_APPLICATION_CREDENTIALS pointing to your service account key file.
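For example, assuming a key file downloaded from the Cloud Console (the path is a placeholder):

export GOOGLE_APPLICATION_CREDENTIALS=path/to/your-service-account-key.json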
python main.py \
--project <your-gcp-project> \
--job wordcount-$(date +'%Y%m%d-%H%M%S') \
--template gs://dataflow-templates/latest/Word_Count \
--inputFile gs://apache-beam-samples/shakespeare/kinglear.txt \
--output gs://<your-gcs-bucket>/wordcount/outputs
To run a Dataflow template from Python:
NOTE: To run locally, you'll need to create a service account key as a JSON file. Then export an environment variable called GOOGLE_APPLICATION_CREDENTIALS pointing to your service account key file, as shown above.
import main as run_template

run_template.run(
    project='your-gcp-project',
    job='unique-job-name',
    template='gs://dataflow-templates/latest/Word_Count',
    parameters={
        'inputFile': 'gs://apache-beam-samples/shakespeare/kinglear.txt',
        'output': 'gs://<your-gcs-bucket>/wordcount/outputs',
    }
)
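Under the hood, run() launches the template through the Dataflow REST API. A minimal sketch of such a function, assuming the google-api-python-client library (the sample's actual implementation may differ):

from googleapiclient.discovery import build

def run(project, job, template, parameters=None):
    """Launches a Dataflow template job via the Dataflow v1b3 REST API."""
    # Build a Dataflow API client; credentials are resolved from the
    # environment (e.g. GOOGLE_APPLICATION_CREDENTIALS).
    dataflow = build('dataflow', 'v1b3')

    # projects().templates().launch() reads the template staged at gcsPath
    # and starts a job with the given name and runtime parameters.
    request = dataflow.projects().templates().launch(
        projectId=project,
        gcsPath=template,
        body={
            'jobName': job,
            'parameters': parameters or {},
        },
    )
    return request.execute()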
To deploy this as a Cloud Function and run a Dataflow template via an HTTP request:
PROJECT=$(gcloud config get-value project) \
REGION=$(gcloud config get-value functions/region)
# Deploy the Cloud Function.
gcloud functions deploy run_template \
--runtime python37 \
--trigger-http \
--region $REGION
# Call the Cloud Function via an HTTP request.
curl -X POST "https://$REGION-$PROJECT.cloudfunctions.net/run_template" \
-d project=$PROJECT \
-d job=wordcount-$(date +'%Y%m%d-%H%M%S') \
-d template=gs://dataflow-templates/latest/Word_Count \
-d inputFile=gs://apache-beam-samples/shakespeare/kinglear.txt \
-d output=gs://<your-gcs-bucket>/wordcount/outputs
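The run_template entry point deployed above is a thin HTTP wrapper around run(). A minimal sketch, assuming the request arrives as form data (as in the curl call above) and that every field other than project, job, and template is treated as a template parameter:

def run_template(request):
    """HTTP Cloud Function that launches a Dataflow template job.

    Args:
        request (flask.Request): the form-encoded request sent by curl -d.
    """
    # Copy the form fields so the non-template arguments can be popped off.
    parameters = dict(request.form)
    project = parameters.pop('project')
    job = parameters.pop('job')
    template = parameters.pop('template')
    # Whatever remains (e.g. inputFile, output) is passed to the template.
    response = run(project, job, template, parameters)
    return str(response)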