These Dataflow templates are an effort to solve simple, but large, in-Cloud data tasks, including data import/export/backup/restore and bulk API operations, without a development environment. The technology under the hood which makes these operations possible is the Google Cloud Dataflow service combined with a set of Apache Beam SDK templated pipelines.
Google is providing this collection of pre-implemented Dataflow templates as a reference and to provide easy customization for developers wanting to extend their functionality.
As of November 18, 2021, our default branch is now named "main". This does not affect forks. If you would like your fork and its local clone to reflect these changes you can follow GitHub's branch renaming guide.
Maven commands should be run on the parent POM. An example would be:
mvn clean package -pl v2/pubsub-binary-to-bigquery -am
- Get Started
- Process Data Continuously (stream)
- Azure Eventhub to Pubsub
- Bigtable Change Streams to HBase Replicator
- Cloud Bigtable change streams to BigQuery
- Cloud Bigtable change streams to Cloud Storage
- Cloud Spanner change streams to BigQuery
- Cloud Spanner change streams to Cloud Storage
- Cloud Spanner change streams to Pub/Sub
- Cloud Storage Text to BigQuery (Stream)
- Data Masking/Tokenization from Cloud Storage to BigQuery (using Cloud DLP)
- Datastream to BigQuery
- Datastream to Cloud Spanner
- Datastream to SQL
- JMS to Pubsub
- Kafka to BigQuery
- Kafka to Cloud Storage
- Kinesis To Pubsub
- MongoDB to BigQuery (CDC)
- Mqtt to Pubsub
- Ordered change stream buffer to Source DB
- Pub/Sub Avro to BigQuery
- Pub/Sub CDC to Bigquery
- Pub/Sub Proto to BigQuery
- Pub/Sub Subscription or Topic to Text Files on Cloud Storage
- Pub/Sub Subscription to BigQuery
- Pub/Sub Topic to BigQuery
- Pub/Sub to Avro Files on Cloud Storage
- Pub/Sub to Datadog
- Pub/Sub to Elasticsearch
- Pub/Sub to JDBC
- Pub/Sub to Kafka
- Pub/Sub to MongoDB
- Pub/Sub to Pub/Sub
- Pub/Sub to Redis
- Pub/Sub to Splunk
- Pub/Sub to Text Files on Cloud Storage
- Pubsub to JMS
- Spanner Change Streams to Sink
- Synchronizing CDC data to BigQuery
- Text Files on Cloud Storage to Pub/Sub
- Process Data in Bulk (batch)
- AstraDB to BigQuery
- Avro Files on Cloud Storage to Cloud Bigtable
- Avro Files on Cloud Storage to Cloud Spanner
- BigQuery export to Parquet (via Storage API)
- BigQuery to Bigtable
- BigQuery to Datastore
- BigQuery to Elasticsearch
- BigQuery to MongoDB
- BigQuery to TensorFlow Records
- Cassandra to Cloud Bigtable
- Cloud Bigtable to Avro Files in Cloud Storage
- Cloud Bigtable to Parquet Files on Cloud Storage
- Cloud Bigtable to SequenceFile Files on Cloud Storage
- Cloud Spanner to Avro Files on Cloud Storage
- Cloud Spanner to Text Files on Cloud Storage
- Cloud Storage To Splunk
- Cloud Storage to Elasticsearch
- Dataplex JDBC Ingestion
- Dataplex: Convert Cloud Storage File Format
- Dataplex: Tier Data from BigQuery to Cloud Storage
- Firestore (Datastore mode) to BigQuery
- Firestore (Datastore mode) to Text Files on Cloud Storage
- Google Ads to BigQuery
- Google Cloud to Neo4j
- JDBC to BigQuery
- JDBC to BigQuery with BigQuery Storage API support
- JDBC to Pub/Sub
- MongoDB to BigQuery
- MySQL to BigQuery
- Parquet Files on Cloud Storage to Cloud Bigtable
- PostgreSQL to BigQuery
- SQLServer to BigQuery
- SequenceFile Files on Cloud Storage to Cloud Bigtable
- Text Files on Cloud Storage to BigQuery
- Text Files on Cloud Storage to BigQuery with BigQuery Storage API support
- Text Files on Cloud Storage to Cloud Spanner
- Text Files on Cloud Storage to Firestore (Datastore mode)
- Utilities
- Legacy Templates
For documentation on each template's usage and parameters, please see the official docs.
- Java 11
- Maven 3
Build the entire project using the Maven compile command.
mvn clean compile
IntelliJ, by default, will often skip necessary Maven goals, leading to build failures. You can fix these in the Maven view by going to Module_Name > Plugins > Plugin_Name where Module_Name and Plugin_Name are the names of the respective module and plugin with the rule. From there, right-click the rule and select "Execute Before Build".
The known rules that require this are:
- common > Plugins > protobuf > protobuf:compile
- common > Plugins > protobuf > protobuf:test-compile
From either the root directory or v2/ directory, run:
mvn spotless:apply
This will format the code and add a license header. To verify that the code is formatted correctly, run:
mvn spotless:check
Once the template is staged on Google Cloud Storage, it can then be executed using the gcloud CLI tool. Please check Running classic templates or Using Flex Templates for more information.
The Templates plugin was created to make the workflow of creating, testing, and releasing templates easier.
Before using the plugin, please make sure that the gcloud CLI is installed and up-to-date, and that the client is properly authenticated using:
gcloud init
gcloud auth application-default login
After authenticating, install the plugin into your local repository:
mvn clean install -pl plugins/templates-maven-plugin -am
To stage a Template, it is necessary to upload the images to Artifact Registry (for Flex templates) and copy the template to Cloud Storage.
Although the steps differ depending on the kind of template being developed, the plugin allows a template to be staged using the following single command:
mvn clean package -PtemplatesStage \
-DskipTests \
-DprojectId="{projectId}" \
-DbucketName="{bucketName}" \
-DstagePrefix="images/$(date +%Y_%m_%d)_01" \
-DtemplateName="Cloud_PubSub_to_GCS_Text_Flex" \
-pl v2/googlecloud-to-googlecloud -am
Notes:
- Change -pl v2/googlecloud-to-googlecloud and -DtemplateName to point to the specific Maven module where your template is located. Even though -pl is not required, it allows the command to run considerably faster.
- In case -DtemplateName is not specified, all templates for the module will be staged.
This repository can generate a Terraform module that prompts users for template-specific parameters and launches a Dataflow job. To generate a template-specific Terraform module, see the instructions for classic and flex templates below.
The required plugin artifact dependencies are listed below:
- plugins/core-plugin/src/main/resources/terraform-classic-template.tf
- plugins/core-plugin/src/main/resources/terraform-classic-template.tf
These are outputs from the cicd/cmd/run-terraform-schema. See cicd/cmd/run-terraform-schema/README.md for further details.
mvn clean prepare-package \
-DskipTests \
-PtemplatesTerraform \
-pl v1 -am
Next, terraform fmt the modules after generating:
terraform fmt -recursive v1
The resulting terraform modules are generated in v1/terraform.
mvn clean prepare-package \
-DskipTests \
-PtemplatesTerraform \
-pl v2/googlecloud-to-googlecloud -am
Next, terraform fmt the modules after generating:
terraform fmt -recursive v2
The resulting terraform modules are generated in v2/<source>-to-<sink>/terraform, for example v2/bigquery-to-bigtable/terraform.
Notes:
- Change -pl v2/googlecloud-to-googlecloud and -DtemplateName to point to the specific Maven module where your template is located.
A template can also be executed on Dataflow directly from the command line. The command is similar to the one used for staging a template, but -Dparameters must also be specified with the parameters that will be used when launching the template. For example:
mvn clean package -PtemplatesRun \
-DskipTests \
-DprojectId="{projectId}" \
-DbucketName="{bucketName}" \
-Dregion="us-central1" \
-DtemplateName="Cloud_PubSub_to_GCS_Text_Flex" \
-Dparameters="inputTopic=projects/{projectId}/topics/{topicName},windowDuration=15s,outputDirectory=gs://{outputDirectory}/out,outputFilenamePrefix=output-,outputFilenameSuffix=.txt" \
-pl v2/googlecloud-to-googlecloud -am
Notes:
- When running a template, -DtemplateName is mandatory, as the values for -Dparameters= differ across templates.
- -PtemplatesRun is self-contained, i.e., it is not required to run Deploying/Staging Templates beforehand. In case you want to run a previously staged template, the existing path can be provided as -DspecPath=gs://.../path
- -DjobName="{name}" may be provided if a specific job name is desired (optional).
- If you encounter the error Template run failed: File too large, try adding -DskipShade to the mvn args.
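The -Dparameters value in the example above is a comma-separated list of key=value pairs. As a minimal sketch, the string could be assembled from a plain object with a small script like the one below. The helper name buildParameters, the project, topic, and bucket names are hypothetical, and the sketch assumes that no value contains a comma, since escaping rules are not covered here.

```javascript
/**
 * Builds a comma-separated key=value string from a plain object.
 * Assumption: values contain no commas; this sketch rejects them
 * instead of guessing at an escaping rule.
 */
function buildParameters(params) {
  return Object.keys(params)
    .map(function (key) {
      var value = String(params[key]);
      if (value.indexOf(",") !== -1) {
        throw new Error("Value for '" + key + "' contains a comma: " + value);
      }
      return key + "=" + value;
    })
    .join(",");
}

var parameters = buildParameters({
  inputTopic: "projects/my-project/topics/my-topic", // hypothetical project and topic
  windowDuration: "15s",
  outputDirectory: "gs://my-bucket/out", // hypothetical bucket
  outputFilenamePrefix: "output-",
  outputFilenameSuffix: ".txt"
});

// Prints the argument in the form expected on the mvn command line.
console.log('-Dparameters="' + parameters + '"');
```

Keeping the parameters in an object makes it easier to review them individually before pasting the resulting argument into the mvn command.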
To run integration tests, the developer plugin can also be used to stage the template on demand (in case the parameter -DspecPath= is not specified).
For example, to run all the integration tests in a specific module (in the example below, v2/googlecloud-to-googlecloud):
mvn clean verify \
-PtemplatesIntegrationTests \
-Dproject="{project}" \
-DartifactBucket="{bucketName}" \
-Dregion=us-central1 \
-pl v2/googlecloud-to-googlecloud -am
The parameter -Dtest= can be given to test a single class (e.g., -Dtest=PubsubToTextIT) or a single test case (e.g., -Dtest=PubsubToTextIT#testTopicToGcs).
The same applies when the test is executed from an IDE; just make sure to add the parameters -Dproject=, -DartifactBucket= and -Dregion= as program or VM arguments.
A template requires more information than just a name and description. For example, in order to be used from the Dataflow UI, parameters need a longer help text to guide users, as well as proper types and validations to make sure parameters are being passed correctly.
We introduced annotations to have the source code as a single source of truth, along with a set of utilities / plugins to generate template-accompanying artifacts (such as command specs, parameter specs).
Every template must be annotated with @Template. Existing templates can be used for reference, but the structure is as follows:
@Template(
name = "BigQuery_to_Elasticsearch",
category = TemplateCategory.BATCH,
displayName = "BigQuery to Elasticsearch",
description = "A pipeline which sends BigQuery records into an Elasticsearch instance as JSON documents.",
optionsClass = BigQueryToElasticsearchOptions.class,
flexContainerName = "bigquery-to-elasticsearch")
public class BigQueryToElasticsearch {
A set of @TemplateParameter.{Type} annotations was created to allow the definition of options for a template, their proper rendering in the UI, and validation by the template launch service. Examples can be found in the repository, but the general structure is as follows:
@TemplateParameter.Text(
order = 2,
optional = false,
regexes = {"[,a-zA-Z0-9._-]+"},
description = "Kafka topic(s) to read the input from",
helpText = "Kafka topic(s) to read the input from.",
example = "topic1,topic2")
@Validation.Required
String getInputTopics();
@TemplateParameter.GcsReadFile(
order = 1,
description = "Cloud Storage Input File(s)",
helpText = "Path of the file pattern glob to read from.",
example = "gs://your-bucket/path/*.csv")
String getInputFilePattern();
@TemplateParameter.Boolean(
order = 11,
optional = true,
description = "Whether to use column alias to map the rows.",
helpText = "If enabled (set to true) the pipeline will consider column alias (\"AS\") instead of the column name to map the rows to BigQuery.")
@Default.Boolean(false)
Boolean getUseColumnAlias();
@TemplateParameter.Enum(
order = 21,
enumOptions = {"INDEX", "CREATE"},
optional = true,
description = "Build insert method",
helpText = "Whether to use INDEX (index, allows upsert) or CREATE (create, errors on duplicate _id) with Elasticsearch bulk requests.")
@Default.Enum("CREATE")
BulkInsertMethodOptions getBulkInsertMethod();
Note: order is relevant for templates that can be used from the UI, and specifies the relative order of parameters.
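To get a feel for what the regexes attribute of the Text parameter above accepts, the pattern can be tried against sample values. The snippet below is only an illustration: it assumes the whole value has to match the pattern (hence the added ^ and $ anchors), and the function name isValidInputTopics is hypothetical. How the template launch service actually applies the pattern is not described here.

```javascript
// Pattern copied from the regexes attribute of the Text parameter above.
// Assumption: the entire value must match, so the pattern is anchored with ^ and $.
var inputTopicsPattern = /^[,a-zA-Z0-9._-]+$/;

// Hypothetical helper that reports whether a value satisfies the pattern.
function isValidInputTopics(value) {
  return inputTopicsPattern.test(value);
}

console.log(isValidInputTopics("topic1,topic2")); // true
console.log(isValidInputTopics("topic 1"));       // false: spaces are not in the character class
console.log(isValidInputTopics(""));              // false: at least one character is required
```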
The @TemplateIntegrationTest annotation should be used by classes that are used for integration tests of other templates. It wires a specific IT class to a template, and allows environment preparation and proper template staging before tests are executed on Dataflow.
Template tests have to follow this general format (please note the @TemplateIntegrationTest annotation and the TemplateTestBase super-class):
@TemplateIntegrationTest(PubsubToText.class)
@RunWith(JUnit4.class)
public final class PubsubToTextIT extends TemplateTestBase {
Please refer to Templates Plugin to use and validate such annotations.
User-defined functions (UDFs) allow you to customize a template's functionality by providing a short JavaScript function without having to maintain the entire codebase. This is useful in situations where you'd like to rename fields, filter values, or even transform data formats before output to the destination. All UDFs are executed by providing the payload of the element as a string to the JavaScript function. You can then use JavaScript's built-in JSON parser or other system functions to transform the data prior to the pipeline's output. The return statement of a UDF specifies the payload to pass forward in the pipeline. This should always be a string value. If no value is returned or the function returns undefined, the incoming record will be filtered from the output.
Template | UDF Input Type | Input Description | UDF Output Type | Output Description |
---|---|---|---|---|
Datastore Bulk Delete | String | A JSON string of the entity | String | A JSON string of the entity to delete; filter entities by returning undefined |
Datastore to Pub/Sub | String | A JSON string of the entity | String | The payload to publish to Pub/Sub |
Datastore to GCS Text | String | A JSON string of the entity | String | A single-line within the output file |
GCS Text to BigQuery | String | A single-line within the input file | String | A JSON string which matches the destination table's schema |
Pub/Sub to BigQuery | String | A string representation of the incoming payload | String | A JSON string which matches the destination table's schema |
Pub/Sub to Datastore | String | A string representation of the incoming payload | String | A JSON string of the entity to write to Datastore |
Pub/Sub to Splunk | String | A string representation of the incoming payload | String | The event data to be sent to Splunk HEC events endpoint. Must be a string or a stringified JSON object |
For a comprehensive list of samples, please check our udf-samples folder.
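As an illustration of the GCS Text to BigQuery row in the table above, the sketch below turns one line of a CSV file into a JSON string. The column layout (id, name, amount) is a hypothetical schema chosen for the example, and the sketch assumes fields never contain quoted commas. Lines with an unexpected number of columns return undefined, which filters them from the output as described earlier.

```javascript
/**
 * Converts a single CSV line into a JSON string.
 * Assumptions: the columns are id,name,amount (hypothetical schema)
 * and no field contains a quoted comma.
 * @param {string} line
 * @return {string|undefined} outJson, or undefined to drop the line
 */
function transform(line) {
  var values = line.split(",");
  if (values.length !== 3) {
    // Returning undefined filters the record from the output.
    return undefined;
  }
  var obj = {
    id: values[0].trim(),
    name: values[1].trim(),
    amount: parseFloat(values[2])
  };
  return JSON.stringify(obj);
}

console.log(transform("1, Alice, 9.5")); // {"id":"1","name":"Alice","amount":9.5}
console.log(transform("malformed line")); // undefined
```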
/**
* A transform which adds a field to the incoming data.
* @param {string} inJson
* @return {string} outJson
*/
function transform(inJson) {
var obj = JSON.parse(inJson);
obj.dataFeed = "Real-time Transactions";
obj.dataSource = "POS";
return JSON.stringify(obj);
}
/**
* A transform function which only accepts 42 as the answer to life.
* @param {string} inJson
* @return {string} outJson
*/
function transform(inJson) {
var obj = JSON.parse(inJson);
// only output objects which have an answer to life of 42.
if (obj.hasOwnProperty('answerToLife') && obj.answerToLife === 42) {
return JSON.stringify(obj);
}
}
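Renaming a field follows the same pattern as the two examples above. The sketch below renames a hypothetical field ts to eventTimestamp, then calls the function with string payloads, the way a UDF receives them, so it can be checked locally (for example with Node.js) before being used with a template. The field names and sample payloads are made up for the example.

```javascript
/**
 * A transform which renames the hypothetical field "ts" to "eventTimestamp".
 * @param {string} inJson
 * @return {string} outJson
 */
function transform(inJson) {
  var obj = JSON.parse(inJson);
  if (obj.hasOwnProperty("ts")) {
    obj.eventTimestamp = obj.ts;
    delete obj.ts;
  }
  return JSON.stringify(obj);
}

// Local check: pass each payload as a string, as a UDF would receive it.
var samples = ['{"ts": 1700000000, "user": "a"}', '{"user": "b"}'];
samples.forEach(function (payload) {
  console.log(transform(payload));
});
```

Records without the ts field pass through unchanged, so the function can be applied to mixed input.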
This repository contains generated documentation, which contains a list of parameters and instructions on how to customize and/or build every template.
To generate the documentation for all templates, the following command can be used:
mvn clean prepare-package \
-DskipTests \
-PtemplatesSpec
Templates are released on a weekly basis (best-effort) as part of the efforts to keep Google-provided Templates updated with the latest fixes and improvements.
If desired, you can stage and use your own changes using the Staging (Deploying) Templates steps.
To release multiple templates, we provide a single Maven command, which is a shortcut to stage all templates while running additional validations.
mvn clean verify -PtemplatesRelease \
-DprojectId="{projectId}" \
-DbucketName="{bucketName}" \
-DlibrariesBucketName="{bucketName}-libraries" \
-DstagePrefix="$(date +%Y_%m_%d)-00_RC00"
As part of the Templates development process, we release the common artifact snapshots to Maven Central, not modules that contain finalized templates. This allows users to consume those resources and modules without forking the entire project, while keeping artifacts at a reasonable size.
In order to release artifacts, ~/.m2/settings.xml should be configured to contain Sonatype's username and password:
<servers>
<server>
<id>ossrh</id>
<username>(user)</username>
<password>(password)</password>
</server>
</servers>
And the command to release (for example, the development plugin and Spanner together):
mvn clean deploy -am -Prelease \
-pl plugins/templates-maven-plugin \
-pl v2/spanner-common
If you intend to use those resources in an external project, your pom.xml should include:
<repositories>
<repository>
<id>ossrh</id>
<url>https://oss.sonatype.org/content/repositories/snapshots</url>
</repository>
</repositories>
<pluginRepositories>
<pluginRepository>
<id>ossrh</id>
<url>https://oss.sonatype.org/content/repositories/snapshots</url>
</pluginRepository>
</pluginRepositories>
- Dataflow Templates - basic template concepts.
- Google-provided Templates - official documentation for templates provided by Google (the source code is in this repository).
- Dataflow Cookbook: Blog, GitHub Repository - pipeline examples and practical solutions to common data processing challenges.
- Dataflow Metrics Collector - CLI tool to collect Dataflow resource and execution metrics and export them to either BigQuery or Google Cloud Storage. Useful for comparing and visualizing metrics while benchmarking Dataflow pipelines using various data formats, resource configurations, etc.
- Apache Beam
- Overview
- Quickstart: Java, Python, Go
- Tour of Beam - an interactive tour with learning topics covering core Beam concepts from simple ones to more advanced ones.
- Beam Playground - an interactive environment to try out Beam transforms and examples without having to install Apache Beam.
- Beam College - hands-on training and practical tips, including video recordings of Apache Beam and Dataflow Templates lessons.
- Getting Started with Apache Beam - Quest - A 5 lab series that provides a Google Cloud certified badge upon completion.