This sample demonstrates how to use customer-managed encryption keys (CMEK) with the I/O connectors in an Apache Beam pipeline. For more information, see the Using customer-managed encryption keys docs page.
Follow the Getting started with Google Cloud Dataflow page, and make sure you have a Google Cloud project with billing enabled and a service account JSON key set up in your GOOGLE_APPLICATION_CREDENTIALS environment variable.
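As an optional sanity check (not part of the sample), you can verify from Python that Application Default Credentials resolve, for example via the GOOGLE_APPLICATION_CREDENTIALS variable:

```py
# Optional check (not part of the sample): confirm that Application Default
# Credentials are found, e.g. through GOOGLE_APPLICATION_CREDENTIALS.
import google.auth

credentials, project = google.auth.default()
print("Authenticated; default project:", project)
```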
Additionally, for this sample you need the following:
- Enable the BigQuery and Cloud KMS APIs.
- Create a Cloud Storage bucket.

  ```sh
  export BUCKET=your-gcs-bucket
  gsutil mb gs://$BUCKET
  ```
- Create a symmetric key ring and key. For best results, use a regional location. This example uses a `global` key for simplicity. (If you prefer to do this from Python, see the sketch after this list.)

  ```sh
  export KMS_KEYRING=samples-keyring
  export KMS_KEY=samples-key

  # Create a key ring.
  gcloud kms keyrings create $KMS_KEYRING --location global

  # Create a key.
  gcloud kms keys create $KMS_KEY --location global \
    --keyring $KMS_KEYRING --purpose encryption
  ```

  Note: Although you can destroy the key version material, you cannot delete keys and key rings. Key rings and keys do not have billable costs or quota limitations, so their continued existence does not impact costs or production limits.
- Grant Encrypter/Decrypter permissions to the Dataflow, Compute Engine, and BigQuery service accounts. This grants those service accounts permission to encrypt and decrypt with the CMEK you specify. The Dataflow workers use these service accounts when running the pipeline, which is different from the user service account used to start the pipeline.

  ```sh
  export PROJECT=$(gcloud config get-value project)
  export PROJECT_NUMBER=$(gcloud projects list --filter $PROJECT --format "value(PROJECT_NUMBER)")

  # Grant Encrypter/Decrypter permissions to the Dataflow service account.
  gcloud projects add-iam-policy-binding $PROJECT \
    --member serviceAccount:service-$PROJECT_NUMBER@dataflow-service-producer-prod.iam.gserviceaccount.com \
    --role roles/cloudkms.cryptoKeyEncrypterDecrypter

  # Grant Encrypter/Decrypter permissions to the Compute Engine service account.
  gcloud projects add-iam-policy-binding $PROJECT \
    --member serviceAccount:service-$PROJECT_NUMBER@compute-system.iam.gserviceaccount.com \
    --role roles/cloudkms.cryptoKeyEncrypterDecrypter

  # Grant Encrypter/Decrypter permissions to the BigQuery service account.
  gcloud projects add-iam-policy-binding $PROJECT \
    --member serviceAccount:bq-$PROJECT_NUMBER@bigquery-encryption.iam.gserviceaccount.com \
    --role roles/cloudkms.cryptoKeyEncrypterDecrypter
  ```
- Clone the python-docs-samples repository.

  ```sh
  git clone https://github.com/GoogleCloudPlatform/python-docs-samples.git
  ```
- Navigate to the sample code directory.

  ```sh
  cd python-docs-samples/dataflow/encryption-keys
  ```
- Create a virtual environment and activate it.

  ```sh
  virtualenv env
  source env/bin/activate
  ```

  Once you are done, you can deactivate the virtualenv and go back to your global Python environment by running `deactivate`.

- Install the sample requirements.

  ```sh
  pip install -U -r requirements.txt
  ```
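If you would rather create the key ring and key from Python instead of gcloud, the following is a minimal sketch using the google-cloud-kms client library (not part of this sample; `pip install google-cloud-kms`). The project ID is a placeholder you would replace; the key ring, key, and location match the gcloud commands above.

```py
# Minimal sketch (not part of this sample): create the key ring and key with
# the google-cloud-kms client library instead of gcloud.
from google.cloud import kms

project_id = "your-project-id"  # placeholder: use your own project ID
location = "global"

client = kms.KeyManagementServiceClient()
location_name = f"projects/{project_id}/locations/{location}"

# Create a key ring.
key_ring = client.create_key_ring(
    request={"parent": location_name, "key_ring_id": "samples-keyring", "key_ring": {}}
)

# Create a symmetric encryption/decryption key in that key ring.
key = client.create_crypto_key(
    request={
        "parent": key_ring.name,
        "crypto_key_id": "samples-key",
        "crypto_key": {"purpose": kms.CryptoKey.CryptoKeyPurpose.ENCRYPT_DECRYPT},
    }
)

# The key's full resource name is what the pipeline's --kms_key flag expects.
print(key.name)
```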
The bigquery_kms_key.py sample gets some data from the NASA wildfires public BigQuery dataset using a customer-managed encryption key, and dumps that data into the specified `output_bigquery_table` using the same customer-managed encryption key.
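As a rough illustration of the approach, rather than the sample's exact code, the core of such a pipeline looks something like the sketch below. It assumes that your Apache Beam release's `ReadFromBigQuery` and `WriteToBigQuery` transforms accept a `kms_key` argument and that the public table and columns named here exist; check the sample source for the authoritative version.

```py
# Rough sketch of the approach (not the sample's exact code). Assumes your
# Apache Beam release's BigQuery transforms accept a kms_key argument.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run(output_bigquery_table, kms_key, pipeline_args=None):
    # Table and column names are illustrative; see the sample for the real query.
    query = """
        SELECT latitude, longitude
        FROM `bigquery-public-data.nasa_wildfire.past_week`
        LIMIT 10
    """
    options = PipelineOptions(pipeline_args, save_main_session=True)
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            # Read with the customer-managed key protecting temporary exports.
            | "Read from BigQuery" >> beam.io.ReadFromBigQuery(
                query=query, use_standard_sql=True, kms_key=kms_key)
            # Write to the output table, created with the same key.
            | "Write to BigQuery" >> beam.io.WriteToBigQuery(
                output_bigquery_table,
                schema="latitude:FLOAT,longitude:FLOAT",
                kms_key=kms_key,
            )
        )
```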
Make sure you have the following variables set up:
```sh
# Set the project ID, GCS bucket and KMS key.
export PROJECT=$(gcloud config get-value project)
export BUCKET=your-gcs-bucket

# Set the region for the Dataflow job.
# https://cloud.google.com/compute/docs/regions-zones/
export REGION=us-central1

# Set the KMS key ID.
export KMS_KEYRING=samples-keyring
export KMS_KEY=samples-key
export KMS_KEY_ID=$(gcloud kms keys list --location global --keyring $KMS_KEYRING --filter $KMS_KEY --format "value(NAME)")

# Output BigQuery dataset and table name.
export DATASET=samples
export TABLE=dataflow_kms
```
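The `--format "value(NAME)"` flag makes gcloud print the key's full resource name, so `echo $KMS_KEY_ID` should show something like `projects/$PROJECT/locations/global/keyRings/samples-keyring/cryptoKeys/samples-key`; if it is empty, re-check the key ring and key creation step.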
Create the BigQuery dataset where the output table resides.
```sh
# Create the BigQuery dataset.
bq mk --dataset $PROJECT:$DATASET
```
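Equivalently, if you are scripting in Python, a small sketch with the google-cloud-bigquery client library (not part of the sample) would be:

```py
# Sketch (not part of the sample): create the output dataset from Python.
from google.cloud import bigquery

project = "your-project-id"  # placeholder: use your own project ID
client = bigquery.Client(project=project)

# exists_ok=True makes the call idempotent if the dataset is already there.
client.create_dataset(f"{project}.samples", exists_ok=True)
```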
To run the sample using the Dataflow runner:

```sh
python bigquery_kms_key.py \
  --output_bigquery_table $PROJECT:$DATASET.$TABLE \
  --kms_key $KMS_KEY_ID \
  --project $PROJECT \
  --runner DataflowRunner \
  --temp_location gs://$BUCKET/samples/dataflow/kms/tmp \
  --region $REGION
```
Note: To run locally, you can omit the `--runner` command-line argument; it defaults to the `DirectRunner`.
You can check your submitted Cloud Dataflow jobs on the GCP Console Dataflow page or by using `gcloud`.

```sh
gcloud dataflow jobs list
```
Finally, check the contents of the BigQuery table. (The backticks are escaped so the shell does not treat them as command substitution.)

```sh
bq query --use_legacy_sql=false "SELECT * FROM \`$PROJECT.$DATASET.$TABLE\`"
```
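To confirm programmatically that the output table was created with your key, a small sketch using the google-cloud-bigquery client library (not part of the sample) could look like this:

```py
# Sketch (not part of the sample): check the output table's CMEK and preview rows.
from google.cloud import bigquery

project = "your-project-id"  # placeholders: use your own values
dataset = "samples"
table = "dataflow_kms"

client = bigquery.Client(project=project)
tbl = client.get_table(f"{project}.{dataset}.{table}")

# The encryption configuration carries the KMS key the table was created with.
config = tbl.encryption_configuration
print("KMS key:", config.kms_key_name if config else None)

# Preview a few rows.
for row in client.list_rows(tbl, max_results=5):
    print(dict(row))
```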
To avoid incurring charges to your GCP account for the resources used:
```sh
# PROJECT, BUCKET, DATASET, and TABLE are set in the variables step above;
# PROJECT_NUMBER comes from the permissions step.
export PROJECT_NUMBER=$(gcloud projects list --filter $PROJECT --format "value(PROJECT_NUMBER)")

# Remove only the files created by this sample.
gsutil -m rm -rf "gs://$BUCKET/samples/dataflow/kms"

# [optional] Remove the Cloud Storage bucket.
gsutil rb gs://$BUCKET

# Remove the BigQuery table.
bq rm -f -t $PROJECT:$DATASET.$TABLE

# [optional] Remove the BigQuery dataset and all its tables.
bq rm -rf -d $PROJECT:$DATASET

# Revoke Encrypter/Decrypter permissions from the Dataflow service account.
gcloud projects remove-iam-policy-binding $PROJECT \
  --member serviceAccount:service-$PROJECT_NUMBER@dataflow-service-producer-prod.iam.gserviceaccount.com \
  --role roles/cloudkms.cryptoKeyEncrypterDecrypter

# Revoke Encrypter/Decrypter permissions from the Compute Engine service account.
gcloud projects remove-iam-policy-binding $PROJECT \
  --member serviceAccount:service-$PROJECT_NUMBER@compute-system.iam.gserviceaccount.com \
  --role roles/cloudkms.cryptoKeyEncrypterDecrypter

# Revoke Encrypter/Decrypter permissions from the BigQuery service account.
gcloud projects remove-iam-policy-binding $PROJECT \
  --member serviceAccount:bq-$PROJECT_NUMBER@bigquery-encryption.iam.gserviceaccount.com \
  --role roles/cloudkms.cryptoKeyEncrypterDecrypter
```