Rule Bindings:
Defines a single Data Quality validation routine.
Each value declared in entity_id
, column_id
, filter_id
, and rule_id
is a lookup key for the more detailed configurations that must be defined in their respective configurations files.
rule_bindings:
T2_DQ_1_EMAIL:
entity_id: TEST_TABLE
column_id: VALUE
row_filter_id: DATA_TYPE_EMAIL
rule_ids:
- NOT_NULL_SIMPLE
- REGEX_VALID_EMAIL
- CUSTOM_SQL_LENGTH_LE_30
metadata:
brand: one
T3_DQ_1_EMAIL_DUPLICATE:
entity_id: TEST_TABLE
column_id: VALUE
row_filter_id: DATA_TYPE_EMAIL
rule_ids:
- NO_DUPLICATES_IN_COLUMN_GROUPS:
column_names: "value"
- NOT_NULL_SIMPLE
metadata:
brand: one
Rules: Defines reusable sets of validation logic for validating data.
rules:
NOT_NULL_SIMPLE:
rule_type: NOT_NULL
REGEX_VALID_EMAIL:
rule_type: REGEX
params:
pattern: |-
^[^@]+[@]{1}[^@]+$
CUSTOM_SQL_LENGTH_LE_30:
rule_type: CUSTOM_SQL_EXPR
params:
custom_sql_expr: |-
LENGTH( $column ) <= 30
NO_DUPLICATES_IN_COLUMN_GROUPS:
rule_type: CUSTOM_SQL_STATEMENT
params:
custom_sql_arguments:
- column_names
custom_sql_statement: |-
select
$column_names
from data
group by $column_names
having count(*) > 1
We will add more default rule types over time. For the time being, you can create arbitrarily complex SQL rules with CUSTOM_SQL_EXPR
and CUSTOM_SQL_STATEMENT
. Only CUSTOM_SQL_STATEMENT
supports parametrization.
Filters: Defines how each rule binding can be filtered
row_filters:
NONE:
filter_sql_expr: |-
True
DATA_TYPE_EMAIL:
filter_sql_expr: |-
contact_type = 'email'
Entities: defines the target data tables as validation target.
entities:
TEST_TABLE:
source_database: BIGQUERY
table_name: contact_details
database_name: dq_test
instance_name: kthxbayes-sandbox
environment_override:
TEST:
environment: test
override:
database_name: does_not_exists
instance_name: does_not_exists
columns:
KEY:
name: key
data_type: STRING
description: |-
contact detail type
VALUE:
name: value
data_type: STRING
description: |-
contact detail
TS:
name: ts
data_type: DATETIME
description: |-
updated timestamp
The project uses dbt to execute SQL queries against BigQuery.
To set up DBT, you must configure connection profiles to a BigQuery project on GCP using a profiles.yml
file.
You can read about dbt's profiles.yml configurations for BigQuery here.
You can start by copying the template at:
cp profiles.yml.template profiles.yml
The project
and dataset
configurations in profile.yml
is where the DQ summary output table will be created.
To create the test dataset used in the code's test-suites and in the following examples, run (after installing dbt
in a Python virtualenv with make install
):
dbt seed
This project uses Bazel to build an executable python binary including its interpreter and all dependencies, including dbt.
It depends on go get
to install a local bazelisk binary to build Bazel.
A Makefile
is provided to simplify build and development steps.
To build the module:
make build
and run it (this will show the help text):
python bazel-bin/clouddq/clouddq_patched.zip
The Python zip have been tested with Python versions >=2.7.17
and >=3.8.6
. The make build
command have been tested to work on debian
, ubuntu
, and mac-os
. There is no windows
support yet.
The Python zip includes the top-level macros
directory, which will be hard-coded in the executable and cannot be changed at runtime.
The CLI expects the paths for dbt_profiles_dir
, dbt_path
, environment_target
, and configs_path
to be provided at runtime. These directories can be customized to meet your needs.
Example command to execute two rule binding:
python bazel-bin/clouddq/clouddq_patched.zip \
T2_DQ_1_EMAIL,T3_DQ_1_EMAIL_DUPLICATE \
configs \
--metadata='{"test":"test"}' \
--dbt_profiles_dir=. \
--dbt_path=dbt \
--environment_target=dev
Example command to execute all rule bindings in a path containing multiple YAML config files:
python bazel-bin/clouddq/clouddq_patched.zip \
ALL \
configs \
--metadata='{"test":"test"}' \
--dbt_profiles_dir=. \
--dbt_path=dbt \
--environment_target=dev
To run the CLI without building the zip file:
bin/bazelisk run //clouddq:clouddq -- \
T2_DQ_1_EMAIL \
$(pwd)/configs/rule_bindings/team-2-rule-bindings.yml \
--metadata='{"test":"test"}' \
--dbt_profiles_dir="$(pwd)" \
--dbt_path="$(pwd)/dbt" \
--environment_target=dev
As bazel run
execute the code in a sandbox, non-absolute paths will be relative to the sandbox path not the current path. Ensure you are passing absolute paths to the command line arguments such as dbt_profiles_dir
, dbt_path
, configs_path
, etc... Additionally, you may want to run source scripts/common.sh && confirm_gcloud_login
to check that your gcloud
credentials are set up correctly.
We do not provide a make run
command because make
is not designed to allow input arguments to the CLI.
To run unit tests:
make test
To run a particular test:
make test test_templates
To apply linting:
make lint
See more options at:
make help
Note that you may want to update the bazel cache in .bazelrc
to a GCS bucket you have access to.
build --remote_cache=https://storage.googleapis.com/<your_gcs_bucket_name>
To install clouddq
in a local python virtualenv using poetry:
make install
Running make install
will automatically enter a new virtualenv using poetry shell
before installing clouddq
in the virtualenv. If Python version 3.8.6
is not found, the script will promt you on whether you want to install it using pyenv
.
To enter the virtualenv by spawning a new shell at "$SHELL", run:
poetry shell
To exit the virtualenv, run:
exit
To unset the local pyenv
interpreter, run:
pyenv local --unset
CloudDQ is licensed under the Apache License version 2.0. This is not an official Google product.