Cloud Data Quality Engine

Introduction

Declarative Data Quality Configs

Rule Bindings: each rule binding defines a single Data Quality validation routine. The values declared in entity_id, column_id, row_filter_id, and rule_ids are lookup keys for the more detailed configurations that must be defined in their respective configuration files.

rule_bindings:
  T2_DQ_1_EMAIL:
    entity_id: TEST_TABLE
    column_id: VALUE
    row_filter_id: DATA_TYPE_EMAIL
    rule_ids:
      - NOT_NULL_SIMPLE
      - REGEX_VALID_EMAIL
      - CUSTOM_SQL_LENGTH_LE_30
    metadata:
      brand: one

  T3_DQ_1_EMAIL_DUPLICATE:
    entity_id: TEST_TABLE
    column_id: VALUE
    row_filter_id: DATA_TYPE_EMAIL
    rule_ids:
      - NO_DUPLICATES_IN_COLUMN_GROUPS:
          column_names: "value"
      - NOT_NULL_SIMPLE
    metadata:
      brand: one

Rules: defines reusable sets of validation logic that rule bindings can reference.

rules:
  NOT_NULL_SIMPLE:
    rule_type: NOT_NULL

  REGEX_VALID_EMAIL:
    rule_type: REGEX
    params:
      pattern: |-
        ^[^@]+[@]{1}[^@]+$

  CUSTOM_SQL_LENGTH_LE_30:
    rule_type: CUSTOM_SQL_EXPR
    params:
      custom_sql_expr: |-
        LENGTH( $column ) <= 30

  NO_DUPLICATES_IN_COLUMN_GROUPS:
    rule_type: CUSTOM_SQL_STATEMENT
    params:
      custom_sql_arguments:
        - column_names
      custom_sql_statement: |-
        select
          $column_names
        from data
        group by $column_names
        having count(*) > 1

We will add more default rule types over time. For the time being, you can create arbitrarily complex SQL rules with CUSTOM_SQL_EXPR and CUSTOM_SQL_STATEMENT. Only CUSTOM_SQL_STATEMENT supports parametrization, via custom_sql_arguments, as shown in the condensed sketch below.
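To make the parametrization mechanism explicit, the following snippet condenses the configs above (it restates them rather than introducing new behavior): the rule declares column_names under custom_sql_arguments and references it as $column_names in its SQL, while the rule binding supplies the concrete value.

rules:
  NO_DUPLICATES_IN_COLUMN_GROUPS:
    rule_type: CUSTOM_SQL_STATEMENT
    params:
      custom_sql_arguments:
        - column_names    # declared argument, referenced as $column_names below
      custom_sql_statement: |-
        select $column_names
        from data
        group by $column_names
        having count(*) > 1

rule_bindings:
  T3_DQ_1_EMAIL_DUPLICATE:
    # entity_id, column_id, and row_filter_id omitted for brevity
    rule_ids:
      - NO_DUPLICATES_IN_COLUMN_GROUPS:
          column_names: "value"    # concrete value bound to the declared argument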

Row Filters: defines SQL expressions used to filter the data before validation; each rule binding references one via row_filter_id.

row_filters:
  NONE:
    filter_sql_expr: |-
      True

  DATA_TYPE_EMAIL:
    filter_sql_expr: |-
      contact_type = 'email'

Entities: defines the target data tables to be validated.

entities:
  TEST_TABLE:
    source_database: BIGQUERY
    table_name: contact_details
    database_name: dq_test
    instance_name: kthxbayes-sandbox
    environment_override:
      TEST:
        environment: test
        override:
          database_name: does_not_exists
          instance_name: does_not_exists
    columns:
      KEY:
        name: key
        data_type: STRING
        description: |-
          contact detail type
      VALUE:
        name: value
        data_type: STRING
        description: |-
          contact detail
      TS:
        name: ts
        data_type: DATETIME
        description: |-
          updated timestamp

How to use

Setting up dbt

The project uses dbt to execute SQL queries against BigQuery.

To set up dbt, you must configure a connection profile to a BigQuery project on GCP using a profiles.yml file.

You can read about dbt's profiles.yml configuration for BigQuery in the dbt documentation.

You can start by copying the template at:

cp profiles.yml.template profiles.yml

The project and dataset configurations in profiles.yml determine where the DQ summary output table will be created.
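For reference, a minimal BigQuery profile could look like the sketch below. This is an illustrative example rather than the shipped template; the profile name, project, and dataset values are placeholders you should replace with your own. The dev output name lines up with the --environment_target=dev flag used in the examples later in this README.

my_bigquery_profile:   # placeholder profile name
  target: dev
  outputs:
    dev:
      type: bigquery
      method: oauth                    # assumes gcloud application-default credentials
      project: <your_gcp_project_id>   # placeholder: GCP project for the DQ summary table
      dataset: clouddq                 # placeholder: BigQuery dataset for the DQ summary table
      threads: 1
      timeout_seconds: 300
      location: US                     # placeholder: your dataset location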

To create the test dataset used in the code's test suites and in the following examples, run (after installing dbt in a Python virtualenv with make install):

dbt seed

Building the project using Bazel

This project uses Bazel to build a self-contained executable Python binary that bundles the interpreter and all dependencies, including dbt.

The build uses go get to install a local bazelisk binary, which in turn fetches and runs Bazel.

A Makefile is provided to simplify build and development steps.

To build the module:

make build

and run it (this will show the help text):

python bazel-bin/clouddq/clouddq_patched.zip

The Python zip has been tested with Python versions >=2.7.17 and >=3.8.6. The make build command has been tested on Debian, Ubuntu, and macOS. There is no Windows support yet.

The Python zip includes the top-level macros directory, which is hard-coded into the executable and cannot be changed at runtime.

The CLI expects dbt_profiles_dir, dbt_path, configs_path, and environment_target to be provided at runtime. These can be customized to meet your needs.

Usage

Example command to execute two rule bindings:

python bazel-bin/clouddq/clouddq_patched.zip \
    T2_DQ_1_EMAIL,T3_DQ_1_EMAIL_DUPLICATE \
    configs \
    --metadata='{"test":"test"}' \
    --dbt_profiles_dir=. \
    --dbt_path=dbt \
    --environment_target=dev

Example command to execute all rule bindings in a path containing multiple YAML config files:

python bazel-bin/clouddq/clouddq_patched.zip \
    ALL \
    configs \
    --metadata='{"test":"test"}' \
    --dbt_profiles_dir=. \
    --dbt_path=dbt \
    --environment_target=dev

Development

To run the CLI without building the zip file:

bin/bazelisk run //clouddq:clouddq -- \
  T2_DQ_1_EMAIL \
  $(pwd)/configs/rule_bindings/team-2-rule-bindings.yml \
  --metadata='{"test":"test"}' \
  --dbt_profiles_dir="$(pwd)" \
  --dbt_path="$(pwd)/dbt" \
  --environment_target=dev

As bazel run executes the code in a sandbox, relative paths will resolve against the sandbox path rather than the current directory. Ensure you pass absolute paths for command-line arguments such as dbt_profiles_dir, dbt_path, and configs_path. Additionally, you may want to run source scripts/common.sh && confirm_gcloud_login to check that your gcloud credentials are set up correctly.

We do not provide a make run command because make is not designed to pass arbitrary input arguments through to the CLI.

To run unit tests:

make test

To run a particular test:

make test test_templates

To apply linting:

make lint

See more options at:

make help

Note that you may want to point the Bazel remote cache in .bazelrc to a GCS bucket you have access to:

build --remote_cache=https://storage.googleapis.com/<your_gcs_bucket_name>

Development without Bazel

To install clouddq in a local Python virtualenv using poetry:

make install

Running make install automatically enters a new virtualenv using poetry shell before installing clouddq into it. If Python version 3.8.6 is not found, the script will prompt you to decide whether to install it using pyenv.

To enter the virtualenv by spawning a new shell at "$SHELL", run:

poetry shell

To exit the virtualenv, run:

exit

To unset the local pyenv interpreter, run:

pyenv local --unset

License

CloudDQ is licensed under the Apache License version 2.0. This is not an official Google product.
