This repository is intended to store the material for the Airbyte MotherDuck Hackathon.


Complete ETL pipeline using the Airbyte MotherDuck connector

The Dataset

The dataset used for this project is a synthetic one generated with the versatile Python Faker module. It defines a FinancialProvider class that generates multiple investment_types, transaction_types, and risk_profiles for further analysis by end users in a UI that will be implemented using Evidence.

Below is a sample of the data_generator.py script that generates the synthetic financial datasets. The generator creates interconnected CSV files containing customer information, accounts, transactions, and investments:

import faker
from faker.providers import BaseProvider
import pandas as pd
import random
import uuid
import argparse
import os
from datetime import date

class FinancialProvider(BaseProvider):
    def investment_type(self):
        """Generate a realistic investment type."""
        return random.choice([
            'Stocks', 'Bonds', 'Mutual Funds', 'ETFs', 
            'Cryptocurrency', 'Real Estate Investment Trust', 
            'Commodities', 'Government Bonds'
        ])
    
    def transaction_type(self):
        """Generate a transaction type."""
        return random.choice([
            'Deposit', 'Withdrawal', 'Transfer', 
            'Purchase', 'Sale', 'Dividend Reinvestment'
        ])
    
    def risk_profile(self):
        """Generate an investment risk profile."""
        return random.choice([
            'Low Risk', 'Medium Risk', 'High Risk', 
            'Aggressive', 'Conservative', 'Balanced'
        ])
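
As a quick illustration (this snippet is a sketch and not part of data_generator.py), the custom provider is registered on a Faker instance and then called like any built-in provider:

# Minimal usage sketch (assumed, not taken from the repository code):
# register the custom provider on a Faker instance and call it
# like any built-in provider.
fake = faker.Faker()
fake.add_provider(FinancialProvider)

print(fake.name())              # a realistic customer name
print(fake.investment_type())   # e.g. 'ETFs'
print(fake.transaction_type())  # e.g. 'Deposit'
print(fake.risk_profile())      # e.g. 'Balanced'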

Features

  • The script generates 4 related datasets:
    • Customers (including demographics, income, credit score)
    • Accounts (types, risk profiles)
    • Transactions (deposits, withdrawals, purchases)
    • Investments (types, amounts, sectors)
  • Creates UUID-based relationships between tables
  • Uses realistic fake data for names, emails, and locations
  • Organizes output in date-based directory structure

Installation

pip install faker pandas

Usage

python data_generator.py --num_customers 100 --num_transactions 200 --output_data generated_data

Arguments

  • --num_customers: Number of customers to generate (default: 100)
  • --num_transactions: Number of transactions to generate (default: 200)
  • --output_data: Directory to save generated CSV files (default: 'generated_data')
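
The argument-parsing code is not shown in the sample above, but based on the arguments and defaults listed here it presumably looks roughly like the following sketch (names and defaults are taken from this README; the exact implementation may differ):

# Sketch of the CLI parsing implied by the arguments above.
parser = argparse.ArgumentParser(description='Generate synthetic financial datasets')
parser.add_argument('--num_customers', type=int, default=100,
                    help='Number of customers to generate')
parser.add_argument('--num_transactions', type=int, default=200,
                    help='Number of transactions to generate')
parser.add_argument('--output_data', default='generated_data',
                    help='Directory to save generated CSV files')
args = parser.parse_args()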

Output Structure

generated_data/
└── YYYY-MM-DD/
    ├── customers.csv
    ├── accounts.csv
    ├── transactions.csv
    └── investments.csv
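
The date-named subdirectory can be derived from the current date. A minimal sketch of how the script likely builds this layout (the actual implementation may differ):

# Build the dated output directory, e.g. generated_data/YYYY-MM-DD/
# (sketch only; based on the output structure above).
output_dir = os.path.join('generated_data', date.today().isoformat())
os.makedirs(output_dir, exist_ok=True)

# Each dataset is then written as a CSV inside that folder, for example:
# pd.DataFrame(customers).to_csv(os.path.join(output_dir, 'customers.csv'), index=False)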

Data Schema

customers.csv

  • customer_id (UUID)
  • customer_name
  • customer_email
  • customer_age (18-75)
  • annual_income ($20k-$500k)
  • credit_score (300-850)
  • country
  • city

accounts.csv

  • account_id (UUID)
  • customer_id (UUID)
  • account_type
  • risk_profile
  • opening_date
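
To illustrate how the UUID-based link between customers.csv and accounts.csv might be produced, here is a rough sketch using the fake instance from the earlier snippet (field names follow the schema above; the account_type values and date range are assumptions, and the real generator may build rows differently):

# Rough sketch of generating one linked customer/account pair.
customer = {
    'customer_id': str(uuid.uuid4()),
    'customer_name': fake.name(),
    'customer_email': fake.email(),
    'customer_age': random.randint(18, 75),
    'annual_income': random.randint(20_000, 500_000),
    'credit_score': random.randint(300, 850),
    'country': fake.country(),
    'city': fake.city(),
}

account = {
    'account_id': str(uuid.uuid4()),
    'customer_id': customer['customer_id'],  # foreign key back to customers.csv
    'account_type': random.choice(['Checking', 'Savings', 'Brokerage']),  # assumed values
    'risk_profile': fake.risk_profile(),     # from the custom FinancialProvider
    'opening_date': fake.date_between(start_date='-5y', end_date='today'),
}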

The Workflow

I wanted to create a standalone workflow that generates each of the datasets required for the project on a daily basis. GitHub Actions is ideal for this purpose: it can be set up as an automated system that runs on a cron-based schedule and, finally, stores the generated daily data in AWS S3.

System Overview

  • Daily generation of financial datasets (customers, accounts, transactions, investments)
  • Random variation in data volume
  • Automatic upload to AWS S3
  • Manual trigger option via workflow_dispatch

Features

  • Monitoring/Maintenance: Built-in logging, execution history, error notifications via email, easy workflow modifications.

  • Infrastructure Benefits: Free for public repositories, no need for dedicated servers, built-in secret management.

  • Automation: Scheduled execution (cron: '0 0 * * *'), manual triggers via workflow_dispatch, environment setup automation, dependency management.

Synthetic data generation action

In the case of this project, the workflow installs the basic dependencies on every run, then executes the data_generator.py script directly from the GitHub repo, and, as a last step, automatically stores the daily generated data in a previously specified S3 bucket that will be consumed by Airbyte during the data integration process.

...
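    # Earlier steps (elided above) presumably check out the repository and set
    # up Python; per the description above, they install the dependencies and
    # expose randomized customer/transaction counts as outputs of a step with
    # id `random`, alongside the cron schedule and workflow_dispatch triggers.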
    - name: Generate Financial Dataset
      env:
        NUM_CUSTOMERS: ${{ steps.random.outputs.customers }}
        NUM_TRANSACTIONS: ${{ steps.random.outputs.transactions }}
      run: |
        mkdir -p generated_data
        python data_generator.py \
          --num_customers $NUM_CUSTOMERS \
          --num_transactions $NUM_TRANSACTIONS \
          --output_data generated_data
    
    - name: Configure AWS Credentials
      uses: aws-actions/configure-aws-credentials@v4
      with:
        aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
        aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        aws-region: us-east-1
    
    - name: Upload to S3
      run: |
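        # $(basename $(dirname $file)) below is the dated YYYY-MM-DD folder name,
        # so the date-based layout is preserved under financial_data/ in S3.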
        for file in generated_data/*/*.csv; do
          aws s3 cp "$file" "s3://${{ secrets.S3_BUCKET_NAME }}/financial_data/$(basename $(dirname $file))/$(basename $file)"
        done

Airbyte Data Integration

Once the data extraction backend infrastructure was in place, it was time to set up the Airbyte connections. Basically, 4 data streams are configured, one for each of the datasets stored in the S3 bucket:

[screenshot: s3_data]

Each stream syncs data from the S3 bucket using the Airbyte S3 source connector, following the search pattern financial-data1215/financial_data/YYYY-MM-DD/. This way, every folder identified by its date is picked up, and the customers, accounts, transactions, and investments CSV files it contains are replicated to the MotherDuck destination:

[screenshot: s3_source]

For the Airbyte MotherDuck connector, there are 4 active connections, one for each of the 4 financial datasets:

[screenshot: airbyte_connections]

The search pattern configured in the connection settings in Airbyte combines all matching CSV files into a single stream that is loaded into a database table in the MotherDuck warehouse.

The Data Warehouse

As defined on the MotherDuck website:

MotherDuck is a cloud-native data warehouse, built on top of DuckDB, a fast in-process analytical database. It inherits some features from DuckDB that present opportunities to think differently about data warehousing methods in order to achieve high levels of performance and simplify the experience.

The MotherDuck UI is very flexible for working with multiple data sources and users in a collaborative way. In the screenshot below I have already executed the data integration process on the S3 data generated by the GitHub Actions workflow; the resulting tables are indicated by the purple arrows:

  • customerss3
  • accountss3
  • transactionss3
  • investmentss3

I also created some dbt models that generate additional tables in the MotherDuck warehouse, indicated by the red arrows:

  • customer_profile
  • investment_analysis

[screenshot: md_ui]
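
Once the tables are in MotherDuck, they can also be queried programmatically through the DuckDB Python client. Below is a minimal sketch (the database name is a placeholder and the MOTHERDUCK_TOKEN environment variable is assumed to be set; this is not the project's actual configuration):

import duckdb

# Connect to MotherDuck via the DuckDB client; 'my_db' is a placeholder
# database name, and authentication relies on the MOTHERDUCK_TOKEN env var.
con = duckdb.connect('md:my_db')

# Example query over one of the Airbyte-loaded tables listed above.
df = con.execute("""
    SELECT country, COUNT(*) AS customers
    FROM customerss3
    GROUP BY country
    ORDER BY customers DESC
    LIMIT 10
""").df()
print(df)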

The UI

The UI where users will be able to interact with the data is developed using the open-source BI tool Evidence.
