
Migrate Data from Redshift to BigQuery

These scripts were used for a migration from Redshift to BigQuery.

The scripts are not generic. A lot of names, paths, etc. are hard-coded.

Estimated time to run the migrations:

number of files = number of tables * number of days of data per table

50 tables with 100 days of data per table:
>= 5,000 files

Running the migrations one after the other, and assuming that migrating 1 table with 100 days of data takes 10 min:
50 * 10 min = 500 min <= 10 h
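
The same back-of-the-envelope estimate as shell arithmetic (the numbers are the example values from above, not measurements):

# example values: 50 tables, 100 days of data per table, ~10 min per table
tables=50; days=100; min_per_table=10
echo "files:   $(( tables * days ))"            # 5000
echo "minutes: $(( tables * min_per_table ))"   # 500 min, roughly 8.5 hours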

Setup

This setup guide is neither complete nor fully tested. For a different migration project the scripts will most likely need to be changed.

  • install virtualenv and virtualenvwrapper and create a venv
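One possible way to do this (the venv name rs-to-bq-migration is just a placeholder):
pip install --user virtualenv virtualenvwrapper
source "$(which virtualenvwrapper.sh)"   # adjust the path if virtualenvwrapper.sh is installed elsewhere
mkvirtualenv rs-to-bq-migration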

  • install Python packages

pip install -r requirements.txt
  • Clone the forked and modified BigShift from GitHub (branch support-daily-partitions)
git clone --branch support-daily-partitions https://github.com/RawIron/bigshift
  • install the required Ruby gems using Bundler
gem install bundler
bundle install
  • add AWS access keys to your env
export AWS_REGION=
export AWS_ACCESS_KEY_ID=
export AWS_SECRET_ACCESS_KEY=
  • install AWS command line tools
pip install awscli --upgrade --user
PATH="$HOME/.local/bin:$PATH"
  • create buckets on AWS S3 and Google Cloud Storage (GCS)
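For example, using the bucket name from the example migration below; these commands are only an illustration, and region/location flags are omitted:
aws --profile prod-bora s3 mb s3://zephyrus-ef4-prod-bora-migrate
gcloud config configurations activate prod-bora
gsutil mb gs://zephyrus-ef4-prod-bora-migrate/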

Tips

  • Do not migrate columns with the IDENTITY attribute

  • Use the --steps option and run the steps one by one. Fix anything that broke and try again. Move on to the next step when ready.

  • Start with one table, one day

  • Next try all tables, one day

  • Remove the --partition_day option in migrate_partition.sh to migrate a non-partitioned table

  • Make a backup of your csv files

  • Control what is migrated or worked on by editing the appropriate CSV file
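
For example, to follow the "one table, one day" tip you could drive the scripts from a filtered copy of the CSV (the table name events is hypothetical):

grep '^events,' min_day.csv > one_table.csv   # one table; choose an end_day close to its min day for a single day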

Example Migration

The team "ef4" of the "zephyrus" company migrates 2 projects from RS to BQ

  • bora
  • ostro

Quick Start: DAY-Partitioned

Caution: please make sure the hard-coded names, paths, etc. are correct.

  • Set the bigshift_home variable in migrate_partition.sh

  • Create a CSV file with: tablename,min(date(timestamp))

python3 db_tables.py --rs --daily --tables "your_project"
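The exact output of db_tables.py is not shown here; the lines below only illustrate the expected shape of the file (the table names are made up and the date format is an assumption):
# hypothetical contents of min_day.csv: tablename,min(date(timestamp))
events,2017-06-26
purchases,2017-08-15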
  • Run the migration: unload
# bigshift with --step unload
bash iter_table_partitions.sh "your_csv_file" "end_day" >n.out 2>&1
  • Search the log file n.out for Postgres client errors (lines containing PG:)
cat n.out | grep -B 5 PG: | grep "bash migrate" | cut -d " " -f -4 | uniq
  • Print the expected count of folders
bash iter_table_partitions.sh --count "your_csv_file" "end_day" | wc -l
  • Count the folders on S3
aws --profile prod-bora s3 ls s3://zephyrus-ef4-prod-bora-migrate/ | wc -l
  • When the counts are not equal, find the missing ones with
bash iter_table_partitions.sh --count min_day.csv 20171004 | sed 's/ //' | sort >should.out
aws --profile prod-bora s3 ls s3://zephyrus-ef4-prod-bora-migrate/ | sed 's/.*PRE //' | sed 's/\///' | sort >is.out
diff is.out should.out
  • Run the migration: transfer from S3 to GCS
gcloud config configurations activate prod-bora
gsutil ls gs://zephyrus-ef4-prod-bora-migrate/ | wc -l
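BigShift normally drives the S3-to-GCS transfer itself; if you ever need to copy (or re-copy) objects manually, gsutil can read directly from S3 once AWS credentials are configured in ~/.boto. This command is only an illustration, not part of the original scripts:
gsutil -m rsync -r s3://zephyrus-ef4-prod-bora-migrate/ gs://zephyrus-ef4-prod-bora-migrate/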
  • Drop and create the day-partitioned tables

This will wipe all the data in the tables. Make sure this is what you want.

# bigshift with --step drop
bash iter_table_partitions.sh "your_csv_file" >n.out 2>&1
  • Run the migration: load
# bigshift with --step load
bash iter_table_partitions.sh "your_csv_file" "end_day" >n.out 2>&1
  • Count rows in partitions for Redshift tables
python3 db_count.py --rs --daily --count "your_project" "end_day"
  • Count rows in partitions for BigQuery tables
python3 db_count.py --bq --daily --count "your_project" "end_day"
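Assuming both scripts print their counts in a comparable format (an assumption, so check the actual output first), the two results can be compared directly:
python3 db_count.py --rs --daily --count "your_project" "end_day" >rs_counts.out
python3 db_count.py --bq --daily --count "your_project" "end_day" >bq_counts.out
diff rs_counts.out bq_counts.out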
  • Validate the migration
python3 db_diff.py --verify "your_project"
