- JSON configuration-driven data movement - no Java/Scala knowledge needed
- Join and transform data among heterogeneous datastores (including NoSQL datastores) using ANSI SQL
- Deploys on Amazon EMR and AWS Fargate, but can run on any Spark cluster
- Picks up datastore credentials stored in HashiCorp Vault or AWS Secrets Manager
- Execution logs and migration history can be sent to Amazon CloudWatch and/or S3
- Use built-in cron scheduler, or call REST API from external schedulers
... and many more features documented here
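As an illustration of the configuration-driven approach, a DataPull input file pairs one or more sources with a destination. The shape below is a hypothetical sketch (the field names are assumptions, not the real schema); the bundled samples such as Input_Sample_filesystem-to-filesystem.json are the authoritative reference.

```json
{
  "useremailaddress": "[email protected]",
  "migrations": [
    {
      "sources": [
        { "platform": "filesystem", "path": "SampleData/HelloWorld.csv", "fileformat": "csv" }
      ],
      "destination": { "platform": "filesystem", "path": "SampleData_Json", "fileformat": "json" }
    }
  ]
}
```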
Note: DataPull consists of two services: an API written in Java Spring Boot, and a Spark app written in Scala. Although Scala apps can run on JDK 11, the official Scala docs recommend compiling Scala code with Java 8. The effort to upgrade to OpenJDK 11+ is tracked here
Pre-requisite: Docker Desktop
- Clone this repo locally and check out the master branch
git clone [email protected]:homeaway/datapull.git
- Build the Scala JAR from within the core folder
cd datapull/core
make build
- Execute the sample JSON input file Input_Sample_filesystem-to-filesystem.json, which moves data from a CSV file HelloWorld.csv to a folder of JSON files named SampleData_Json.
docker run -v $(pwd):/core -w /core -it --rm gettyimages/spark:2.2.1-hadoop-2.8 spark-submit --deploy-mode client --class core.DataPull target/DataMigrationFramework-1.0-SNAPSHOT-jar-with-dependencies.jar src/main/resources/Samples/Input_Sample_filesystem-to-filesystem.json local
- Open the relative path target/classes/SampleData_Json to find the result of the DataPull, i.e. the data from target/classes/SampleData/HelloWorld.csv transformed into JSON.
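Conceptually, what DataPull does in this sample is equivalent to parsing each CSV row into a JSON object. A minimal, standalone Python sketch of that transformation (the CSV contents below are invented for illustration; they are not the actual HelloWorld.csv):

```python
import csv
import io
import json

def csv_rows_to_json_lines(csv_text: str) -> str:
    """Parse CSV text (header row first) and emit one JSON object per row,
    newline-delimited, like the part files Spark writes for a JSON output."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return "\n".join(json.dumps(row) for row in reader)

# Hypothetical two-row CSV, standing in for HelloWorld.csv
sample_csv = "greeting,language\nHello World,English\nHola Mundo,Spanish\n"
print(csv_rows_to_json_lines(sample_csv))
```

DataPull performs the same kind of conversion at scale, distributed across Spark executors.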
Pre-requisite: IntelliJ with Scala plugin configured. Check out this Help page if this plugin is not installed.
- Clone this repo locally and check out the master branch
- Open the folder core in IntelliJ IDE.
- When prompted, add this project as a maven project.
- By default, this source code is set up to execute a sample JSON input file Input_Sample_filesystem-to-filesystem.json, which moves data from a CSV file HelloWorld.csv to a folder of JSON files named SampleData_Json.
- Go to File > Project Structure..., and choose 1.8 (Java version) as the Project SDK
- Go to Run > Edit Configurations... , and do the following
- Create an Application configuration (use the + sign on the top left corner of the modal window)
- Set the Name to Debug
- Set the Main Class to core.DataPull (the same fully qualified class used by spark-submit)
- Use classpath of module Core.DataPull
- Set JRE to 1.8
- Click Apply and then OK
- Click Run > Debug 'Debug' to start the debug execution
- Open the relative path target/classes/SampleData_Json to find the result of the DataPull, i.e. the data from target/classes/SampleData/HelloWorld.csv transformed into JSON.
Deploying DataPull to AWS involves:
- installing the DataPull API and Spark JAR in AWS Fargate, using this runbook
- running DataPulls in AWS EMR, using this runbook
Please create an issue in this Git repo using the bug report or feature request templates.