pukhrajborania/luigi-td-example

Luigi + Treasure Data Workflow Example

Building a complex data pipeline on Treasure Data? Don't write ad hoc scripts. This repository contains example data workflows built with Luigi, a robust Python-based workflow engine. The examples use Luigi-TD, which makes it easier for Luigi users to call the TD REST API.

Sample Workflow

List of Example Tasks

The ./apps/examples/ directory contains a few basic workflow tasks that you can use as building blocks. All examples use Luigi-TD, a library that makes it easier to use TD from Luigi.

Filename Description
single_hive.py Executes a Hive query and waits for its completion.
single_presto.py Executes a Presto query and waits for its completion.
query_and_download_as_csv.py Executes a Presto query, downloads the query result, and stores it as a CSV file in a local directory.
result_output.py Executes a Hive query and writes the results into a table in TD, then executes a Presto query against the generated table. Finally, downloads the second query's result and stores it as a CSV file in a local directory.
hourly_hive.py Represents the hourly execution of a Hive query. This script is meant to be called by cron.py.
daily_hive.py Represents the daily execution of a Hive query. This script is meant to be called by cron.py.
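The last step of query_and_download_as_csv.py and result_output.py, storing a query result as a local CSV file, can be sketched with the standard library alone. This is a minimal illustration, not the code in the examples: the function name write_rows_as_csv, the column names, and the rows are hypothetical stand-ins for whatever a downloaded result contains, and no Luigi-TD calls are shown.

```python
import csv
import os
import tempfile


def write_rows_as_csv(path, columns, rows):
    """Write a header row followed by data rows to `path` as CSV."""
    directory = os.path.dirname(path)
    if directory:
        # Create the local output directory if it does not exist yet.
        os.makedirs(directory, exist_ok=True)
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(columns)
        writer.writerows(rows)
    return path


if __name__ == "__main__":
    # Hypothetical stand-in for a downloaded query result.
    columns = ["method", "cnt"]
    rows = [("GET", 10), ("POST", 2)]
    out = write_rows_as_csv(
        os.path.join(tempfile.mkdtemp(), "result.csv"), columns, rows
    )
    with open(out, encoding="utf-8") as f:
        print(f.read())
```

Using csv.writer rather than joining values with commas keeps fields that contain commas or quotes correctly escaped.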

How to Develop My Tasks?

You can of course add your own workflow.

# Create your app directory
$ mkdir -p ./apps/yours

# Copy from examples
$ cp ./apps/examples/single_hive.py ./apps/yours

# Modify (Yes, emacs)
$ emacs -nw ./apps/yours/single_hive.py

# Test
$ python ./apps/yours/single_hive.py YourTaskX --local-scheduler

# Commit
$ git add ./apps/yours/
$ git commit -a -m 'add new task'

The Luigi documentation is a great place to start learning the basics of Luigi. After that, the Luigi-TD documentation covers the specifics of using TD with Luigi.

How to Deploy?

Ready to deploy your first workflow? Here are a couple of ways to get started.

Deploy on Heroku

This repository is ready to deploy on the Heroku PaaS. Just click the button below to create a Heroku app running cron.py, which kicks off workflows on an hourly or daily basis.

Deploy

After that, use a clock dyno instead of a web dyno. You can configure this with the heroku command or from the Heroku dashboard.

$ heroku ps:scale web=0 --app <YOUR_APP_NAME>
$ heroku ps:scale clock=1 --app <YOUR_APP_NAME>

For more information about cron.py, see Heroku's documentation, "Scheduled Jobs with Custom Clock Processes in Python with APScheduler".
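To make the idea of a clock process concrete, the sketch below shows how a long-running process can decide which jobs are due between two ticks. It is an assumption-laden illustration only: it does not use APScheduler, it is not the repository's cron.py, and the function names and the choice of midnight as the daily boundary are hypothetical.

```python
from datetime import datetime, timedelta


def next_hourly_run(now):
    """Return the start of the next full hour after `now`."""
    return now.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)


def next_daily_run(now):
    """Return the next midnight after `now` (assumed daily boundary)."""
    return now.replace(hour=0, minute=0, second=0, microsecond=0) + timedelta(days=1)


def due_jobs(last_tick, now):
    """List the hypothetical jobs whose boundary falls in (last_tick, now]."""
    jobs = []
    if next_hourly_run(last_tick) <= now:
        jobs.append("hourly")
    if next_daily_run(last_tick) <= now:
        jobs.append("daily")
    return jobs


if __name__ == "__main__":
    # Crossing midnight triggers both the hourly and the daily job.
    print(due_jobs(datetime(2015, 1, 1, 23, 59), datetime(2015, 1, 2, 0, 0)))
    # Staying inside the same hour triggers nothing.
    print(due_jobs(datetime(2015, 1, 1, 10, 5), datetime(2015, 1, 1, 10, 30)))
```

A scheduler library handles the same bookkeeping plus missed runs and concurrency, which is why the Heroku article recommends one instead of a hand-written loop.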

Please modify client.cfg (Luigi's configuration file) to change the error notification email address. Other configuration variables can be found here.

Deploy on Your Machine

To run this repository, you need Python installed on your machine.

# Install required libraries
$ pip install -r ./requirements.txt

# Set your TD API Key (http://console.treasuredata.com/users/current)
$ export TD_API_KEY="..."

# Run specific Task
$ python ./apps/examples/single_hive.py TaskXXX --local-scheduler

# Remove intermediate results, and execute from scratch
$ rm -fr ./tmp/
$ python ./apps/examples/single_hive.py TaskXXX --local-scheduler

# Run periodic Task
$ python ./cron.py --local-scheduler

Please modify client.cfg (Luigi's configuration file) to change the error notification email address. Other configuration variables can be found here.
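Removing ./tmp/ forces a rerun because Luigi, by default, treats a task as complete once its output target exists and skips it on later runs. The sketch below imitates that behaviour with plain files so the effect is easy to see. It assumes the example tasks keep their outputs under ./tmp/, which the commands above suggest but this README does not state; run_once and the file names are hypothetical and no Luigi API is used.

```python
import os
import tempfile


def run_once(output_path, compute):
    """Run `compute` only if `output_path` does not exist yet.

    Returns True if the work ran, False if it was skipped.
    """
    if os.path.exists(output_path):
        return False  # output already present: treated as complete
    os.makedirs(os.path.dirname(output_path) or ".", exist_ok=True)
    # Write to a temporary name first, then rename, so an interrupted
    # run never leaves a half-written file that looks complete.
    partial_path = output_path + ".partial"
    with open(partial_path, "w", encoding="utf-8") as f:
        f.write(compute())
    os.replace(partial_path, output_path)
    return True


if __name__ == "__main__":
    out = os.path.join(tempfile.mkdtemp(), "tmp", "TaskXXX.out")
    print(run_once(out, lambda: "result"))  # first call does the work
    print(run_once(out, lambda: "result"))  # second call is skipped
    os.remove(out)                          # like `rm -fr ./tmp/`
    print(run_once(out, lambda: "result"))  # work runs again
```

The same reasoning explains why a failed workflow can be restarted cheaply: steps whose outputs already exist are not repeated.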

Resources

Support

Need a hand with something? Send us an email to [email protected] and we'll get back to you right away! For technical questions, use the treasure-data tag on Stack Overflow.

About

Example Repository for Building Complex Data Pipelines with Luigi + TD