Building a complex data pipeline on Treasure Data? Don't write ad hoc scripts. This repository contains example data workflows built with Luigi, a robust Python-based workflow engine. The examples use Luigi-TD, which makes it easier to call the TD REST API from Luigi.
The ./apps/examples/ directory contains a few basic workflow tasks that you can use as building blocks. All of them use Luigi-TD.
Filename | Description |
---|---|
single_hive.py | Executes a Hive query and waits for it to complete. |
single_presto.py | Executes a Presto query and waits for it to complete. |
query_and_download_as_csv.py | Executes a Presto query, downloads the query result, and stores it as a CSV file in a local directory. |
result_output.py | Executes a Hive query and writes the results into a table in TD, then executes a Presto query against the generated table. Finally, downloads the result of the second query and stores it as a CSV file in a local directory. |
hourly_hive.py | Represents the hourly execution of a Hive query. This script is meant to be called by cron.py. |
daily_hive.py | Represents the daily execution of a Hive query. This script is meant to be called by cron.py. |
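The hourly and daily entries above are meant to run once per period. The sketch below illustrates that idea with the standard library only, not with Luigi or Luigi-TD: truncate the current time to the period, derive a marker path from it, and skip the run if the marker already exists. The function names and the marker file layout are hypothetical and are not taken from the example scripts.

```python
from datetime import datetime
from pathlib import Path
import tempfile


def period_start(now: datetime, period: str) -> datetime:
    """Truncate `now` to the start of its hour or day."""
    if period == "hourly":
        return now.replace(minute=0, second=0, microsecond=0)
    if period == "daily":
        return now.replace(hour=0, minute=0, second=0, microsecond=0)
    raise ValueError(f"unknown period: {period!r}")


def marker_path(base: Path, task: str, now: datetime, period: str) -> Path:
    """Hypothetical marker file recording a finished run for one period."""
    stamp = period_start(now, period).strftime("%Y-%m-%dT%H")
    return base / f"{task}-{period}-{stamp}.done"


def run_once(base: Path, task: str, now: datetime, period: str) -> bool:
    """Return True if the work ran, False if this period was already done."""
    marker = marker_path(base, task, now, period)
    if marker.exists():
        return False
    # The real query submission would go here.
    marker.parent.mkdir(parents=True, exist_ok=True)
    marker.touch()
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        now = datetime(2015, 1, 1, 13, 45)
        print(run_once(Path(tmp), "MyHourlyTask", now, "hourly"))  # first call runs
        print(run_once(Path(tmp), "MyHourlyTask", now, "hourly"))  # same hour is skipped
```

Luigi applies a similar rule through task outputs: a task whose output already exists is treated as complete and is not run again.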
You can of course add your own workflow.
# Create your app directory
$ mkdir -p ./apps/yours
# Copy from examples
$ cp ./apps/examples/single_hive.py ./apps/yours
# Modify (Yes, emacs)
$ emacs -nw ./apps/yours/single_hive.py
# Test
$ python ./apps/yours/single_hive.py YourTaskX --local-scheduler
# Commit
$ git add ./apps/yours/
$ git commit -a -m 'add new task'
The Luigi documentation is a great place to start learning the basics of Luigi. After that, the Luigi-TD documentation covers the specifics of using TD with Luigi.
Ready to deploy your first workflow? Here are a couple of ways to get started.
This repository is ready to deploy on Heroku. Hit the button below to create a Heroku app running cron.py, which kicks off workflows on an hourly or daily basis.
After that, use a clock dyno instead of a web dyno. You can configure this with the heroku command or from the Heroku dashboard.
$ heroku scale web=0 --app <YOUR_APP_NAME>
$ heroku scale clock=1 --app <YOUR_APP_NAME>
For further information about cron.py, please check the Heroku documentation Scheduled Jobs with Custom Clock Processes in Python with APScheduler.
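As a rough illustration of what a clock process has to decide, the sketch below computes when the next hourly and daily runs would fire. It uses only the standard library and is not the code in cron.py; the Heroku article referenced above describes the APScheduler-based approach. `next_hourly` and `next_daily` are hypothetical names.

```python
from datetime import datetime, timedelta


def next_hourly(now: datetime) -> datetime:
    """Next top of the hour strictly after `now`."""
    top = now.replace(minute=0, second=0, microsecond=0)
    return top + timedelta(hours=1)


def next_daily(now: datetime) -> datetime:
    """Next midnight strictly after `now`."""
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight + timedelta(days=1)


if __name__ == "__main__":
    now = datetime(2015, 1, 1, 23, 30)
    print(next_hourly(now))  # 2015-01-02 00:00:00
    print(next_daily(now))   # 2015-01-02 00:00:00
```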
Please modify client.cfg (Luigi's configuration file) to change the error notification email address. Other configuration variables can be found here.
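For orientation, the sketch below parses a small, made-up configuration with Python's `configparser` to show the general shape of such a file. The `[core]` section and the `error-email` key are assumptions based on older Luigi releases; check the client.cfg in this repository and the Luigi configuration documentation for the names your version expects.

```python
import configparser

# Made-up sample; the real client.cfg in this repository may differ.
SAMPLE_CFG = """
[core]
error-email: you@example.com
"""


def error_email(cfg_text: str) -> str:
    """Return the notification address from a Luigi-style config string.

    Returns an empty string when the section or key is missing.
    """
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    return parser.get("core", "error-email", fallback="")


if __name__ == "__main__":
    print(error_email(SAMPLE_CFG))
```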
To run this repository, you need Python installed on your machine.
# Install required libraries
$ pip install -r ./requirements.txt
# Set your TD API Key (http://console.treasuredata.com/users/current)
$ export TD_API_KEY="..."
# Run specific Task
$ python ./apps/examples/single_hive.py TaskXXX --local-scheduler
# Remove intermediate results, and execute from scratch
$ rm -fr ./tmp/
$ python ./apps/examples/single_hive.py TaskXXX --local-scheduler
# Run periodic Task
$ python ./cron.py --local-scheduler
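Because the commands above depend on `TD_API_KEY` being exported, a script can fail early with a clear message when the variable is missing. A minimal sketch, where `require_td_api_key` is a hypothetical helper and not part of Luigi or Luigi-TD:

```python
import os
import sys
from typing import Mapping


def require_td_api_key(environ: Mapping[str, str] = os.environ) -> str:
    """Return the TD API key from the environment or raise if it is unset."""
    key = environ.get("TD_API_KEY", "").strip()
    if not key:
        raise RuntimeError("TD_API_KEY is not set; export it before running a task")
    return key


if __name__ == "__main__":
    try:
        require_td_api_key()
    except RuntimeError as exc:
        # Exit with a non-zero status so cron or a shell script notices the failure.
        sys.exit(str(exc))
    print("TD_API_KEY found")
```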
Please modify client.cfg (Luigi's configuration file) to change the error notification email address. Other configuration variables can be found here.
Need a hand with something? Send us an email to [email protected] and we'll get back to you right away! For technical questions, use the treasure-data tag on Stack Overflow.