This directory contains the code for all the ETL jobs that WhereHows can run.
- HDFS Dataset
- Teradata Dataset
- Oracle Dataset
- Hive Dataset
- Azkaban Execution
- Oozie Execution
- Appworx Execution
- Azkaban Lineage
- Appworx Lineage
- Elastic Search
- LDAP Directory
- Druid Dataset
The ETL jobs are scheduled and configured via job files. Please refer to this directory for more details.
Collect dataset metadata from HDFS.
Major related module: wherehows-hadoop
The standalone module 'wherehows-hadoop' is responsible for this process.
In a real production environment, the machine used to run this ETL job is usually different from the machine that is the gateway to Hadoop. So we need to copy the runnable JAR file to a remote machine, execute remotely, and copy back the result.
At compile time, the wherehows-hadoop module is packaged into a standalone JAR file. At runtime, a Jython script copies it to the Hadoop gateway, runs it there, and copies the results back.
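For illustration, here is a minimal sketch of that copy-run-copy-back step, assuming passwordless SSH to the gateway; the host name, paths, and launch command below are placeholders, not the actual job configuration.

```python
import subprocess

# Placeholder values -- the real job reads these from its job properties.
GATEWAY = "hadoop-gateway.example.com"
LOCAL_JAR = "wherehows-hadoop.jar"
REMOTE_DIR = "/tmp/wherehows"
RESULT_FILES = ["metadata.dat", "sample_data.dat"]

def run(cmd):
    """Run a shell command and fail loudly if it returns non-zero."""
    print("+ " + " ".join(cmd))
    subprocess.check_call(cmd)

# 1. Copy the standalone JAR to the Hadoop gateway.
run(["scp", LOCAL_JAR, "%s:%s/" % (GATEWAY, REMOTE_DIR)])

# 2. Run it remotely; the JAR scans HDFS and writes the result files.
#    (The exact launch command is configured in the job properties.)
run(["ssh", GATEWAY, "cd %s && hadoop jar %s" % (REMOTE_DIR, LOCAL_JAR)])

# 3. Copy the result files back for the transform step.
for f in RESULT_FILES:
    run(["scp", "%s:%s/%s" % (GATEWAY, REMOTE_DIR, f), "."])
```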
Inside the module, we use a whitelist of folders (configured through parameters) as the starting point to scan through folders and files. After abstracting them to the dataset level, we extract the schema, sample data, and related metadata. The final step is to store the results in two files: a metadata file and a sample data file.
Major related file: HdfsTransform.py
Transform the JSON output into CSV format for easy loading.
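As a rough sketch of what this step does (the field names here are hypothetical; the real HdfsTransform.py maps the extractor's JSON fields onto the staging-table columns):

```python
import csv
import json

# Hypothetical field names for illustration only.
FIELDS = ["name", "schema", "properties", "fields", "urn", "source"]

def json_to_csv(json_path, csv_path):
    """Convert a file of line-delimited JSON records into a CSV file."""
    with open(json_path) as src, open(csv_path, "w") as dst:
        writer = csv.writer(dst)
        for line in src:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            row = []
            for f in FIELDS:
                value = record.get(f, "")
                # Serialize nested values back to JSON text so every
                # CSV cell is a flat string.
                if isinstance(value, (dict, list)):
                    value = json.dumps(value)
                row.append(value)
            writer.writerow(row)

json_to_csv("metadata.dat", "metadata.csv")
```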
Major related file: HdfsLoad.py
Load into the MySQL database. Related tables: dict_dataset, dict_dataset_sample, dict_field_detail
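A minimal sketch of the load idea using mysql-connector-python and `LOAD DATA LOCAL INFILE`; the staging-table name is an assumption, and the real HdfsLoad.py additionally merges the staging data into the final tables.

```python
import mysql.connector

# local_infile must also be enabled on the MySQL server side.
conn = mysql.connector.connect(
    host="localhost", user="wherehows", password="wherehows",
    database="wherehows", allow_local_infile=True)
cursor = conn.cursor()

# Bulk-load the CSV produced by the transform step into a staging table
# (table name assumed for illustration), before merging into dict_dataset.
cursor.execute("""
    LOAD DATA LOCAL INFILE 'metadata.csv'
    INTO TABLE stg_dict_dataset
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\\n'
""")
conn.commit()
cursor.close()
conn.close()
```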
Collect dataset metadata from Teradata.
Major related file: TeradataExtract.py
Get metadata from Teradata DBC databases and store it in a local JSON file.
Major source tables: DBC.Tables, DBC.Columns, DBC.ColumnStatsV, DBC.TableSize, DBC.Indices, DBC.IndexConstraints
Use 'SELECT TOP 10' queries to collect sample data.
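A minimal sketch of the extract queries, assuming a DB-API connection from whatever Teradata driver is configured (the real TeradataExtract.py runs under Jython with the Teradata JDBC driver); the database and output names are placeholders.

```python
import json

def extract_teradata_metadata(conn, database_name, output_path):
    """Pull table/column metadata from the DBC views plus sample rows,
    then write everything to a local JSON file."""
    cursor = conn.cursor()
    datasets = []

    cursor.execute(
        "SELECT TableName FROM DBC.Tables "
        "WHERE DatabaseName = '%s' AND TableKind = 'T'" % database_name)
    for (table_name,) in cursor.fetchall():
        cursor.execute(
            "SELECT ColumnName, ColumnType FROM DBC.Columns "
            "WHERE DatabaseName = '%s' AND TableName = '%s'"
            % (database_name, table_name))
        columns = [{"name": c, "type": t} for c, t in cursor.fetchall()]

        # 'SELECT TOP 10' to grab sample data for the dataset.
        cursor.execute(
            'SELECT TOP 10 * FROM "%s"."%s"' % (database_name, table_name))
        sample = [list(row) for row in cursor.fetchall()]

        datasets.append({"database": database_name, "table": table_name,
                         "columns": columns, "sample": sample})

    with open(output_path, "w") as out:
        json.dump(datasets, out, default=str)
```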
Major related file: TeradataTransform.py
Transform the JSON output into CSV format for easy loading.
Major related file: TeradataLoad.py
Load into MySQL database.
Related tables: dict_dataset, dict_field_detail, dict_dataset_sample
Collect dataset metadata from Oracle DB.
Major related file: OracleExtract.py
Connect to the Oracle database to get all table/column/comment information, excluding the databases in the exclude list. Extra table information, including indexes, constraints, and partitions, is also fetched. The results are formatted and stored in two CSV files: one for table records and the other for field records.
Major source tables: ALL_TABLES, ALL_TAB_COLUMNS, ALL_COL_COMMENTS, ALL_INDEXES, ALL_IND_COLUMNS, ALL_CONSTRAINTS, ALL_PART_KEY_COLUMNS
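A minimal sketch of the dictionary queries using cx_Oracle; the connection string, exclude list, and CSV layout are placeholders, and the real OracleExtract.py also collects table comments, indexes, constraints, and partitions.

```python
import csv
import cx_Oracle

EXCLUDED_OWNERS = ("SYS", "SYSTEM")  # illustrative exclude list

conn = cx_Oracle.connect("wherehows/password@oracle-host:1521/ORCL")
cursor = conn.cursor()
exclude = ",".join("'%s'" % o for o in EXCLUDED_OWNERS)

with open("oracle_table_record.csv", "w") as t_out, \
     open("oracle_field_record.csv", "w") as f_out:
    table_writer = csv.writer(t_out)
    field_writer = csv.writer(f_out)

    # One record per table.
    cursor.execute(
        "SELECT owner, table_name FROM ALL_TABLES "
        "WHERE owner NOT IN (%s)" % exclude)
    for owner, table_name in cursor:
        table_writer.writerow([owner, table_name])

    # One record per column, with its comment.
    cursor.execute(
        "SELECT c.owner, c.table_name, c.column_name, c.data_type, m.comments "
        "FROM ALL_TAB_COLUMNS c "
        "LEFT JOIN ALL_COL_COMMENTS m "
        "  ON c.owner = m.owner AND c.table_name = m.table_name "
        " AND c.column_name = m.column_name "
        "WHERE c.owner NOT IN (%s)" % exclude)
    for row in cursor:
        field_writer.writerow(list(row))

cursor.close()
conn.close()
```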
No separate transform step is needed; the extract step already writes CSV files ready for loading.
Major related file: OracleLoad.py
Load into MySQL database, similar to HiveLoad or HdfsLoad.
Related tables: dict_dataset, dict_field_detail, dict_dataset_sample
Collect dataset metadata from Hive.
Major related file: HiveExtract.py
Connect to Hive Metastore to get the Hive table/view information and store it in a local JSON file.
Major source tables: COLUMNS_V2, SERDE_PARAMS
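A minimal sketch of querying a MySQL-backed metastore directly; the join path DBS → TBLS → SDS → COLUMNS_V2 is the standard metastore schema, but the real HiveExtract.py handles more cases (views, Avro schema literals in SERDE_PARAMS, etc.), and the connection details below are placeholders.

```python
import json
import mysql.connector

# Connect to the Hive Metastore's backing MySQL database.
conn = mysql.connector.connect(host="metastore-db", user="hive",
                               password="hive", database="hive_metastore")
cursor = conn.cursor()

# Standard metastore join path: DBS -> TBLS -> SDS -> COLUMNS_V2.
cursor.execute("""
    SELECT d.NAME, t.TBL_NAME, c.COLUMN_NAME, c.TYPE_NAME, c.INTEGER_IDX
    FROM DBS d
    JOIN TBLS t ON t.DB_ID = d.DB_ID
    JOIN SDS s ON s.SD_ID = t.SD_ID
    JOIN COLUMNS_V2 c ON c.CD_ID = s.CD_ID
    ORDER BY d.NAME, t.TBL_NAME, c.INTEGER_IDX
""")

tables = {}
for db_name, tbl_name, col_name, col_type, idx in cursor:
    key = "%s.%s" % (db_name, tbl_name)
    tables.setdefault(key, []).append({"name": col_name, "type": col_type})

with open("hive_metadata.json", "w") as out:
    json.dump(tables, out)
```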
Major related file: HiveTransform.py
Transform the JSON output into CSV format for easy loading.
Major related file: HiveLoad.py
Load into MySQL database.
Related tables: dict_dataset
Collect Azkaban execution information, including Azkaban flows/jobs definitions, DAGs, executions, owners, and schedules.
Major related file: AzkabanExtract.py
Connect to Azkaban MySQL database, collect metadata, and store in local file.
Major source tables from Azkaban database: project_flows, execution_flows, triggers, project_permissions
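A minimal sketch of reading flow definitions from project_flows, assuming the json column is stored gzip-compressed (as it typically is in Azkaban); connection details and the fields pulled out of the flow definition are placeholders.

```python
import gzip
import io
import json
import mysql.connector

# Connect to the Azkaban MySQL database (credentials are placeholders).
conn = mysql.connector.connect(host="azkaban-db", user="azkaban",
                               password="azkaban", database="azkaban")
cursor = conn.cursor()

# Flow definitions live in project_flows; the json column is assumed to be
# gzip-compressed JSON.
cursor.execute("SELECT project_id, version, flow_id, json FROM project_flows")

flows = []
for project_id, version, flow_id, blob in cursor:
    try:
        payload = gzip.GzipFile(fileobj=io.BytesIO(blob)).read()
    except IOError:
        payload = blob  # not compressed
    flow_def = json.loads(payload)
    flows.append({"project_id": project_id, "version": version,
                  "flow_id": flow_id, "nodes": flow_def.get("nodes", [])})

with open("azkaban_flows.json", "w") as out:
    json.dump(flows, out)
```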
Major related file: AzkabanTransform.py, SchedulerTransform.py
Transform the JSON output into CSV format.
Major related file: AzkabanLoad.py, SchedulerLoad.py
Load into MySQL database. Major related tables: flow, flow_job, flow_dag, flow_schedule, flow_owner_permission, flow_execution, job_execution
Collect Oozie execution information, including workflow/coordinator jobs/actions, DAGs, executions, owners, and schedules.
Major related file: OozieExtract.py
Connect to Oozie MySQL database, collect metadata, and store in local file.
Major source tables from Oozie database: WF_JOBS, WF_ACTIONS, COORD_JOBS, COORD_ACTIONS
Major related file: OozieTransform.py, SchedulerTransform.py
Transform the JSON output into CSV format.
Major related file: OozieLoad.py, SchedulerLoad.py
Load into MySQL database. Major related tables: flow, flow_job, flow_dag, flow_schedule, flow_owner_permission, flow_execution, job_execution
TODO: add doc
Get lineage information for jobs scheduled through Azkaban. See the architecture overview for a high-level view of this process.
Major related file: lineage
In this lineage ETL process, you will need to communicate with the Hadoop job history nodes. If you have Hadoop security set up, make sure you have the right permissions. There are templates of the krb5.conf and gss-jaas.conf files in our repository. You need to put them in the right location (one of the following: the running directory, $WH_HOME, $USER_HOME/.kerberos, /var/tmp/.kerberos). You may also use other authentication methods as needed.
ETL jobs not only depend on the source system's data model, but sometimes also on the source system's data content. For example, LinkedIn heavily uses the Hourglass framework (https://engineering.linkedin.com/datafu/datafus-hourglass-incremental-data-processing-hadoop) for incremental processing in Hadoop. Because we mine some lineage and operational information from logs, we need to specify the log format in our system to capture that information.
In the AZKABAN_LINEAGE_METADATA_ETL job type, we need to use the following APIs to add configurations for the patterns we want to extract (a sketch of how the patterns are applied follows the list).
- Add filename pattern
  - Example: `(.*)/part-m-\d+\.avro`
  - Usage: filename pattern to abstract from the file level to the directory level
  - API: Add a filename pattern
- Add dataset partition layout pattern
  - Example: `(.*)/daily/(\d{4}/\d{2}/\d{2}).*`
  - Usage: partition pattern to abstract from the partition level to the dataset level
  - API: Add a dataset partition pattern
- Add log lineage pattern
  - Example: `Moving \S* to (/(?!tmp)[\S]*)`
  - Usage: log lineage pattern to extract lineage from logs
  - API: Add a lineage log pattern
- Add log reference job ID pattern
  - Example: `INFO - Starting Job = (job_\d+_\d+)`
  - Usage: pattern used to discover the Hadoop MapReduce job ID inside the log
  - API: Add a job ID log pattern
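A rough illustration of how such patterns are applied once configured; the paths and log lines below are made up.

```python
import re

# Example patterns from above (configured through the API in practice).
FILENAME_PATTERN = re.compile(r"(.*)/part-m-\d+\.avro")
PARTITION_PATTERN = re.compile(r"(.*)/daily/(\d{4}/\d{2}/\d{2}).*")
LINEAGE_PATTERN = re.compile(r"Moving \S* to (/(?!tmp)[\S]*)")
JOB_ID_PATTERN = re.compile(r"INFO - Starting Job = (job_\d+_\d+)")

# File level -> directory level.
m = FILENAME_PATTERN.match("/data/tracking/PageViewEvent/daily/2016/01/01/part-m-00001.avro")
print(m.group(1))  # /data/tracking/PageViewEvent/daily/2016/01/01

# Partition level -> dataset level.
m = PARTITION_PATTERN.match("/data/tracking/PageViewEvent/daily/2016/01/01")
print(m.group(1), m.group(2))  # /data/tracking/PageViewEvent 2016/01/01

# Extract a lineage target path from a log line.
m = LINEAGE_PATTERN.search("Moving hdfs://nn/user/out/_temp to /data/derived/output")
print(m.group(1))  # /data/derived/output

# Discover the Hadoop MapReduce job ID referenced in the log.
m = JOB_ID_PATTERN.search("INFO - Starting Job = job_1466000000000_123456")
print(m.group(1))  # job_1466000000000_123456
```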
TODO: add doc
TODO: add doc
This is an optional feature.
Almost every enterprise has an LDAP server. This information is essential for any metadata containing an LDAP ID. For example, if we have the owner ID of a dataset, we also would like to know more about the owner: email address, phone number, manager, department, etc. All that information comes from the LDAP server.
Major related files: LdapExtract.py
There are two major parts at this stage: individual user information and group/user membership mapping (see the sketch after this list).
- For an individual user, extract the following standardized attributes from the LDAP server: 'user_id', 'distinct_name', 'name', 'display_name', 'title', 'employee_number', 'manager', 'mail', 'department_number', 'department', 'start_date', 'mobile'. You can specify the actual LDAP return attributes in the job properties.
- For a group account, find all the users in the group. Because a group can contain subgroups, two mapping files are stored: one raw mapping file and one flattened mapping file that resolves all users in the subgroups.
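A minimal sketch of the individual-user extraction using the ldap3 library; the server, bind credentials, search base, filter, and attribute list are placeholders that the real LdapExtract.py reads from its job properties.

```python
import csv
from ldap3 import Server, Connection, SUBTREE

# Placeholder connection settings.
server = Server("ldap://ldap.example.com")
conn = Connection(server, user="cn=reader,dc=example,dc=com",
                  password="secret", auto_bind=True)

# Placeholder LDAP attributes that map onto the standardized fields above.
ATTRS = ["sAMAccountName", "distinguishedName", "cn", "displayName", "title",
         "employeeNumber", "manager", "mail", "departmentNumber",
         "department", "mobile"]

conn.search(search_base="ou=People,dc=example,dc=com",
            search_filter="(objectClass=person)",
            search_scope=SUBTREE,
            attributes=ATTRS)

with open("ldap_individual.csv", "w") as out:
    writer = csv.writer(out)
    for entry in conn.entries:
        attrs = entry.entry_attributes_as_dict
        # One CSV row per user, in the standardized attribute order.
        writer.writerow([",".join(str(v) for v in attrs.get(a, []))
                         for a in ATTRS])
```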
Major related files: LdapTransform.py
Additional derived information is added during the transform stage: the management hierarchy. The path from top management down to the specified user is generated by recursively looking up each user's manager until it reaches the CEO.
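A minimal sketch of that derivation, assuming a simple user-to-manager mapping built from the extracted 'manager' attribute; the IDs are made up.

```python
def hierarchy_path(user_id, manager_of, top="ceo"):
    """Walk the manager chain from a user up to the top of the org,
    returning e.g. 'ceo -> vp_eng -> manager_a -> user_x'."""
    path = [user_id]
    seen = {user_id}
    while user_id != top:
        user_id = manager_of.get(user_id)
        if user_id is None or user_id in seen:  # broken or cyclic chain
            break
        seen.add(user_id)
        path.append(user_id)
    return " -> ".join(reversed(path))

# user -> manager mapping derived from the extracted 'manager' attribute.
manager_of = {"user_x": "manager_a", "manager_a": "vp_eng", "vp_eng": "ceo"}
print(hierarchy_path("user_x", manager_of))  # ceo -> vp_eng -> manager_a -> user_x
```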
Major related files: LdapLoad.py
Load from the staging table into the final table.
Collect dataset metadata from Druid.
Major related file: DruidMetadataExtractor.java
Get metadata for data sources from the Druid cluster and store it in local JSON files.
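A minimal sketch of the same idea using the Druid coordinator's datasource endpoints from Python (the actual extractor is the Java class above); the coordinator URL is a placeholder.

```python
import json
import requests

# Placeholder coordinator URL.
COORDINATOR = "http://druid-coordinator.example.com:8081"

# List all data sources known to the cluster.
names = requests.get(COORDINATOR + "/druid/coordinator/v1/datasources").json()

datasources = []
for name in names:
    # '?full' returns segment detail, which carries dimension/metric names.
    detail = requests.get(
        COORDINATOR + "/druid/coordinator/v1/datasources/" + name,
        params={"full": ""}).json()
    datasources.append({"name": name, "detail": detail})

with open("druid_datasources.json", "w") as out:
    json.dump(datasources, out)
```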
Major related file: DruidMetadataLoader.java
Load from the local JSON files into staging tables and finally into the final tables.
Related tables: dict_dataset, dict_field_detail