WhereHows ETL jobs

This directory contains the code for all the ETL jobs that WhereHows can run.

  1. HDFS Dataset
  2. Teradata Dataset
  3. Oracle Dataset
  4. Hive Dataset
  5. Azkaban Execution
  6. Oozie Execution
  7. Appworx Execution
  8. Azkaban Lineage
  9. Appworx Lineage
  10. Elastic Search
  11. LDAP Directory
  12. Druid Dataset

The ETL jobs are scheduled and configured via job files. Please refer to this directory for more details.

HDFS Dataset

Collect dataset metadata from HDFS.

Job file template

Extract

Major related module: wherehows-hadoop

The standalone module 'wherehows-hadoop' is responsible for this process.

In a real production environment, the machine used to run this ETL job is usually different from the machine that serves as the gateway to Hadoop, so we need to copy the runnable JAR file to that remote machine, execute it there, and copy the result back.

At compile time, the module wherehows-hadoop is packaged into a standalone JAR file. At runtime, a Jython script copies it to the Hadoop gateway, remotely runs it on the Hadoop gateway, and copies the result back.

Inside the module, we use a whitelist of folders (configured through parameters) as the starting point to scan through folders and files. After abstracting individual files up to the dataset level, we extract schema, sample data, and related metadata from them. The final step stores the results in two files: a metadata file and a sample-data file.
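A minimal sketch of the copy-run-copy-back orchestration, assuming passwordless SSH to the gateway; the host name, paths, JAR name, and command-line arguments are illustrative, not the module's actual interface:

```python
import subprocess

GATEWAY = "hadoop-gateway.example.com"        # hypothetical Hadoop gateway host
JAR = "wherehows-hadoop.jar"                  # standalone JAR built at compile time
REMOTE_DIR = "/tmp/wherehows"
WHITELIST = "/data/tracking,/data/databases"  # whitelist of folders to scan (job parameter)

def run(cmd):
    """Run a shell command and fail loudly on a non-zero exit code."""
    subprocess.check_call(cmd, shell=True)

# 1. Copy the runnable JAR to the Hadoop gateway.
run("scp %s %s:%s/" % (JAR, GATEWAY, REMOTE_DIR))

# 2. Run it remotely; it scans the whitelisted folders and writes two result files.
run("ssh %s 'cd %s && java -jar %s %s'" % (GATEWAY, REMOTE_DIR, JAR, WHITELIST))

# 3. Copy the metadata and sample-data files back for the transform step.
run("scp %s:%s/metadata.dat %s:%s/sample_data.dat ." % (GATEWAY, REMOTE_DIR, GATEWAY, REMOTE_DIR))
```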

Transform

Major related file: HdfsTransform.py

Transform the JSON output into CSV format for easy loading.
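A minimal sketch of the idea, assuming the extractor writes one JSON record per line; the field names and file names here are illustrative:

```python
import csv
import json

with open("hdfs_metadata.json") as src, open("hdfs_metadata.csv", "w") as dst:
    writer = csv.writer(dst)
    for line in src:
        record = json.loads(line)
        # Flatten the nested JSON record into the flat column order the loader expects.
        writer.writerow([
            record.get("dataset_urn"),    # e.g. hdfs:///data/tracking/PageViewEvent
            record.get("schema"),         # serialized schema string
            record.get("source"),         # e.g. 'hdfs'
            record.get("modified_time"),
        ])
```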

Load

Major related file: HdfsLoad.py

Load into the MySQL database. Related tables: dict_dataset, dict_dataset_sample, dict_field_detail
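A minimal sketch of the load step, assuming the MySQLdb driver, a staging table named stg_dict_dataset, and illustrative column names and credentials:

```python
import MySQLdb

conn = MySQLdb.connect(host="localhost", user="wherehows", passwd="wherehows",
                       db="wherehows", local_infile=1)
cursor = conn.cursor()

# Bulk-load the transformed CSV into the staging table ...
cursor.execute("""
    LOAD DATA LOCAL INFILE 'hdfs_metadata.csv'
    INTO TABLE stg_dict_dataset
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
""")

# ... then merge the staged rows into the final dict_dataset table.
cursor.execute("""
    INSERT INTO dict_dataset (urn, `schema`, source, modified_time)
    SELECT urn, `schema`, source, modified_time
    FROM stg_dict_dataset
    ON DUPLICATE KEY UPDATE `schema` = VALUES(`schema`),
                            modified_time = VALUES(modified_time)
""")
conn.commit()
conn.close()
```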

Teradata Dataset

Collect dataset metadata from Teradata.

Job file template

Extract

Major related file: TeradataExtract.py

Get metadata from the Teradata DBC data dictionary and store it in a local JSON file.

Major source tables: DBC.Tables, DBC.Columns, DBC.ColumnStatsV, DBC.TableSize, DBC.Indices, DBC.IndexConstraints

A SELECT TOP 10 query is used to fetch the sample data.
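A minimal sketch of the per-table extraction, assuming a DB-API cursor from a Teradata driver; the helper name and parameter style are illustrative:

```python
def extract_table(cursor, db_name, table_name):
    """Fetch column metadata from the DBC data dictionary plus a few sample rows."""
    cursor.execute("""
        SELECT ColumnName, ColumnType, ColumnLength, Nullable
        FROM DBC.Columns
        WHERE DatabaseName = ? AND TableName = ?
    """, (db_name, table_name))
    columns = cursor.fetchall()

    # 'SELECT TOP 10' for sample data.
    cursor.execute('SELECT TOP 10 * FROM "%s"."%s"' % (db_name, table_name))
    sample_rows = cursor.fetchall()
    return columns, sample_rows
```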

Transform

Major related file: TeradataTransform.py

Transform the JSON output into CSV format for easy loading.

Load

Major related file: TeradataLoad.py

Load into the MySQL database.

Related tables: dict_dataset, dict_field_detail, dict_dataset_sample

Oracle Dataset

Collect dataset metadata from Oracle DB.

Job file template

Extract

Major related file: OracleExtract.py

Connect to the Oracle database to get all the table/column/comment information, excluding the databases in the exclude list. Additional table information, including indices, constraints, and partitions, is also fetched. The results are formatted and stored in two CSV files, one for table records and the other for field records.

Major source tables: ALL_TABLES, ALL_TAB_COLUMNS, ALL_COL_COMMENTS, ALL_INDEXES, ALL_IND_COLUMNS, ALL_CONSTRAINTS, ALL_PART_KEY_COLUMNS
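A minimal sketch of the field-level extraction, assuming the cx_Oracle driver; the connection string, exclude list, and output file are illustrative:

```python
import csv
import cx_Oracle

EXCLUDED = "'SYS', 'SYSTEM', 'XDB'"   # schemas to skip (the job's exclude list)

conn = cx_Oracle.connect("wherehows/secret@//oracle-host:1521/ORCL")
cursor = conn.cursor()
cursor.execute("""
    SELECT c.owner, c.table_name, c.column_name, c.data_type, c.data_length, cc.comments
    FROM   all_tab_columns c
    LEFT JOIN all_col_comments cc
           ON cc.owner = c.owner
          AND cc.table_name = c.table_name
          AND cc.column_name = c.column_name
    WHERE  c.owner NOT IN (%s)
""" % EXCLUDED)

# Write the field records to one of the two output CSV files.
with open("oracle_field_metadata.csv", "w") as out:
    writer = csv.writer(out)
    writer.writerow(["owner", "table", "column", "type", "length", "comment"])
    for row in cursor:
        writer.writerow(row)
conn.close()
```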

Transform

Not needed.

Load

Major related file: OracleLoad.py

Load into the MySQL database, similar to HiveLoad or HdfsLoad.

Related tables: dict_dataset, dict_field_detail, dict_dataset_sample

Hive Dataset

Collect dataset metadata from Hive.

Job file template

Extract

Major related file: HiveExtract.py

Connect to the Hive Metastore to get the Hive table/view information and store it in a local JSON file.

Major source tables: COLUMNS_V2, SERDE_PARAMS
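A minimal sketch of reading column metadata from the metastore's backing MySQL database, assuming the MySQLdb driver; the join path follows the standard metastore schema, and the credentials are illustrative:

```python
import MySQLdb

conn = MySQLdb.connect(host="metastore-db", user="hive", passwd="hive", db="hive")
cursor = conn.cursor()
cursor.execute("""
    SELECT d.NAME, t.TBL_NAME, c.COLUMN_NAME, c.TYPE_NAME, c.INTEGER_IDX
    FROM   TBLS t
    JOIN   DBS d        ON d.DB_ID = t.DB_ID
    JOIN   SDS s        ON s.SD_ID = t.SD_ID
    JOIN   COLUMNS_V2 c ON c.CD_ID = s.CD_ID
    ORDER BY d.NAME, t.TBL_NAME, c.INTEGER_IDX
""")
for db_name, table, column, col_type, position in cursor:
    print(db_name, table, column, col_type, position)   # the real job writes these to the JSON file
conn.close()
```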

Transform

Major related file: HiveTransform.py

Transform the JSON output into CSV format for easy loading.

Load

Major related file: HiveLoad.py

Load into the MySQL database.

Related tables: dict_dataset

Azkaban Execution

Collect Azkaban execution information, including Azkaban flow/job definitions, DAGs, executions, owners, and schedules.

Job file template

Extract

Major related file: AzkabanExtract.py

Connect to the Azkaban MySQL database, collect metadata, and store it in a local file.

Major source tables from Azkaban database: project_flows, execution_flows, triggers, project_permissions
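A minimal sketch of pulling flow definitions out of project_flows, assuming the MySQLdb driver and that the json column holds gzip-compressed flow JSON; the credentials are illustrative:

```python
import json
import zlib
import MySQLdb

conn = MySQLdb.connect(host="azkaban-db", user="reader", passwd="secret", db="azkaban")
cursor = conn.cursor()
cursor.execute("SELECT project_id, flow_id, version, json FROM project_flows")

for project_id, flow_id, version, blob in cursor:
    # Azkaban usually gzips the flow definition; fall back to plain JSON otherwise.
    try:
        payload = zlib.decompress(blob, 16 + zlib.MAX_WBITS)
    except zlib.error:
        payload = blob
    flow = json.loads(payload)
    print(project_id, flow_id, version, len(flow.get("nodes", [])))
conn.close()
```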

Transform

Major related file: AzkabanTransform.py, SchedulerTransform.py

Transform the JSON output into CSV format.

Load

Major related file: AzkabanLoad.py, SchedulerLoad.py

Load into the MySQL database. Major related tables: flow, flow_job, flow_dag, flow_schedule, flow_owner_permission, flow_execution, job_execution

Oozie Execution

Collect Oozie execution information, including workflow and coordinator job/action definitions, DAGs, executions, owners, and schedules.

Job file template

Extract

Major related file: OozieExtract.py

Connect to the Oozie MySQL database, collect metadata, and store it in a local file.

Major source tables from Oozie database: WF_JOBS, WF_ACTIONS, COORD_JOBS, COORD_ACTIONS
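A minimal sketch of pulling recent workflow executions and their actions, assuming the MySQLdb driver; the credentials and the time window are illustrative:

```python
import MySQLdb

conn = MySQLdb.connect(host="oozie-db", user="reader", passwd="secret", db="oozie")
cursor = conn.cursor()
cursor.execute("""
    SELECT w.id, w.app_name, w.user_name, w.status, w.start_time, w.end_time,
           a.name, a.type, a.status
    FROM   WF_JOBS w
    JOIN   WF_ACTIONS a ON a.wf_id = w.id
    WHERE  w.end_time > DATE_SUB(NOW(), INTERVAL 1 DAY)
""")
for row in cursor:
    print(row)   # the real job writes these rows to the local staging file
conn.close()
```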

Transform

Major related file: OozieTransform.py, SchedulerTransform.py

Transform the JSON output into CSV format.

Load

Major related file: OozieLoad.py, SchedulerLoad.py

Load into the MySQL database. Major related tables: flow, flow_job, flow_dag, flow_schedule, flow_owner_permission, flow_execution, job_execution

Appworx Execution

TODO: add doc

Job file template

Azkaban Lineage

Get lineage information for jobs scheduled through Azkaban. See the architecture documentation for a high-level view of this process.

Major related file: lineage

Job file template

Security

In this lineage ETL process, you will need to communicate with the Hadoop job history nodes. If you have Hadoop security set up, make sure you have the right permissions. There are templates of the krb5.conf and gss-jaas.conf files in our repository. You need to put them in the right location (one of the following: the running directory, $WH_HOME, $USER_HOME/.kerberos, /var/tmp/.kerberos). You may also use another authentication method as needed.

Content Related Configuration

ETL jobs not only depend on the source system's data model, but sometimes also on the source system's data content. For example, LinkedIn heavily uses the Hourglass framework (https://engineering.linkedin.com/datafu/datafus-hourglass-incremental-data-processing-hadoop) for incremental processing in Hadoop. Because we mine some lineage and operational information from logs, we need to specify the log format in our system to capture that information.

In the AZKABAN_LINEAGE_METADATA_ETL job type, we need to use the following APIs to add configurations for the patterns we want to extract; a minimal sketch of applying these patterns follows the list.

  • Add filename pattern

    • Example: (.*)/part-m-\d+.avro
    • Usage: filename pattern used to abstract from the file level to the directory level
    • API: Add a filename pattern
  • Add dataset partition layout pattern

  • Add log lineage pattern

    • Example: Moving \S* to (/(?!tmp)[\S]*)
    • Usage: log lineage pattern used to extract lineage from logs
    • API: Add a lineage log pattern
  • Add log reference job ID pattern

    • Example: INFO - Starting Job = (job_\d+_\d+)
    • Usage: pattern used to discover the Hadoop map-reduce job ID inside the log
    • API: Add a job ID log pattern
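A minimal sketch of how these patterns behave, using Python's re module on the examples above; the sample paths and log lines are illustrative:

```python
import re

FILENAME_PATTERN = re.compile(r"(.*)/part-m-\d+.avro")
LOG_LINEAGE_PATTERN = re.compile(r"Moving \S* to (/(?!tmp)[\S]*)")
JOB_ID_PATTERN = re.compile(r"INFO - Starting Job = (job_\d+_\d+)")

# Abstract a file path up to its directory (dataset) level.
m = FILENAME_PATTERN.match("/data/tracking/PageViewEvent/part-m-00001.avro")
print(m.group(1))        # -> /data/tracking/PageViewEvent

# Extract the lineage target path from a log line (ignoring /tmp destinations).
m = LOG_LINEAGE_PATTERN.search("Moving /tmp/stage/output to /data/derived/output")
print(m.group(1))        # -> /data/derived/output

# Discover the Hadoop map-reduce job ID referenced in the log.
m = JOB_ID_PATTERN.search("INFO - Starting Job = job_1499999999999_12345")
print(m.group(1))        # -> job_1499999999999_12345
```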

Appworx Lineage

TODO: add doc

Job file template

Elastic Search

TODO: add doc

Job file template

LDAP Directory

This is an optional feature.

Almost every enterprise has an LDAP server, and its directory information is essential for any metadata containing an LDAP ID. For example, if we have the owner ID of a dataset, we would also like to know more about the owner: email address, phone number, manager, department, etc. All of that information comes from the LDAP server.

Job file template

Extract

Major related file: LdapExtract.py

There are two major parts at this stage: individual information and group/user mapping; a minimal extraction sketch follows the list below.

  • For an individual user, extract the following standardized attributes from the LDAP server:

    'user_id', 'distinct_name', 'name', 'display_name', 'title', 'employee_number', 'manager', 'mail', 'department_number', 'department', 'start_date', 'mobile'

    You can specify the actual LDAP return attributes in the job properties.

  • For a group account, find all the users in the group. Because a group can contain subgroups, two mapping files are stored: a raw mapping file and a flattened mapping file that resolves all users in the subgroups.
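A minimal sketch of the individual-user extraction, assuming the python-ldap module; the server URL, bind credentials, base DN, and attribute mapping are illustrative and would normally come from the job properties:

```python
import ldap

# Raw LDAP attributes that map onto the standardized fields listed above.
RETURN_ATTRS = ["sAMAccountName", "distinguishedName", "cn", "displayName", "title",
                "employeeNumber", "manager", "mail", "departmentNumber",
                "department", "mobile"]

conn = ldap.initialize("ldap://ldap.example.com")
conn.simple_bind_s("cn=reader,dc=example,dc=com", "secret")

results = conn.search_s("ou=people,dc=example,dc=com",
                        ldap.SCOPE_SUBTREE,
                        "(objectClass=person)",
                        RETURN_ATTRS)

for dn, attrs in results:
    print(dn, attrs.get("mail"), attrs.get("manager"))   # written to a local file in the real job
conn.unbind_s()
```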

Transform

Major related file: LdapTransform.py

Additional derived information is added during the transform stage: the management hierarchy. The path from top management down to a given user is generated by recursively looking up the manager until the chain reaches the CEO.
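A minimal sketch of the hierarchy derivation, assuming a user_id -> manager_id mapping built from the extracted records; the sample IDs are illustrative:

```python
def hierarchy_path(user_id, manager_of, ceo_id, max_depth=20):
    """Walk the manager chain upward until the CEO (or a broken chain / depth limit)."""
    path = [user_id]
    current = user_id
    while current != ceo_id and len(path) <= max_depth:
        current = manager_of.get(current)
        if current is None:      # broken chain: stop at the last known manager
            break
        path.append(current)
    return "/" + "/".join(reversed(path))

manager_of = {"alice": "bob", "bob": "carol", "carol": "ceo"}
print(hierarchy_path("alice", manager_of, "ceo"))   # -> /ceo/carol/bob/alice
```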

Load

Major related file: LdapLoad.py

Load from the staging table into the final table.

Druid Dataset

Collect dataset metadata from Druid.

Job file template

Extract

Major related file: DruidMetadataExtractor.java

Get the metadata of datasources from the Druid cluster and store it in local JSON files.
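A minimal Python sketch of the same idea (the real extractor is Java), assuming the requests library and the Druid coordinator's datasource API; the coordinator URL and output file are illustrative:

```python
import json
import requests

COORDINATOR = "http://druid-coordinator.example.com:8081"

# List every datasource, then fetch its full metadata (segments, dimensions, metrics).
names = requests.get(COORDINATOR + "/druid/coordinator/v1/datasources").json()
with open("druid_metadata.json", "w") as out:
    for name in names:
        detail = requests.get(
            COORDINATOR + "/druid/coordinator/v1/datasources/%s?full" % name).json()
        out.write(json.dumps(detail) + "\n")
```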

Load

Major related file: DruidMetadataLoader.java

Load from the local JSON files into staging tables and finally into the final tables.

Related tables: dict_dataset, dict_field_detail