
Unclear example of export to datahub #198

Open · ian-lewis-d opened this issue Nov 1, 2022 · 4 comments

ian-lewis-d commented Nov 1, 2022

I'm loving PiiCatcher, but I'm having some trouble pushing PII data to DataHub.

I'm following the instructions here https://tokern.io/docs/catalog/export/#installation and here https://tokern.io/docs/catalog/export/#basic-recipe.

The problem seems to occur when reading my SQLite database (which definitely has data).

Original Scan

$ piicatcher scan mysql --name mands_test_cat --database testdb --port 3306 --uri ***.rds.amazonaws.com --username *** --password ****

The SQLite file is called None and has the following structure:

None
  -> main
          -> alembic_version
          -> column_lineage
          -> columns
          -> default_schema
          -> job_executions
          -> jobs
          -> schemata
          -> sources
          -> sqlite_master
          -> tables
          -> tasks

This is my current export.yml file, which partially works (it talks to DataHub as expected):

source:
  type: dbcat.datahub.CatalogSource
  config:
    database: main
    source_names:
      - mands_test_cat
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"

On running $ datahub ingest -c ./export.yml, I receive the following error message:

sqlite3.OperationalError: no such table: sources

Do you have any suggestions for how I can approach this?
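As a sanity check (assuming the catalog file really is named "None", as reported above), Python's built-in sqlite3 module can list which tables the file actually contains, to confirm whether a "sources" table is present before datahub tries to read it. This is just a debugging sketch, not part of piicatcher or datahub:

```python
import sqlite3


def list_tables(db_path: str) -> list[str]:
    """Return the names of all tables in a SQLite database file."""
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


# Example: list_tables("None") should include "sources", "tables",
# "columns", etc. if the catalog was written where you expect it.
```

If "sources" is missing from the output, the recipe is probably pointed at a different (or empty) catalog file than the one the scan wrote.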

vrajat (Member) commented Nov 2, 2022 via email

ian-lewis-d (Author) commented Nov 24, 2022

Hi @vrajat,

My understanding from the Datahub engineering team is that they do not support piicatcher input.

Obviously, if you have a working solution it would be great to see it.

I am now able to run the datahub import command against the piicatcher SQLite db (using the dbcat.datahub.CatalogSource configuration).

However, the import to DataHub is unsuccessful and tags are not applied.

Here is an example of the errors I see:

[2022-11-24 09:40:47,171] INFO     {datahub.ingestion.run.pipeline:174} - Sink configured successfully. DataHubRestEmitter: configured to talk to http://localhost:8080/
[2022-11-24 09:40:53,745] INFO     {datahub.ingestion.run.pipeline:197} - Source configured successfully.
[2022-11-24 09:40:53,746] INFO     {datahub.cli.ingest_cli:120} - Starting metadata ingestion
-[2022-11-24 09:40:53,985] ERROR    {datahub.ingestion.run.pipeline:57} -  failed to write record with workunit loan_management.account_holder with ('Unable to emit metadata to DataHub GMS', {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException', 'stackTrace': 'com.linkedin.restli.server.RestLiServiceException [HTTP Status:422]: com.linkedin.metadata.entity.validation.ValidationException: Failed to validate record with class com.linkedin.entity.Entity: ERROR :: 
/value/com.linkedin.metadata.snapshot.DatasetSnapshot/aspects/0/com.linkedin.schema.SchemaMetadata/fields/4/globalTags/tags/1/tag :: "Provided urn [urn.li](http://urn.li/).tag.ADDRESS" is invalid\nERROR :: 
/value/com.linkedin.metadata.snapshot.DatasetSnapshot/aspects/0/com.linkedin.schema.SchemaMetadata/fields/5/globalTags/tags/1/tag :: "Provided urn urn.li.tag.PERSON" is invalid\n', 'message': 
'com.linkedin.metadata.entity.validation.ValidationException: Failed to validate record with class com.linkedin.entity.Entity: 
ERROR :: /value/com.linkedin.metadata.snapshot.DatasetSnapshot/aspects/0/c', 'status': 422, 'id': 'urn:li:dataset:
(urn:li:dataPlatform:mysql,<snip>.account_holder,PROD)'}) and info {'exceptionClass': 
'com.linkedin.restli.server.RestLiServiceException', 'stackTrace': 'com.linkedin.restli.server.RestLiServiceException [HTTP Status:422]: com.linkedin.metadata.entity.validation.ValidationException:
....
....
              {'error': 'Unable to emit metadata to DataHub GMS',
               'info': {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException',
                        'stackTrace': 'com.linkedin.restli.server.RestLiServiceException [HTTP Status:422]: '
                                      'com.linkedin.metadata.entity.validation.ValidationException: Failed to validate record with class '
                                      'com.linkedin.entity.Entity: ERROR :: '
                                      '/value/com.linkedin.metadata.snapshot.DatasetSnapshot/aspects/0/com.linkedin.schema.SchemaMetadata/fields/3/globalTags/tags/1/tag '
                                      ':: "Provided urn urn.li.tag.PERSON" is invalid\n'
                                      '\n'
                                      '\tat com.linkedin.metadata.resources.entity.EntityResource.ingest(EntityResource.java:213)',
                        'message': 'com.linkedin.metadata.entity.validation.ValidationException: Failed to validate record with class '
                                   'com.linkedin.entity.Entity: ERROR :: /value/com.linkedin.metadata.snapshot.DatasetSnapshot/aspects/0/c',
                        'status': 422,
                        'id': 'urn:li:dataset:(urn:li:dataPlatform:mysql,<snip>.editions,PROD)'}},
              '... sampled of 87 total elements'],
 'start_time': '2022-11-24 09:40:47.165725 (11.66 seconds ago).',
 'current_time': '2022-11-24 09:40:58.824433 (now).',
 'total_duration_in_seconds': '11.66',
 'gms_version': 'v0.9.2',
 'pending_requests': '0'}

 Pipeline finished with at least 87 failures ; produced 181 events in 5.08 seconds.
All of the errors look similar to: "Provided urn urn.li.tag.ADDRESS" is invalid.
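For what it's worth, DataHub tag URNs use colon separators (urn:li:tag:NAME), while the rejected values in the log above are dot-separated (urn.li.tag.PERSON), which is why GMS refuses them with HTTP 422. A minimal sketch of the expected shape (my own helper names for illustration, not a piicatcher or datahub API):

```python
import re

# Valid DataHub tag URNs look like "urn:li:tag:<name>" (colon-separated).
# The values rejected above ("urn.li.tag.PERSON") use dots instead.
TAG_URN_RE = re.compile(r"^urn:li:tag:[^:]+$")


def is_valid_tag_urn(urn: str) -> bool:
    """Check whether a string matches the DataHub tag URN shape."""
    return TAG_URN_RE.fullmatch(urn) is not None


def make_tag_urn(name: str) -> str:
    """Build a tag URN in the colon-separated form DataHub accepts."""
    return f"urn:li:tag:{name}"
```

So whatever is emitting the tags appears to be producing dot-separated strings where DataHub's validator expects colons.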

What am I missing to get this to ingest?

vrajat (Member) commented Nov 25, 2022

@yjagdale will you be able to help?

yjagdale (Contributor) commented:

Sure @vrajat, I will have a look at it!
