Add SparkSQLSource doc (feathr-ai#1102)
* add spark sql doc

Signed-off-by: Yuqing Wei <[email protected]>

* Clean up redis keys created by CI tests (feathr-ai#1100)

* Bump version to 1.0.0 (feathr-ai#1104)

* Update ARM template to use v1.0.0 tag (feathr-ai#1106)

- Update pre-built docker image from `feathrfeaturestore/feathr-registry:releases-v0.9.0` to `feathrfeaturestore/feathr-registry:releases-v1.0.0`

* Improve CI/CD workflow configuration (feathr-ai#1105)

- Update workflow names to be more descriptive
- Restrict the `pull_request_target` configuration to workflows that require secret access
- Isolate the Gradle test from the E2E test, and trigger it only for Scala changes

* Add a guide to use Feathr in MLOps v2 Solution Accelerator (feathr-ai#1103)

* Improve Quickstart and Release Guide for v1.0.0 (feathr-ai#1107)

- Update the quickstart guide to make it easier for users to get started, validate feature definitions and develop new things

* Implement an optional null filter before join (feathr-ai#1098)

* Add null filter

* Add spark flag

* filter obs data nulls

* Remove feature data null handling

* Update test

* remove additional test

---------

Co-authored-by: Rakesh Kashyap Hanasoge Padmanabha <[email protected]>
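The optional null filter described in the entry above can be sketched in plain Python; the function name and row layout here are illustrative assumptions, not Feathr's actual API:

```python
# Hypothetical sketch of an optional pre-join null filter: drop observation
# rows whose join-key values are null before joining with feature data.

def filter_null_keys(observations, key_columns):
    """Keep only rows where every join-key column is non-null."""
    return [
        row for row in observations
        if all(row.get(col) is not None for col in key_columns)
    ]

obs = [
    {"user_id": 1, "ts": "2023-01-01"},
    {"user_id": None, "ts": "2023-01-02"},  # dropped: null join key
    {"user_id": 3, "ts": None},             # kept: ts is not a join key
]
filtered = filter_null_keys(obs, key_columns=["user_id"])
```

In Spark this would typically be a `df.na.drop(subset=key_columns)` on the observation DataFrame, gated by the flag mentioned above.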

* Add support for external SWA library (feathr-ai#1093)

* working test

* Minor comment

* bump version

* documentation update

* update version

---------

Co-authored-by: rkashyap <[email protected]>
Co-authored-by: Rakesh Kashyap Hanasoge Padmanabha <[email protected]>

* Update GitHub Actions for building and pushing images (feathr-ai#1109)

This PR addresses Spark materialize job failures on machines with an ARM platform, such as Mac M1, caused by amd64 versions of Python packages and Maven jars being pre-fetched during Docker image creation. To resolve this, the Sandbox Docker GitHub action is updated to support the ARM64 platform.

- Update job name in `.github/workflows/publish-to-dockerhub.yml`
- Update `build-push-action` from v3 to v4
- Add `setup-qemu-action` and `setup-buildx-action`
- Add support for the Linux/AMD64 and Linux/ARM64 platforms
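The steps above can be sketched as a workflow fragment. The job name, image tag, and the `@v2` versions of the QEMU and Buildx actions are illustrative assumptions; only `build-push-action@v4` is named in the changelog:

```yaml
jobs:
  publish:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up QEMU                 # emulation for cross-platform builds
        uses: docker/setup-qemu-action@v2
      - name: Set up Docker Buildx        # multi-platform build driver
        uses: docker/setup-buildx-action@v2
      - name: Build and push
        uses: docker/build-push-action@v4
        with:
          platforms: linux/amd64,linux/arm64
          push: true
          tags: feathrfeaturestore/feathr-sandbox:latest  # hypothetical tag
```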

* Upgrade actions/checkout version from v2 to v3 to clean up node 12 deprecated warnings (feathr-ai#1110)

- Upgrade action checkout version from `v2` to `v3`

* add simulate time delay feature (feathr-ai#1108)

* Support sql expression in FDSExtract (feathr-ai#1112)

* Add Fake Data Generator (feathr-ai#1113)

* Add Fake Data Generator

* update

* Update data_generator.py

* Update README (feathr-ai#1119)

* Update README to reflect the latest thought

* update readme

* Allow alien value in MVEL-based derivations (feathr-ai#1120)

* Fix feathr hocon command (feathr-ai#1121)

* Honor debug.output.num.parts in debug mode (feathr-ai#1122)

* Fix "value is not a valid dict" (feathr-ai#1111) (feathr-ai#1126)

Fix "value is not a valid dict"
when accessing the sql-registry API `/projects/{project}/datasources/{datasource}`

Co-authored-by: brianxiao <[email protected]>

* Fix skipping features when derived feature contains a swa feature (feathr-ai#1128)

* Fix skipping features when derived feature contains a swa feature

* Fix comments

* Update documentation

* update version

---------

Co-authored-by: Rakesh Kashyap Hanasoge Padmanabha <[email protected]>

* Skip snowflakes test in CI (feathr-ai#1131)

* Add Feathr chat bot in the notebook(Experimental, powered by ChatGPT) (feathr-ai#1132)

* ChatGPT integration

* Delay version bump

* Fix bug when SWA hdfs and local paths without data.avro.json extensio… (feathr-ai#1130)

* fix bug when SWA hdfs and local paths without data.avro.json extensions are included for evaluation

* try

* Fix tests

* revert test file

* Add tests

* Add private classifier to variable

* fix test

* fix test

---------

Co-authored-by: Rakesh Kashyap Hanasoge Padmanabha <[email protected]>

* Revert mvel log (feathr-ai#1140)

* Revert "Allow alien value in MVEL-based derivations (feathr-ai#1120) and remove stdout statements"

This reverts commit 55290e7.

* updating rc version after last commit

---------

Co-authored-by: Anirudh Agarwal <[email protected]>

* Exclude experimental changes under feathr/chat for test coverage check (feathr-ai#1142)

* Revert "Update GitHub Actions for building and pushing images (feathr-ai#1109)" (feathr-ai#1141)

* Add try and catch for getTensorFeatures (feathr-ai#1136)

* Add try and catch for getTensorFeatures

* Attach the original exception with the throw

---------

Co-authored-by: Minh Nguyen <[email protected]>

* Enable override_time_delay (feathr-ai#1144)

* Update query_feature_list.py

* Update query_feature_list.py

* Fix incorrect merge in PR feathr-ai#1141

* #latest should pick the latest available path (feathr-ai#1146)

* #latest should pick the latest available path

* update gradle.properties

* add empty folder

---------

Co-authored-by: Rakesh Kashyap Hanasoge Padmanabha <[email protected]>
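The `#latest` fix above can be illustrated with a small sketch: rather than blindly taking the lexicographically last partition, pick the newest partition that actually exists. All names here are hypothetical, not Feathr's implementation:

```python
# Illustrative resolver for a "#latest" path placeholder: walk candidate
# partitions from newest to oldest and return the first one that exists.

def resolve_latest(partitions, exists):
    """Return the newest partition path for which exists(path) is True."""
    for path in sorted(partitions, reverse=True):
        if exists(path):
            return path
    raise FileNotFoundError("no available partition under #latest")

available = {"/data/daily/2023/05/29", "/data/daily/2023/05/30"}
candidates = [
    "/data/daily/2023/05/29",
    "/data/daily/2023/05/30",
    "/data/daily/2023/05/31",  # listed but not yet written
]
latest = resolve_latest(candidates, exists=lambda p: p in available)
```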

* Update README.md

* Update README.md

* Section spell fix (feathr-ai#1147)

* Update troubleshoot-feature-definition.md

* Add a flag for adding a default value column for missing data features (feathr-ai#1149)

* WIP: safe mode

* Add swallowedExceptionHandler

* Fix minor bug

* Address comments

* version bump

---------

Co-authored-by: Rakesh Kashyap Hanasoge Padmanabha <[email protected]>
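The flag described in the entry above can be sketched in plain Python: when a feature's source data is entirely missing and the flag is on, emit a column filled with the feature's default value instead of failing the join. Function and parameter names are hypothetical:

```python
# Hedged sketch of a "default column for missing data features" flag.

def join_with_defaults(obs_rows, feature_values, feature_name,
                       default, add_default_col=True):
    joined = []
    for row in obs_rows:
        out = dict(row)
        if feature_values is None:       # feature data entirely missing
            if add_default_col:
                out[feature_name] = default
            # with the flag off, the column is simply absent
        else:
            out[feature_name] = feature_values.get(row["key"], default)
        joined.append(out)
    return joined

rows = [{"key": "a"}, {"key": "b"}]
result = join_with_defaults(rows, None, "f_total_trips", default=0.0)
```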

* Improve debug logging (feathr-ai#1150)

* Suppressed exceptions api (feathr-ai#1152)

* Add another API for accessing doJoinObsAndFeatures which suppresses exceptions

* version bump

---------

Co-authored-by: Rakesh Kashyap Hanasoge Padmanabha <[email protected]>

* Fix doJoinObsAndFeaturesWithSuppressedExceptions API (feathr-ai#1153)

Co-authored-by: Rakesh Kashyap Hanasoge Padmanabha <[email protected]>

* minor version bump to 1.0.2-rc9 (feathr-ai#1154)

Co-authored-by: Rakesh Kashyap Hanasoge Padmanabha <[email protected]>

* made function interface consistent with underlying delegation call (feathr-ai#1156)

Co-authored-by: Anirudh Agarwal <[email protected]>

* minor version bump (feathr-ai#1157)

Co-authored-by: Anirudh Agarwal <[email protected]>

* Update feathr-snowflake-guide.md

* fix debug path limit (feathr-ai#1160)

* Add default column for missing features (feathr-ai#1158)

* Add default column for missing features

* Fix failing test

* Fix SWA sparksession issue

* address comments

* Add comment

* bump version

---------

Co-authored-by: Rakesh Kashyap Hanasoge Padmanabha <[email protected]>
Co-authored-by: Jinghui Mo <[email protected]>

* Add new multi-level aggregation framework and bucketed count distinct aggregation. (feathr-ai#1159)

The bucketed aggregation works by aggregating data at a lower-level timestamp bucket, e.g. a 5-minute bucket, then leveraging the lower-level bucket results to produce higher-level aggregation results such as 1 hour, 1 day, etc.

The supported levels are 5 minutes, 1 hour, 1 week, 1 month, and 1 year.
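The multi-level idea above can be shown with a minimal pure-Python sketch for count distinct: collect distinct values into 5-minute bucket sets, then merge bucket sets to answer coarser windows without rescanning raw events. Names are illustrative, not Feathr's API:

```python
# Bucketed count-distinct sketch: fine-grained 5-minute buckets of sets,
# rolled up on demand into coarser windows (e.g. 1 hour).
from collections import defaultdict

BUCKET_SECONDS = 5 * 60

def bucketize(events):
    """events: iterable of (timestamp_seconds, value). Returns {bucket: set}."""
    buckets = defaultdict(set)
    for ts, value in events:
        buckets[ts // BUCKET_SECONDS].add(value)
    return buckets

def count_distinct(buckets, start_ts, end_ts):
    """Merge bucket sets covering [start_ts, end_ts) and count distinct values."""
    merged = set()
    for b in range(start_ts // BUCKET_SECONDS, -(-end_ts // BUCKET_SECONDS)):
        merged |= buckets.get(b, set())
    return len(merged)

events = [(10, "u1"), (400, "u2"), (400, "u1"), (3700, "u3")]
buckets = bucketize(events)
hour_count = count_distinct(buckets, 0, 3600)  # u1 and u2 fall in the first hour
```

Note that count distinct requires keeping sets (not counts) at the bucket level, since per-bucket counts cannot be merged without double-counting; a bucketed sum, by contrast, can store plain numbers.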

* Fix bug when skipping missing feature data (feathr-ai#1161)

Co-authored-by: Rakesh Kashyap Hanasoge Padmanabha <[email protected]>

* version bump (feathr-ai#1162)

* version bump

* add logs

---------

Co-authored-by: Rakesh Kashyap Hanasoge Padmanabha <[email protected]>

* fix for handling missing feature data (feathr-ai#1163)

Co-authored-by: Anirudh Agarwal <[email protected]>

* Fix bug when skipping anchored features with missing data (feathr-ai#1164)

Co-authored-by: Rakesh Kashyap Hanasoge Padmanabha <[email protected]>

* minor version bump to consume latest fix (feathr-ai#1165)

Co-authored-by: Anirudh Agarwal <[email protected]>

* Allow alien value in MVEL-based derivations (feathr-ai#1120) (feathr-ai#1166)

Add a feature value wrapper for third-party feature value compatibility

* add bucketed_sum aggregation (feathr-ai#1168)

* Seq join bug fix (feathr-ai#1169)

* Seq join bug fix

* Address comments

* version bump

---------

Co-authored-by: Rakesh Kashyap Hanasoge Padmanabha <[email protected]>

* Fix failing tests

* Support high-dimensional tensor in derivations (feathr-ai#1172)

* Fix bug in SWA with missing feature data (feathr-ai#1171)

* Fix bug in SWA with missing feature data

* remove unwanted code

* Address feedback and version bump

---------

Co-authored-by: Rakesh Kashyap Hanasoge Padmanabha <[email protected]>

* minor version bump due to a PR getting directly merged (feathr-ai#1173)

Co-authored-by: Anirudh Agarwal <[email protected]>

* sparksql source doc

---------

Signed-off-by: Yuqing Wei <[email protected]>
Co-authored-by: Enya-Yx <[email protected]>
Co-authored-by: Blair Chen <[email protected]>
Co-authored-by: Rizo-R <[email protected]>
Co-authored-by: rakeshkashyap123 <[email protected]>
Co-authored-by: Rakesh Kashyap Hanasoge Padmanabha <[email protected]>
Co-authored-by: rkashyap <[email protected]>
Co-authored-by: aabbasi-hbo <[email protected]>
Co-authored-by: Jinghui Mo <[email protected]>
Co-authored-by: Xiaoyong Zhu <[email protected]>
Co-authored-by: BrianXiao <[email protected]>
Co-authored-by: brianxiao <[email protected]>
Co-authored-by: Anirudh Agarwal <[email protected]>
Co-authored-by: Anirudh Agarwal <[email protected]>
Co-authored-by: Minh Nguyen <[email protected]>
Co-authored-by: Minh Nguyen <[email protected]>
Co-authored-by: Hangfei Lin <[email protected]>
Co-authored-by: nj879 <[email protected]>
Co-authored-by: Anirudh Agarwal <[email protected]>
19 people authored May 31, 2023
1 parent 5f0050d commit 480e194
Showing 1 changed file with 41 additions and 0 deletions.
41 changes: 41 additions & 0 deletions docs/how-to-guides/sparksql-source-notes.md
@@ -0,0 +1,41 @@
---
layout: default
title: Using `SparkSQLSource` as Data Source
parent: How-to Guides
---

## Use Databricks Tables as Data Source with `SparkSQLSource`

You may want to use tables as a data source in Databricks. In that case, you can use Spark SQL to define a table and let Feathr read from it.

There are two supported ways to define a SparkSQL source:
1. SparkSQL query
You can define a Spark SQL query as a data source in the Feathr job. The query should return a Spark DataFrame.

```python
from feathr.definition.source import SparkSqlSource

sql_source = SparkSqlSource(
    name="sparkSqlQuerySource",
    sql="SELECT * FROM green_tripdata_2020_04_with_index",
    event_timestamp_column="lpep_dropoff_datetime",
    timestamp_format="yyyy-MM-dd HH:mm:ss",
)

```

2. SparkSQL table
If your source is already defined as a table in Databricks, you can use its name directly as a data source in the Feathr job.

```python
from feathr.definition.source import SparkSqlSource

sql_source = SparkSqlSource(
    name="sparkSqlTableSource",
    table="green_tripdata_2020_04_with_index",
    event_timestamp_column="lpep_dropoff_datetime",
    timestamp_format="yyyy-MM-dd HH:mm:ss",
)
```

After defining the source, you can use it in the Feathr job as usual.

```python
agg_anchor = FeatureAnchor(
    name="aggregationFeatures",
    source=sql_source,
    features=agg_features,
)
```

When using a SparkSQL table as a data source, make sure the table is accessible from the same Spark session that runs the Feathr job.

Similarly, tables in Blob storage can also be used through `SparkSqlSource` when using Synapse as the Spark provider.
