Add SparkSQLSource doc (feathr-ai#1102)
* add spark sql doc

Signed-off-by: Yuqing Wei <[email protected]>

* Clean up redis keys created by CI tests (feathr-ai#1100)

* Bump version to 1.0.0 (feathr-ai#1104)

* Update ARM template to use v1.0.0 tag (feathr-ai#1106)

- Update pre-built docker image from `feathrfeaturestore/feathr-registry:releases-v0.9.0` to `feathrfeaturestore/feathr-registry:releases-v1.0.0`

* Improve CI/CD workflow configuration (feathr-ai#1105)

- Update workflow names to be more descriptive
- Restrict the `pull_request_target` configuration to workflows that require secret access
- Isolate the Gradle test from the E2E test, and trigger it only for Scala changes

* Add a guide to use Feathr in MLOps v2 Solution Accelerator (feathr-ai#1103)

* Improve Quickstart and Release Guide for v1.0.0 (feathr-ai#1107)

- Update the quickstart guide to make it easier for users to get started, validate feature definitions and develop new things

* Implement an optional null filter before join (feathr-ai#1098)

* Add null filter

* Add spark flag

* filter obs data nulls

* Remove feature data null handling

* Update test

* remove additional test

---------

Co-authored-by: Rakesh Kashyap Hanasoge Padmanabha <[email protected]>
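The optional null filter described in the entry above can be sketched in plain Python; the function name and row layout here are illustrative assumptions, not Feathr's actual API:

```python
# Hypothetical sketch of an optional pre-join null filter: drop observation
# rows whose join-key values are null before joining with feature data.

def filter_null_keys(observations, key_columns):
    """Keep only rows where every join-key column is non-null."""
    return [
        row for row in observations
        if all(row.get(col) is not None for col in key_columns)
    ]

obs = [
    {"user_id": 1, "ts": "2023-01-01"},
    {"user_id": None, "ts": "2023-01-02"},  # dropped: null join key
    {"user_id": 3, "ts": None},             # kept: ts is not a join key
]
filtered = filter_null_keys(obs, key_columns=["user_id"])
```

In Spark this would typically be a `df.na.drop(subset=key_columns)` on the observation DataFrame, gated by the flag mentioned above.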

* Add support for external SWA library (feathr-ai#1093)

* working test

* Minor comment

* bump version

* documentation update

* update version

---------

Co-authored-by: rkashyap <[email protected]>
Co-authored-by: Rakesh Kashyap Hanasoge Padmanabha <[email protected]>

* Update GitHub Actions for building and pushing images (feathr-ai#1109)

This PR addresses Spark materialize job failures on machines with an ARM platform, such as Mac M1, caused by amd64 versions of Python packages and Maven jars being pre-fetched during Docker image creation. To resolve this, the Sandbox Docker GitHub action is updated to support the ARM64 platform.

- Update job name in `.github/workflows/publish-to-dockerhub.yml`
- Update `build-push-action` from v3 to v4
- Add `setup-qemu-action` and `setup-buildx-action`
- Add support for the Linux/AMD64 and Linux/ARM64 platforms
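The steps above can be sketched as a workflow fragment. The job name, image tag, and the `@v2` versions of the QEMU and Buildx actions are illustrative assumptions; only `build-push-action@v4` is named in the changelog:

```yaml
jobs:
  publish:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up QEMU                 # emulation for cross-platform builds
        uses: docker/setup-qemu-action@v2
      - name: Set up Docker Buildx        # multi-platform build driver
        uses: docker/setup-buildx-action@v2
      - name: Build and push
        uses: docker/build-push-action@v4
        with:
          platforms: linux/amd64,linux/arm64
          push: true
          tags: feathrfeaturestore/feathr-sandbox:latest  # hypothetical tag
```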

* Upgrade actions/checkout version from v2 to v3 to clean up node 12 deprecated warnings (feathr-ai#1110)

- Upgrade action checkout version from `v2` to `v3`

* add simulate time delay feature (feathr-ai#1108)

* Support sql expression in FDSExtract (feathr-ai#1112)

* Add Fake Data Generator (feathr-ai#1113)

* Add Fake Data Generator

* update

* Update data_generator.py

* Update README (feathr-ai#1119)

* Update README to reflect the latest thought

* update readme

* Allow alien value in MVEL-based derivations (feathr-ai#1120)

* Fix feathr hocon command (feathr-ai#1121)

* Honor debug.output.num.parts in debug mode (feathr-ai#1122)

* Fix "value is not a valid dict" (feathr-ai#1111) (feathr-ai#1126)

Fix "value is not a valid dict"
when accessing the sql-registry API `/projects/{project}/datasources/{datasource}`

Co-authored-by: brianxiao <[email protected]>

* Fix skipping features when derived feature contains a swa feature (feathr-ai#1128)

* Fix skipping features when derived feature contains a swa feature

* Fix comments

* Update documentation

* update version

---------

Co-authored-by: Rakesh Kashyap Hanasoge Padmanabha <[email protected]>

* Skip snowflakes test in CI (feathr-ai#1131)

* Add Feathr chat bot in the notebook(Experimental, powered by ChatGPT) (feathr-ai#1132)

* ChatGPT integration

* Delay version bump

* Fix bug when SWA hdfs and local paths without data.avro.json extensio… (feathr-ai#1130)

* fix bug when SWA hdfs and local paths without data.avro.json extensions are included for evaluation

* try

* Fix tests

* revert test file

* Add tests

* Add private classifier to variable

* fix test

* fix test

---------

Co-authored-by: Rakesh Kashyap Hanasoge Padmanabha <[email protected]>

* Revert mvel log (feathr-ai#1140)

* Revert "Allow alien value in MVEL-based derivations (feathr-ai#1120) and remove stdout statements"

This reverts commit 55290e7.

* updating rc version after last commit

---------

Co-authored-by: Anirudh Agarwal <[email protected]>

* Exclude experimental changes under feathr/chat for test coverage check (feathr-ai#1142)

* Revert "Update GitHub Actions for building and pushing images (feathr-ai#1109)" (feathr-ai#1141)

* Add try and catch for getTensorFeatures (feathr-ai#1136)

* Add try and catch for getTensorFeatures

* Attach the original exception with the throw

---------

Co-authored-by: Minh Nguyen <[email protected]>

* Enable override_time_delay (feathr-ai#1144)

* Update query_feature_list.py

* Update query_feature_list.py

* Fix incorrect merge in PR feathr-ai#1141

* #latest should pick the latest available path (feathr-ai#1146)

* #latest should pick the latest available path

* update gradle.properties

* add empty folder

---------

Co-authored-by: Rakesh Kashyap Hanasoge Padmanabha <[email protected]>
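The `#latest` fix above can be illustrated with a small sketch: rather than blindly taking the lexicographically last partition, pick the newest partition that actually exists. All names here are hypothetical, not Feathr's implementation:

```python
# Illustrative resolver for a "#latest" path placeholder: walk candidate
# partitions from newest to oldest and return the first one that exists.

def resolve_latest(partitions, exists):
    """Return the newest partition path for which exists(path) is True."""
    for path in sorted(partitions, reverse=True):
        if exists(path):
            return path
    raise FileNotFoundError("no available partition under #latest")

available = {"/data/daily/2023/05/29", "/data/daily/2023/05/30"}
candidates = [
    "/data/daily/2023/05/29",
    "/data/daily/2023/05/30",
    "/data/daily/2023/05/31",  # listed but not yet written
]
latest = resolve_latest(candidates, exists=lambda p: p in available)
```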

* Update README.md

* Update README.md

* Section spell fix (feathr-ai#1147)

* Update troubleshoot-feature-definition.md

* Add a flag for adding a default value column for missing data features (feathr-ai#1149)

* WIP: safe mode

* Add swallowedExceptionHandler

* Fix minor bug

* Address comments

* version bump

---------

Co-authored-by: Rakesh Kashyap Hanasoge Padmanabha <[email protected]>
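The flag described in the entry above can be sketched in plain Python: when a feature's source data is entirely missing and the flag is on, emit a column filled with the feature's default value instead of failing the join. Function and parameter names are hypothetical:

```python
# Hedged sketch of a "default column for missing data features" flag.

def join_with_defaults(obs_rows, feature_values, feature_name,
                       default, add_default_col=True):
    joined = []
    for row in obs_rows:
        out = dict(row)
        if feature_values is None:       # feature data entirely missing
            if add_default_col:
                out[feature_name] = default
            # with the flag off, the column is simply absent
        else:
            out[feature_name] = feature_values.get(row["key"], default)
        joined.append(out)
    return joined

rows = [{"key": "a"}, {"key": "b"}]
result = join_with_defaults(rows, None, "f_total_trips", default=0.0)
```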

* Improve debug logging (feathr-ai#1150)

* Suppressed exceptions api (feathr-ai#1152)

* Add another API for accessing doJoinObsAndFeatures which suppresses exceptions

* version bump

---------

Co-authored-by: Rakesh Kashyap Hanasoge Padmanabha <[email protected]>

* Fix doJoinObsAndFeaturesWithSuppressedExceptions API (feathr-ai#1153)

Co-authored-by: Rakesh Kashyap Hanasoge Padmanabha <[email protected]>

* minor version bump to 1.0.2-rc9 (feathr-ai#1154)

Co-authored-by: Rakesh Kashyap Hanasoge Padmanabha <[email protected]>

* made function interface consistent with underlying delegation call (feathr-ai#1156)

Co-authored-by: Anirudh Agarwal <[email protected]>

* minor version bump (feathr-ai#1157)

Co-authored-by: Anirudh Agarwal <[email protected]>

* Update feathr-snowflake-guide.md

* fix debug path limit (feathr-ai#1160)

* Add default column for missing features (feathr-ai#1158)

* Add default column for missing features

* Fix failing test

* Fix SWA sparksession issue

* address comments

* Add comment

* bump version

---------

Co-authored-by: Rakesh Kashyap Hanasoge Padmanabha <[email protected]>
Co-authored-by: Jinghui Mo <[email protected]>

* Add new multi-level aggregation framework and bucketed count distinct aggregation. (feathr-ai#1159)

The bucketed aggregation works by aggregating data at a lower-level timestamp bucket, e.g. a 5-minute bucket, then leveraging the lower-level bucket results to produce higher-level aggregation results such as 1 hour, 1 day, etc.

The supported levels are 5 minutes, 1 hour, 1 week, 1 month, and 1 year.
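The multi-level idea above can be shown with a minimal pure-Python sketch for count distinct: collect distinct values into 5-minute bucket sets, then merge bucket sets to answer coarser windows without rescanning raw events. Names are illustrative, not Feathr's API:

```python
# Bucketed count-distinct sketch: fine-grained 5-minute buckets of sets,
# rolled up on demand into coarser windows (e.g. 1 hour).
from collections import defaultdict

BUCKET_SECONDS = 5 * 60

def bucketize(events):
    """events: iterable of (timestamp_seconds, value). Returns {bucket: set}."""
    buckets = defaultdict(set)
    for ts, value in events:
        buckets[ts // BUCKET_SECONDS].add(value)
    return buckets

def count_distinct(buckets, start_ts, end_ts):
    """Merge bucket sets covering [start_ts, end_ts) and count distinct values."""
    merged = set()
    for b in range(start_ts // BUCKET_SECONDS, -(-end_ts // BUCKET_SECONDS)):
        merged |= buckets.get(b, set())
    return len(merged)

events = [(10, "u1"), (400, "u2"), (400, "u1"), (3700, "u3")]
buckets = bucketize(events)
hour_count = count_distinct(buckets, 0, 3600)  # u1 and u2 fall in the first hour
```

Note that count distinct requires keeping sets (not counts) at the bucket level, since per-bucket counts cannot be merged without double-counting; a bucketed sum, by contrast, can store plain numbers.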

* Fix bug when skipping missing feature data (feathr-ai#1161)

Co-authored-by: Rakesh Kashyap Hanasoge Padmanabha <[email protected]>

* version bump (feathr-ai#1162)

* version bump

* add logs

---------

Co-authored-by: Rakesh Kashyap Hanasoge Padmanabha <[email protected]>

* fix for handling missing feature data (feathr-ai#1163)

Co-authored-by: Anirudh Agarwal <[email protected]>

* Fix bug when skipping anchored features with missing data (feathr-ai#1164)

Co-authored-by: Rakesh Kashyap Hanasoge Padmanabha <[email protected]>

* minor version bump to consume latest fix (feathr-ai#1165)

Co-authored-by: Anirudh Agarwal <[email protected]>

* Allow alien value in MVEL-based derivations (feathr-ai#1120) (feathr-ai#1166)

Add a feature value wrapper for third-party feature value compatibility

* add bucketed_sum aggregation (feathr-ai#1168)

* Seq join bug fix (feathr-ai#1169)

* Seq join bug fix

* Address comments

* version bump

---------

Co-authored-by: Rakesh Kashyap Hanasoge Padmanabha <[email protected]>

* Fix failing tests

* Support high-dimensional tensor in derivations (feathr-ai#1172)

* Fix bug in SWA with missing feature data (feathr-ai#1171)

* Fix bug in SWA with missing feature data

* remove unwanted code

* Address feedback and version bump

---------

Co-authored-by: Rakesh Kashyap Hanasoge Padmanabha <[email protected]>

* minor version bump due to a PR getting directly merged (feathr-ai#1173)

Co-authored-by: Anirudh Agarwal <[email protected]>

* sparksql source doc

---------

Signed-off-by: Yuqing Wei <[email protected]>
Co-authored-by: Enya-Yx <[email protected]>
Co-authored-by: Blair Chen <[email protected]>
Co-authored-by: Rizo-R <[email protected]>
Co-authored-by: rakeshkashyap123 <[email protected]>
Co-authored-by: Rakesh Kashyap Hanasoge Padmanabha <[email protected]>
Co-authored-by: rkashyap <[email protected]>
Co-authored-by: aabbasi-hbo <[email protected]>
Co-authored-by: Jinghui Mo <[email protected]>
Co-authored-by: Xiaoyong Zhu <[email protected]>
Co-authored-by: BrianXiao <[email protected]>
Co-authored-by: brianxiao <[email protected]>
Co-authored-by: Anirudh Agarwal <[email protected]>
Co-authored-by: Anirudh Agarwal <[email protected]>
Co-authored-by: Minh Nguyen <[email protected]>
Co-authored-by: Minh Nguyen <[email protected]>
Co-authored-by: Hangfei Lin <[email protected]>
Co-authored-by: nj879 <[email protected]>
Co-authored-by: Anirudh Agarwal <[email protected]>
19 people authored May 31, 2023
1 parent 5f0050d commit 480e194
Showing 1 changed file with 41 additions and 0 deletions.
41 changes: 41 additions & 0 deletions docs/how-to-guides/sparksql-source-notes.md
@@ -0,0 +1,41 @@
---
layout: default
title: Using `SparkSQLSource` as Data Source
parent: How-to Guides
---

## Use Databricks Tables as Data Source with `SparkSQLSource`

You may want to use tables as a data source in Databricks. In that case, you can use Spark SQL to define a table and let Feathr read from it.

There are two supported ways to define a SparkSQL source:
1. SparkSQL query
You can define a Spark SQL query as a data source in the Feathr job. The query should return a Spark DataFrame.

```python
from feathr.definition.source import SparkSqlSource

sql_source = SparkSqlSource(
    name="sparkSqlQuerySource",
    sql="SELECT * FROM green_tripdata_2020_04_with_index",
    event_timestamp_column="lpep_dropoff_datetime",
    timestamp_format="yyyy-MM-dd HH:mm:ss",
)

```

2. SparkSQL table
If your source is already defined as a table in Databricks, you can use its name directly as a data source in the Feathr job.

```python
from feathr.definition.source import SparkSqlSource

sql_source = SparkSqlSource(
    name="sparkSqlTableSource",
    table="green_tripdata_2020_04_with_index",
    event_timestamp_column="lpep_dropoff_datetime",
    timestamp_format="yyyy-MM-dd HH:mm:ss",
)
```

After defining the source, you can use it in the Feathr job as usual.

```python
agg_anchor = FeatureAnchor(
    name="aggregationFeatures",
    source=sql_source,
    features=agg_features,
)
```

When using a SparkSQL table as a data source, make sure the table is accessible from the same Spark session that runs the Feathr job.

Similarly, tables in Blob storage can also be used through `SparkSqlSource` when using Synapse as the Spark provider.
