Fix data sample file link (feathr-ai#1042)
Signed-off-by: Jun Ki Min <[email protected]>
loomlike authored Feb 8, 2023
1 parent a1b307e commit 8df39bb
Showing 2 changed files with 13 additions and 15 deletions.
26 changes: 12 additions & 14 deletions docs/samples/feature_embedding.ipynb
@@ -1,14 +1,15 @@
{
"cells": [
{
+ "attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Using Feature Embedding with Feathr Feature Store\n",
"\n",
"Feature embedding is a way to translate a high-dimensional feature vector to a lower-dimensional vector, where the embedding can be learned and reused across models. In this example, we show how one can define feature embeddings in Feathr Feature Store via **UDF (User Defined Function).**\n",
"\n",
- "We use a sample hotel review dataset downloaded from [Azure-Samples repository](https://github.com/Azure-Samples/azure-search-python-samples/tree/main/AzureML-Custom-Skill/datasets). The original dataset can be found [here](https://www.kaggle.com/datasets/datafiniti/hotel-reviews).\n",
+ "We use a sample hotel review dataset downloaded from [Azure-Samples repository](https://github.com/Azure-Samples/azure-search-sample-data). The original dataset can be found [here](https://www.kaggle.com/datasets/datafiniti/hotel-reviews).\n",
"\n",
"For the embedding, a pre-trained [HuggingFace Transformer model](https://huggingface.co/sentence-transformers) is used to encode texts into numerical values. The text embeddings can be used for many NLP problems such as detecting fake reviews, sentiment analysis, and finding similar hotels, but building such models is out of scope and thus we don't cover that in this notebook.\n",
"\n",
@@ -212,7 +213,7 @@
},
"outputs": [],
"source": [
- "data_filepath = f\"{WORKING_DIR}/hotel_reviews_100_with_id.csv\"\n",
+ "data_filepath = f\"{WORKING_DIR}/hotel_reviews_with_id.csv\"\n",
"maybe_download(src_url=HOTEL_REVIEWS_URL, dst_filepath=data_filepath)"
]
},
@@ -372,10 +373,12 @@
"metadata": {},
"outputs": [],
"source": [
- "if client.spark_runtime == \"local\":\n",
- " data_source_path = data_filepath\n",
+ "if client.spark_runtime != \"databricks\":\n",
+ " raise ValueError(\"To run this notebook, you must use Databricks as a target Spark cluster.\\\n",
+ " To use other platforms, you'll need to install `sentence-transformers` pip package to your Spark cluster.\")\n",
+ "\n",
"# If the notebook is running on Databricks, convert to spark path format\n",
- "elif client.spark_runtime == \"databricks\" and is_databricks():\n",
+ "if is_databricks():\n",
" data_source_path = data_filepath.replace(\"/dbfs\", \"dbfs:\")\n",
"# Otherwise, upload the local file to the cloud storage (either dbfs or adls).\n",
"else:\n",
@@ -610,12 +613,7 @@
"source": [
"feature_name = \"f_reviews_text_embedding\"\n",
"feature_key = registered_features[feature_name].key[0]\n",
- "\n",
- "if client.spark_runtime == \"databricks\":\n",
- " output_filepath = f\"dbfs:/{PROJECT_NAME}/feature_embeddings.parquet\"\n",
- "else:\n",
- " raise ValueError(\"This notebook is expected to use Databricks as a target Spark cluster.\\\n",
- " To use other platforms, you'll need to install `sentence-transformers` pip package to your Spark cluster.\")"
+ "output_filepath = f\"dbfs:/{PROJECT_NAME}/feature_embeddings.parquet\""
]
},
{
@@ -808,7 +806,7 @@
"widgets": {}
},
"kernelspec": {
- "display_name": "Python 3",
+ "display_name": "feathr",
"language": "python",
"name": "python3"
},
@@ -822,11 +820,11 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
- "version": "3.10.9"
+ "version": "3.8.16"
},
"vscode": {
"interpreter": {
- "hash": "e34a1a57d2e174682770a82d94a178aa36d3ccfaa21227c5d2308e319b7ae532"
+ "hash": "ddb0e38f168d5afaa0b8ab4851ddd8c14364f1d087c15de6ff2ee5a559aec1f2"
}
}
},
2 changes: 1 addition & 1 deletion feathr_project/feathr/datasets/constants.py
@@ -39,5 +39,5 @@
# Hotel review sample datasets.
# Ref: https://www.kaggle.com/datasets/datafiniti/hotel-reviews
HOTEL_REVIEWS_URL = (
- "https://raw.github.com/Azure-Samples/azure-search-python-samples/main/AzureML-Custom-Skill/datasets/hotel_reviews_1000.csv"
+ "https://raw.githubusercontent.com/Azure-Samples/azure-search-sample-data/main/hotelreviews/HotelReviews_data.csv"
)
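
Note on the change: the notebook passes `HOTEL_REVIEWS_URL` to `maybe_download` and, on Databricks, rewrites the local `/dbfs` mount path into a `dbfs:` Spark path. The body of `maybe_download` is not part of this commit, so the following is only a minimal sketch of those two steps under stated assumptions. `download_if_missing` and `to_spark_path` are hypothetical names introduced here for illustration; they are not Feathr APIs and may differ from what Feathr actually does.

```python
import urllib.request
from pathlib import Path


def download_if_missing(src_url: str, dst_filepath: str) -> bool:
    """Download src_url to dst_filepath unless the file already exists.

    Hypothetical helper (not Feathr's maybe_download).
    Returns True if a download happened, False if the file was already there.
    """
    dst = Path(dst_filepath)
    if dst.exists():
        return False
    # Create the destination directory if it is missing.
    dst.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(src_url) as response:
        dst.write_bytes(response.read())
    return True


def to_spark_path(local_path: str) -> str:
    """Rewrite a leading /dbfs mount prefix to the dbfs: scheme.

    Hypothetical helper; paths without the prefix are returned unchanged.
    """
    prefix = "/dbfs"
    if local_path == prefix or local_path.startswith(prefix + "/"):
        return "dbfs:" + local_path[len(prefix):]
    return local_path
```

Unlike the `str.replace` call in the notebook, this sketch rewrites only a leading `/dbfs` prefix, so a `/dbfs` substring later in the path is left alone. Whether that stricter behavior is wanted is a judgment call.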
