Fix data sample file link (feathr-ai#1042)

Signed-off-by: Jun Ki Min <[email protected]>
ufosky-ai · Feb 8, 2023 · 8df39bb
1 parent a1b307e

Showing 2 changed files with 13 additions and 15 deletions.
diff --git a/docs/samples/feature_embedding.ipynb b/docs/samples/feature_embedding.ipynb
@@ -1,14 +1,15 @@
 {
  "cells": [
   {
+   "attachments": {},
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "# Using Feature Embedding with Feathr Feature Store\n",
     "\n",
     "Feature embedding is a way to translate a high-dimensional feature vector to a lower-dimensional vector, where the embedding can be learned and reused across models. In this example, we show how one can define feature embeddings in Feathr Feature Store via **UDF (User Defined Function).**\n",
     "\n",
-    "We use a sample hotel review dataset downloaded from [Azure-Samples repository](https://github.com/Azure-Samples/azure-search-python-samples/tree/main/AzureML-Custom-Skill/datasets). The original dataset can be found [here](https://www.kaggle.com/datasets/datafiniti/hotel-reviews).\n",
+    "We use a sample hotel review dataset downloaded from [Azure-Samples repository](https://github.com/Azure-Samples/azure-search-sample-data). The original dataset can be found [here](https://www.kaggle.com/datasets/datafiniti/hotel-reviews).\n",
     "\n",
     "For the embedding, a pre-trained [HuggingFace Transformer model](https://huggingface.co/sentence-transformers) is used to encode texts into numerical values. The text embeddings can be used for many NLP problems such as detecting fake reviews, sentiment analysis, and finding similar hotels, but building such models is out of scope and thus we don't cover that in this notebook.\n",
     "\n",
@@ -212,7 +213,7 @@
    },
    "outputs": [],
    "source": [
-    "data_filepath = f\"{WORKING_DIR}/hotel_reviews_100_with_id.csv\"\n",
+    "data_filepath = f\"{WORKING_DIR}/hotel_reviews_with_id.csv\"\n",
     "maybe_download(src_url=HOTEL_REVIEWS_URL, dst_filepath=data_filepath)"
    ]
   },
@@ -372,10 +373,12 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "if client.spark_runtime == \"local\":\n",
-    "    data_source_path = data_filepath\n",
+    "if client.spark_runtime != \"databricks\":\n",
+    "    raise ValueError(\"To run this notebook, you must use Databricks as a target Spark cluster.\\\n",
+    "        To use other platforms, you'll need to install `sentence-transformers` pip package to your Spark cluster.\")\n",
+    "\n",
     "# If the notebook is running on Databricks, convert to spark path format\n",
-    "elif client.spark_runtime == \"databricks\" and is_databricks():\n",
+    "if is_databricks():\n",
     "    data_source_path = data_filepath.replace(\"/dbfs\", \"dbfs:\")\n",
     "# Otherwise, upload the local file to the cloud storage (either dbfs or adls).\n",
     "else:\n",
@@ -610,12 +613,7 @@
    "source": [
     "feature_name = \"f_reviews_text_embedding\"\n",
     "feature_key = registered_features[feature_name].key[0]\n",
-    "\n",
-    "if client.spark_runtime == \"databricks\":\n",
-    "    output_filepath = f\"dbfs:/{PROJECT_NAME}/feature_embeddings.parquet\"\n",
-    "else:\n",
-    "    raise ValueError(\"This notebook is expected to use Databricks as a target Spark cluster.\\\n",
-    " To use other platforms, you'll need to install `sentence-transformers` pip package to your Spark cluster.\")"
+    "output_filepath = f\"dbfs:/{PROJECT_NAME}/feature_embeddings.parquet\""
    ]
   },
   {
@@ -808,7 +806,7 @@
    "widgets": {}
   },
   "kernelspec": {
-   "display_name": "Python 3",
+   "display_name": "feathr",
    "language": "python",
    "name": "python3"
   },
@@ -822,11 +820,11 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.10.9"
+   "version": "3.8.16"
   },
   "vscode": {
    "interpreter": {
-    "hash": "e34a1a57d2e174682770a82d94a178aa36d3ccfaa21227c5d2308e319b7ae532"
+    "hash": "ddb0e38f168d5afaa0b8ab4851ddd8c14364f1d087c15de6ff2ee5a559aec1f2"
    }
   }
  },

diff --git a/feathr_project/feathr/datasets/constants.py b/feathr_project/feathr/datasets/constants.py
@@ -39,5 +39,5 @@
 # Hotel review sample datasets.
 # Ref: https://www.kaggle.com/datasets/datafiniti/hotel-reviews
 HOTEL_REVIEWS_URL = (
-    "https://raw.github.com/Azure-Samples/azure-search-python-samples/main/AzureML-Custom-Skill/datasets/hotel_reviews_1000.csv"
+    "https://raw.githubusercontent.com/Azure-Samples/azure-search-sample-data/main/hotelreviews/HotelReviews_data.csv"
 )
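
Note on the reworked runtime check in `feature_embedding.ipynb`: the control flow of that cell can be sketched as a standalone function, shown below. This is only an illustration, not code from the commit. `resolve_data_source_path`, `running_on_databricks`, and `upload_fn` are hypothetical names: the flag stands in for the notebook's `is_databricks()` call, and the callback stands in for the upload branch, whose body is not visible in this diff. The sketch builds the error message from adjacent string literals; a backslash continuation inside a string literal, as used in the cell, also keeps the next line's leading indentation in the message.

```python
from typing import Callable


def resolve_data_source_path(
    spark_runtime: str,
    running_on_databricks: bool,
    data_filepath: str,
    upload_fn: Callable[[str], str],
) -> str:
    """Hypothetical helper mirroring the notebook cell's control flow."""
    # Fail fast: the notebook now requires Databricks as the Spark target.
    if spark_runtime != "databricks":
        raise ValueError(
            "To run this notebook, you must use Databricks as a target Spark cluster. "
            "To use other platforms, you'll need to install `sentence-transformers` "
            "pip package to your Spark cluster."
        )

    if running_on_databricks:
        # Running inside Databricks: convert the local mount path to Spark path format.
        return data_filepath.replace("/dbfs", "dbfs:")

    # Running elsewhere but targeting Databricks: hand the local file to the
    # caller-supplied upload callback (assumed to return the remote path).
    return upload_fn(data_filepath)


if __name__ == "__main__":
    path = resolve_data_source_path(
        spark_runtime="databricks",
        running_on_databricks=True,
        data_filepath="/dbfs/tmp/hotel_reviews_with_id.csv",
        upload_fn=lambda p: p,  # placeholder; not used on this branch
    )
    print(path)  # dbfs:/tmp/hotel_reviews_with_id.csv
```

Raising before any path handling matches the intent of the change: the `local` branch is gone, so a non-Databricks runtime is rejected at the data-source step instead of later at the output-path step.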