Query costs and estimates clarified (#11)

* add query estimate note
* GCP_PROJECT variable
* note update
* description fix
* source json formatted
HTTPArchive · Jul 21, 2024 · 849cfde
1 parent 91b749a

Showing 5 changed files with 18,013 additions and 6 deletions.
diff --git a/src/content/docs/guides/guided-tour.mdx b/src/content/docs/guides/guided-tour.mdx
@@ -24,3 +24,9 @@ This guide is split into multiple sections, each one focusing on different table
 1. [Exploring the `httparchive.all.pages` tables](https://colab.research.google.com/github/rviscomi/har.fyi/blob/main/workbooks/exploring_httparchive-all-pages_tables.ipynb)
 2. [Exploring the `httparchive.all.requests` tables](https://colab.research.google.com/github/rviscomi/har.fyi/blob/main/workbooks/exploring_httparchive-all-requests_tables.ipynb)
 3. [JOINing `pages` and `requests` tables](https://colab.research.google.com/github/rviscomi/har.fyi/blob/main/workbooks/exploring_pages_and_requests_tables_joined.ipynb)
+
+:::caution
+HTTP Archive uses clustered tables. BigQuery [doesn't guarantee](https://cloud.google.com/bigquery/docs/clustered-tables#clustered_table_pricing:~:text=BigQuery%20might%20not%20be%20able%20to%20accurately%20estimate%20the%20bytes%20to%20be%20processed) accuracy of estimations for bytes to be processed when querying clustered tables. For your information the actual bytes processed amount is provided in a comment for each query.
+
+Please also read [Minimizing query costs](../minimizing-costs/) for more details on the topic.
+:::
diff --git a/src/content/docs/guides/minimizing-costs.md b/src/content/docs/guides/minimizing-costs.md
@@ -14,6 +14,10 @@ Table | Partitioned by | Clustered by
 
 For example, the `httparchive.all.pages` table is [partitioned](https://cloud.google.com/bigquery/docs/partitioned-tables) by `date` and [clustered](https://cloud.google.com/bigquery/docs/clustered-tables) by the `client`, `is_root_page`, and `rank` columns, which means that queries that filter on these columns will be much faster and cheaper than queries that don't.
 
+:::caution
+BigQuery [doesn't guarantee](https://cloud.google.com/bigquery/docs/clustered-tables#clustered_table_pricing:~:text=BigQuery%20might%20not%20be%20able%20to%20accurately%20estimate%20the%20bytes%20to%20be%20processed) accuracy of estimations for 'Bytes processed' when querying clustered tables ([Issue Link](https://issuetracker.google.com/issues/176795805)). The actual data volume may be smaller than the amount provided in the estimate.
+:::
+
 Legacy tables like `httparchive.pages.2023_05_01_desktop`, however, do not take advantage of these optimizations and always incur the full cost of scanning the entire table.
 
 :::tip

diff --git a/workbooks/exploring_httparchive-all-pages_tables.ipynb b/workbooks/exploring_httparchive-all-pages_tables.ipynb
diff --git a/workbooks/exploring_httparchive-all-requests_tables.ipynb b/workbooks/exploring_httparchive-all-requests_tables.ipynb
diff --git a/workbooks/exploring_pages_and_requests_tables_joined.ipynb b/workbooks/exploring_pages_and_requests_tables_joined.ipynb
@@ -13,6 +13,15 @@
         "auth.authenticate_user()"
       ]
     },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "GCP_PROJECT = 'httparchive'  # @param {type: \"string\"}"
+      ]
+    },
     {
       "cell_type": "markdown",
       "metadata": {
@@ -425,7 +434,7 @@
       ],
       "source": [
         "# This query will process 25 GB when run.\n",
-        "%%bigquery --project httparchive\n",
+        "%%bigquery --project {GCP_PROJECT}\n",
         "SELECT\n",
         "  page,\n",
         "  CAST(JSON_VALUE(summary, '$.reqImg') AS INT64) AS image_requests,\n",
@@ -854,7 +863,7 @@
       ],
       "source": [
         "# This query will process 80 GB when run.\n",
-        "%%bigquery --project httparchive\n",
+        "%%bigquery --project {GCP_PROJECT}\n",
         "WITH pages AS (\n",
         "  SELECT\n",
         "    page\n",
@@ -972,7 +981,7 @@
       ],
       "source": [
         "# This query will process 131 GB when run.\n",
-        "%%bigquery df_requests_type --project httparchive\n",
+        "%%bigquery df_requests_type --project {GCP_PROJECT}\n",
         "SELECT\n",
         "    type,\n",
         "    COUNT(0) AS requests,\n",
@@ -1560,7 +1569,7 @@
       ],
       "source": [
         "# This query will process 47 GB when run.\n",
-        "%%bigquery df_tbt_scripts --project httparchive\n",
+        "%%bigquery df_tbt_scripts --project {GCP_PROJECT}\n",
         "WITH pages AS (\n",
         "  SELECT\n",
         "    page,\n",