Query costs and estimates clarified (#11)
* add query estimate note

* GCP_PROJECT variable

* note update

* description fix

* source json formatted
max-ostapenko authored Jul 21, 2024
1 parent 91b749a commit 849cfde
Showing 5 changed files with 18,013 additions and 6 deletions.
6 changes: 6 additions & 0 deletions src/content/docs/guides/guided-tour.mdx
@@ -24,3 +24,9 @@ This guide is split into multiple sections, each one focusing on different table
1. [Exploring the `httparchive.all.pages` tables](https://colab.research.google.com/github/rviscomi/har.fyi/blob/main/workbooks/exploring_httparchive-all-pages_tables.ipynb)
2. [Exploring the `httparchive.all.requests` tables](https://colab.research.google.com/github/rviscomi/har.fyi/blob/main/workbooks/exploring_httparchive-all-requests_tables.ipynb)
3. [JOINing `pages` and `requests` tables](https://colab.research.google.com/github/rviscomi/har.fyi/blob/main/workbooks/exploring_pages_and_requests_tables_joined.ipynb)

:::caution
HTTP Archive uses clustered tables. BigQuery [doesn't guarantee](https://cloud.google.com/bigquery/docs/clustered-tables#clustered_table_pricing:~:text=BigQuery%20might%20not%20be%20able%20to%20accurately%20estimate%20the%20bytes%20to%20be%20processed) the accuracy of its estimates of the bytes to be processed when querying clustered tables. For reference, the actual number of bytes processed is given in a comment in each query.

Please also read [Minimizing query costs](../minimizing-costs/) for more details on the topic.
:::
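
The estimate that the caution refers to can be requested with a dry run, which reports the bytes BigQuery expects to process without running the query. The following is a minimal sketch, assuming the `google-cloud-bigquery` client library is installed and the session is already authenticated. The helper names `estimate_bytes`, `format_gb`, and `dry_run_example`, the project ID, and the crawl date are illustrative assumptions and do not come from the workbooks.

```python
def estimate_bytes(client, sql, job_config):
    """Return the number of bytes BigQuery estimates the query would process.

    `client` is expected to behave like google.cloud.bigquery.Client and
    `job_config` like a QueryJobConfig created with dry_run=True.
    """
    job = client.query(sql, job_config=job_config)
    return job.total_bytes_processed


def format_gb(num_bytes):
    """Format a byte count in decimal gigabytes, e.g. for a query comment."""
    return f"{num_bytes / 1e9:.1f} GB"


def dry_run_example():
    """Hypothetical usage; requires credentials, so it is not called here."""
    from google.cloud import bigquery

    GCP_PROJECT = "my-billing-project"  # assumption: replace with your project ID
    client = bigquery.Client(project=GCP_PROJECT)

    # dry_run=True asks for an estimate only; no data is scanned or billed.
    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

    sql = """
    SELECT
      page
    FROM
      `httparchive.all.pages`
    WHERE
      date = '2024-06-01'  -- assumed crawl date
      AND client = 'desktop'
      AND is_root_page
    """
    print("Estimated:", format_gb(estimate_bytes(client, sql, config)))
```

Calling `dry_run_example()` in an authenticated environment prints the estimate. For clustered tables that figure may be higher than the amount actually processed, which is why the workbooks state the actual amount in a comment.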
4 changes: 4 additions & 0 deletions src/content/docs/guides/minimizing-costs.md
@@ -14,6 +14,10 @@ Table | Partitioned by | Clustered by

For example, the `httparchive.all.pages` table is [partitioned](https://cloud.google.com/bigquery/docs/partitioned-tables) by `date` and [clustered](https://cloud.google.com/bigquery/docs/clustered-tables) by the `client`, `is_root_page`, and `rank` columns, which means that queries that filter on these columns will be much faster and cheaper than queries that don't.

:::caution
BigQuery [doesn't guarantee](https://cloud.google.com/bigquery/docs/clustered-tables#clustered_table_pricing:~:text=BigQuery%20might%20not%20be%20able%20to%20accurately%20estimate%20the%20bytes%20to%20be%20processed) the accuracy of the 'Bytes processed' estimate when querying clustered tables ([issue tracker](https://issuetracker.google.com/issues/176795805)). The actual volume of data processed may be smaller than the estimate.
:::

Legacy tables like `httparchive.pages.2023_05_01_desktop`, however, do not take advantage of these optimizations and always incur the full cost of scanning the entire table.
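
Filtering on the partitioning and clustering columns described above might look like the following sketch. The helper `build_pages_query`, its parameters, the set of allowed `client` values, and the crawl date are assumptions made for illustration. Only the table name and the column names `date`, `client`, `is_root_page`, and `rank` come from the text.

```python
from datetime import date

# Assumption: the two client values used by the crawl.
ALLOWED_CLIENTS = {"desktop", "mobile"}


def build_pages_query(crawl_date, client="desktop", root_only=True, max_rank=None):
    """Build a query on httparchive.all.pages that filters on the
    partitioning column (date) and the clustering columns
    (client, is_root_page, rank)."""
    if client not in ALLOWED_CLIENTS:
        raise ValueError(f"unknown client: {client!r}")

    # The partition filter comes first: it limits the scan to a single crawl.
    filters = [
        f"date = '{crawl_date.isoformat()}'",
        f"client = '{client}'",
    ]
    if root_only:
        filters.append("is_root_page")
    if max_rank is not None:
        filters.append(f"rank <= {int(max_rank)}")

    where = "\n  AND ".join(filters)
    return (
        "SELECT\n  page\n"
        "FROM\n  `httparchive.all.pages`\n"
        f"WHERE\n  {where}"
    )


print(build_pages_query(date(2024, 6, 1), max_rank=1000))
```

The values are interpolated directly into the SQL text, which is acceptable here only because they are validated or cast first. For user-supplied input, query parameters would be the safer choice.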

:::tip
8,579 changes: 8,578 additions & 1 deletion workbooks/exploring_httparchive-all-pages_tables.ipynb

Large diffs are not rendered by default.

9,413 changes: 9,412 additions & 1 deletion workbooks/exploring_httparchive-all-requests_tables.ipynb

Large diffs are not rendered by default.

17 changes: 13 additions & 4 deletions workbooks/exploring_pages_and_requests_tables_joined.ipynb
@@ -13,6 +13,15 @@
"auth.authenticate_user()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"GCP_PROJECT = 'httparchive' # @param {type: \"string\"}"
]
},
{
"cell_type": "markdown",
"metadata": {
@@ -425,7 +434,7 @@
],
"source": [
"# This query will process 25 GB when run.\n",
"%%bigquery --project httparchive\n",
"%%bigquery --project {GCP_PROJECT}\n",
"SELECT\n",
" page,\n",
" CAST(JSON_VALUE(summary, '$.reqImg') AS INT64) AS image_requests,\n",
@@ -854,7 +863,7 @@
],
"source": [
"# This query will process 80 GB when run.\n",
"%%bigquery --project httparchive\n",
"%%bigquery --project {GCP_PROJECT}\n",
"WITH pages AS (\n",
" SELECT\n",
" page\n",
@@ -972,7 +981,7 @@
],
"source": [
"# This query will process 131 GB when run.\n",
"%%bigquery df_requests_type --project httparchive\n",
"%%bigquery df_requests_type --project {GCP_PROJECT}\n",
"SELECT\n",
" type,\n",
" COUNT(0) AS requests,\n",
@@ -1560,7 +1569,7 @@
],
"source": [
"# This query will process 47 GB when run.\n",
"%%bigquery df_tbt_scripts --project httparchive\n",
"%%bigquery df_tbt_scripts --project {GCP_PROJECT}\n",
"WITH pages AS (\n",
" SELECT\n",
" page,\n",