Commit

updates
rviscomi committed Jun 7, 2023
1 parent b4b0786 commit 699107c
Showing 1 changed file with 6 additions and 1 deletion.
7 changes: 6 additions & 1 deletion src/content/docs/guides/minimizing-costs.md
@@ -7,7 +7,12 @@ The HTTP Archive dataset is large and complex, and it's easy to write queries th

## Use clustered tables

-For example, the `httparchive.all.pages` table is [partitioned](https://cloud.google.com/bigquery/docs/partitioned-tables) and [clustered](https://cloud.google.com/bigquery/docs/clustered-tables) on the `client`, `is_root_page`, and `rank` columns, which means that queries that filter on these columns will be much faster and cheaper than queries that don't.
+Table | Partitioned on | Clustered on
+--- | --- | ---
+`httparchive.all.pages` | `date` | `client`, `is_root_page`, `rank`
+`httparchive.all.requests` | `date` | `client`, `is_root_page`, `is_main_document`, `type`
+
+For example, the `httparchive.all.pages` table is [partitioned](https://cloud.google.com/bigquery/docs/partitioned-tables) by `date` and [clustered](https://cloud.google.com/bigquery/docs/clustered-tables) on the `client`, `is_root_page`, and `rank` columns, which means that queries that filter on these columns will be much faster and cheaper than queries that don't.

Legacy tables like `httparchive.pages.2023_05_01_desktop`, however, do not take advantage of these optimizations and always incur the full cost of scanning the entire table.
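
As a rough illustration of the point above, a query against `httparchive.all.pages` might filter on the partition column and all three clustering columns. This is only a sketch: the date, the `'desktop'` client value, and the rank threshold are assumptions inferred from the legacy table name, not values taken from this guide.

```sql
-- Sketch only: the literal values below are illustrative assumptions.
SELECT
  COUNT(0) AS pages
FROM
  `httparchive.all.pages`
WHERE
  date = '2023-05-01'   -- partition column: limits the scan to one partition
  AND client = 'desktop' -- clustering column
  AND is_root_page       -- clustering column
  AND rank <= 1000       -- clustering column
```

The `date` filter restricts the scan to a single partition, and the filters on `client`, `is_root_page`, and `rank` let BigQuery skip blocks within it. Dropping these filters would generally cause more data to be scanned and billed.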
