Datasets and schema updates (#15)
* pruned summary

* custom metrics split

* cleaned up payload

* summary pruned

* structs

* queries update

* update schemas

* update schema

* examples updated

* types cleanup

* crawl and page ids removed

* page ids removed from metadata

* underscore

* doc updates

* image updates

* 1 week crawl queue

* query formatting

* title fix

* query result updates

* routines updates

* formatting

* move image
max-ostapenko authored Nov 20, 2024
1 parent 49ddadf commit a61dc0b
Showing 37 changed files with 25,141 additions and 24,821 deletions.
6 changes: 3 additions & 3 deletions astro.config.mjs
@@ -26,11 +26,11 @@ export default defineConfig({
}
],
social: {
github: 'https://github.com/rviscomi/har.fyi',
github: 'https://github.com/HTTPArchive/har.fyi',
twitter: 'https://twitter.com/HTTPArchive',
},
editLink: {
baseUrl: 'https://github.com/rviscomi/har.fyi/edit/main/'
baseUrl: 'https://github.com/HTTPArchive/har.fyi/edit/main/'
},
sidebar: [
{
@@ -40,7 +40,7 @@ export default defineConfig({
{ label: 'Minimizing query costs', link: '/guides/minimizing-costs/' },
{ label: 'Guided tour', link: '/guides/guided-tour/' },
{ label: 'Release cycle', link: '/guides/release-cycle/' },
{ label: 'Migrate queries to `all` dataset', link: '/guides/migrating-to-all-dataset/' },
{ label: 'Migrate queries to `crawl` dataset', link: '/guides/migrating-to-crawl-dataset/' },
],
},
{
Binary file not shown.
Binary file added src/content/docs/guides/bigquery-pages.png
Binary file modified src/content/docs/guides/bigquery-query-in-a-new-tab.png
Binary file removed src/content/docs/guides/bigquery-summary_pages.png
Binary file not shown.
File renamed without changes.

Large diffs are not rendered by default.

8 changes: 4 additions & 4 deletions src/content/docs/guides/guided-tour.mdx
@@ -17,13 +17,13 @@ If you are new to BigQuery, then the [Getting Started guide](../getting-started/
Migration Guides:

- If you are looking to adapt older HTTP Archive queries, written in [Legacy SQL](https://cloud.google.com/bigquery/docs/reference/legacy-sql), then you may find this [migration guide](https://cloud.google.com/bigquery/docs/reference/standard-sql/migrating-from-legacy-sql) helpful.*
- If you've been working with the deprecated dataset `pages` or `requests`, there is a guide on [migrating your queries to the `all` dataset](/guides/migrating-to-all-dataset/).
- If you've been working with the deprecated dataset `pages` or `requests`, there is a guide on [migrating your queries to the `crawl` dataset](/guides/migrating-to-crawl-dataset/).

This guide is split into multiple sections, each one focusing on different tables in the HTTP Archive. Each section builds on top of the previous one:

1. [Exploring the `httparchive.all.pages` tables](https://colab.research.google.com/github/rviscomi/har.fyi/blob/main/workbooks/exploring_httparchive-all-pages_tables.ipynb)
2. [Exploring the `httparchive.all.requests` tables](https://colab.research.google.com/github/rviscomi/har.fyi/blob/main/workbooks/exploring_httparchive-all-requests_tables.ipynb)
3. [JOINing `pages` and `requests` tables](https://colab.research.google.com/github/rviscomi/har.fyi/blob/main/workbooks/exploring_pages_and_requests_tables_joined.ipynb)
1. [Exploring the `httparchive.crawl.pages` tables](https://colab.research.google.com/github/HTTPArchive/har.fyi/blob/main/workbooks/exploring_httparchive-all-pages_tables.ipynb)
2. [Exploring the `httparchive.crawl.requests` tables](https://colab.research.google.com/github/HTTPArchive/har.fyi/blob/main/workbooks/exploring_httparchive-all-requests_tables.ipynb)
3. [JOINing `pages` and `requests` tables](https://colab.research.google.com/github/HTTPArchive/har.fyi/blob/main/workbooks/exploring_pages_and_requests_tables_joined.ipynb)

:::caution
HTTP Archive uses clustered tables. BigQuery [doesn't guarantee](https://cloud.google.com/bigquery/docs/clustered-tables#clustered_table_pricing:~:text=BigQuery%20might%20not%20be%20able%20to%20accurately%20estimate%20the%20bytes%20to%20be%20processed) accuracy of estimations for bytes to be processed when querying clustered tables. For your information the actual bytes processed amount is provided in a comment for each query.
@@ -1,17 +1,17 @@
---
title: Migrate queries to `all` dataset
title: Migrate queries to `crawl` dataset
description: Assisting with query migration to the new dataset
---

import { Tabs, TabItem } from '@astrojs/starlight/components';

New tables have been introduced in the HTTP Archive dataset, which are more efficient and easier to use. The `all` dataset contains all the data from the previous `pages`, `requests`, and other datasets. This guide will help you migrate your queries to the new dataset.
New tables have been introduced in the HTTP Archive dataset, which are more efficient and easier to use. The `crawl` dataset contains all the data from the previous `pages`, `requests`, and other datasets. This guide will help you migrate your queries to the new dataset.

## Migrating to `all.pages`
## Migrating to `crawl.pages`

### Page data schemas comparison

previously | `all.pages`
previously | `crawl.pages`
---|---
date in a table name | [`date`](/reference/tables/pages/#date)
client as `_TABLE_SUFFIX` | [`client`](/reference/tables/pages/#client)
@@ -41,8 +41,9 @@ SELECT
type,
id
FROM `httparchive.blink_features.features`
WHERE yyyymmdd = DATE('2024-05-01')
AND client = 'desktop'
WHERE
yyyymmdd = DATE('2024-05-01') AND
client = 'desktop'
```
</TabItem>
<TabItem label="After">
@@ -52,11 +53,12 @@ SELECT
features.feature,
features.type,
features.id
FROM `httparchive.all.pages`,
FROM `httparchive.crawl.pages`,
UNNEST (features) AS features
WHERE date = '2024-06-01'
AND client = 'desktop'
AND is_root_page
WHERE
date = '2024-06-01' AND
client = 'desktop' AND
is_root_page
```
</TabItem>
</Tabs>
@@ -77,11 +79,12 @@ FROM `httparchive.lighthouse.2024_06_01_desktop`
/* This query will process 17 TB when run. */
SELECT
page,
JSON_QUERY(lighthouse, '$.audits.largest-contentful-paint.numericValue') AS LCP,
FROM `httparchive.all.pages`
WHERE date = '2024-06-01'
AND client = 'desktop'
AND is_root_page
lighthouse.audits.`largest-contentful-paint`.numericValue AS LCP,
FROM `httparchive.crawl.pages`
WHERE
date = '2024-06-01' AND
client = 'desktop' AND
is_root_page
```
</TabItem>
</Tabs>
@@ -107,10 +110,11 @@ SELECT
client,
wptid,
-- JSON with the results of the custom metrics,
JSON_QUERY(custom_metrics, '$.privacy') AS custom_metrics,
FROM `httparchive.all.pages`
WHERE date = '2022-06-01'
AND is_root_page
custom_metrics.privacy AS custom_metrics,
FROM `httparchive.crawl.pages`
WHERE
date = '2022-06-01' AND
is_root_page
```
</TabItem>
</Tabs>
@@ -125,31 +129,26 @@ SELECT
COUNT(0) pages,
ROUND(AVG(reqTotal),2) avg_requests,
FROM `httparchive.summary_pages.2024_06_01_desktop`
GROUP BY
numDomains
HAVING
pages > 1000
ORDER BY
numDomains ASC
GROUP BY numDomains
HAVING pages > 1000
ORDER BY numDomains ASC
```
</TabItem>
<TabItem label="After">
```sql
/* This query will process 110 GB when run. */
SELECT
CAST(JSON_VALUE(summary, '$.numDomains') AS INT64) AS numDomains,
INT64(summary.numDomains) AS numDomains,
COUNT(0) pages,
ROUND(AVG(CAST(JSON_VALUE(summary, '$.reqTotal') AS INT64)),2) as avg_requests,
FROM `httparchive.all.pages`
WHERE date = '2024-06-01'
AND client = 'desktop'
AND is_root_page
GROUP BY
numDomains
HAVING
pages > 1000
ORDER BY
numDomains ASC
ROUND(AVG(INT64(summary.reqTotal)),2) as avg_requests,
FROM `httparchive.crawl.pages`
WHERE
date = '2024-06-01' AND
client = 'desktop' AND
is_root_page
GROUP BY numDomains
HAVING pages > 1000
ORDER BY numDomains ASC
```
</TabItem>
</Tabs>
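
The Before/After pair above swaps a legacy table suffix such as `2024_06_01_desktop` for `date` and `client` filters on `httparchive.crawl.pages`. When migrating many queries, that mapping can be derived mechanically. The following is a minimal Python sketch, not part of the guide itself: the helper name `legacy_suffix_to_filters` is hypothetical, and it assumes the suffix always has the form `YYYY_MM_DD_client`.

```python
from datetime import date


def legacy_suffix_to_filters(table_suffix):
    """Split a legacy suffix like '2024_06_01_desktop' into crawl-table filters.

    Hypothetical helper: assumes the suffix is 'YYYY_MM_DD_client'.
    """
    year, month, day, client = table_suffix.split("_", 3)
    # Building a date object rejects malformed suffixes early (raises ValueError).
    crawl_date = date(int(year), int(month), int(day))
    return {"date": crawl_date.isoformat(), "client": client}


filters = legacy_suffix_to_filters("2024_06_01_desktop")
print(f"WHERE date = '{filters['date']}' AND client = '{filters['client']}'")
```

The legacy tables in the examples correspond to root pages, which is why the migrated queries also add `is_root_page` to the `WHERE` clause; the sketch leaves that filter to the caller.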
@@ -175,21 +174,22 @@ SELECT
technologies.categories,
technologies.technology,
technologies.info
FROM `httparchive.all.pages`,
FROM `httparchive.crawl.pages`,
UNNEST (technologies) AS technologies
WHERE date = '2024-06-01'
AND client = 'desktop'
AND is_root_page
WHERE
date = '2024-06-01' AND
client = 'desktop' AND
is_root_page
```

</TabItem>
</Tabs>

## Migrating to `all.requests`
## Migrating to `crawl.requests`

### Request data schemas comparison

previously | `all.requests`
previously | `crawl.requests`
---|---
date in a table name | [`date`](/reference/tables/requests/#date)
client as `_TABLE_SUFFIX` | [`client`](/reference/tables/requests/#client)
@@ -218,22 +218,24 @@ SELECT
JSON_VALUE(request_headers, '$.value') AS header_value,
FROM `httparchive.almanac.requests`,
UNNEST(JSON_QUERY_ARRAY(request_headers)) AS request_headers
WHERE date = '2024-06-01'
AND client = 'desktop'
AND firstHtml
WHERE
date = '2024-06-01' AND
client = 'desktop' AND
firstHtml
```
</TabItem>
<TabItem label="After">
```sql
SELECT
LOWER(request_headers.name) AS header_name,
request_headers.value AS header_value,
FROM `httparchive.all.requests`,
FROM `httparchive.crawl.requests`,
UNNEST(request_headers) AS request_headers
WHERE date = '2024-06-01'
AND client = 'desktop'
AND is_main_document
AND is_root_page
WHERE
date = '2024-06-01' AND
client = 'desktop' AND
is_main_document AND
is_root_page
```
</TabItem>
</Tabs>
@@ -256,12 +258,13 @@ FROM `httparchive.requests.2024_06_01_desktop`
SELECT
page,
url,
JSON_VALUE(summary, '$.mimeType') AS mimeType,
CAST(JSON_VALUE(summary, '$.respBodySize') AS INT64) AS respBodySize,
FROM `httparchive.all.requests`
WHERE date = '2024-06-01'
AND client = 'desktop'
AND is_root_page
STRING(summary.mimeType) AS mimeType,
INT64(summary.respBodySize) AS respBodySize,
FROM `httparchive.crawl.requests`
WHERE
date = '2024-06-01' AND
client = 'desktop' AND
is_root_page
```
</TabItem>
</Tabs>
@@ -286,10 +289,11 @@ SELECT
page,
url,
BYTE_LENGTH(response_body) AS bodySize
FROM `httparchive.all.requests`
WHERE date = '2024-06-01'
AND client = 'desktop'
AND is_root_page
FROM `httparchive.crawl.requests`
WHERE
date = '2024-06-01' AND
client = 'desktop' AND
is_root_page
```
</TabItem>
</Tabs>
@@ -313,12 +317,13 @@ ORDER BY responseSize100KB ASC
```sql
/* This query will process 10 TB when run. */
SELECT
ROUND(CAST(JSON_VALUE(summary, '$.respBodySize') AS INT64)/1024/100)*100 AS responseSize100KB,
ROUND(INT64(summary.respBodySize)/1024/100)*100 AS responseSize100KB,
COUNT(0) requests,
FROM `httparchive.all.requests`
WHERE date = '2024-06-01'
AND client = 'desktop'
AND is_root_page
FROM `httparchive.crawl.requests`
WHERE
date = '2024-06-01' AND
client = 'desktop' AND
is_root_page
GROUP BY responseSize100KB
HAVING responseSize100KB > 0
ORDER BY responseSize100KB ASC
28 changes: 14 additions & 14 deletions src/content/docs/guides/minimizing-costs.md
@@ -9,10 +9,10 @@ The HTTP Archive dataset is large and complex, and it's easy to write queries th

Table | Partitioned by | Clustered by
--- | --- | ---
`httparchive.all.pages` | `date` | `client`<br>`is_root_page`<br>`rank`
`httparchive.all.requests` | `date` | `client`<br>`is_root_page`<br>`is_main_document`<br>`type`
`httparchive.crawl.pages` | `date` | `client`<br>`is_root_page`<br>`rank`<br>`page`
`httparchive.crawl.requests` | `date` | `client`<br>`is_root_page`<br>`is_main_document`<br>`type`

For example, the `httparchive.all.pages` table is [partitioned](https://cloud.google.com/bigquery/docs/partitioned-tables) by `date` and [clustered](https://cloud.google.com/bigquery/docs/clustered-tables) by the `client`, `is_root_page`, and `rank` columns, which means that queries that filter on these columns will be much faster and cheaper than queries that don't.
For example, the `httparchive.crawl.pages` table is [partitioned](https://cloud.google.com/bigquery/docs/partitioned-tables) by `date` and [clustered](https://cloud.google.com/bigquery/docs/clustered-tables) by the `client`, `is_root_page`, `rank` and `page` columns, which means that queries that filter on these columns will be much faster and cheaper than queries that don't.

:::caution
BigQuery [doesn't guarantee](https://cloud.google.com/bigquery/docs/clustered-tables#clustered_table_pricing:~:text=BigQuery%20might%20not%20be%20able%20to%20accurately%20estimate%20the%20bytes%20to%20be%20processed) accuracy of estimations for 'Bytes processed' when querying clustered tables ([Issue Link](https://issuetracker.google.com/issues/176795805)). The actual data volume may be smaller than the amount provided in the estimate.
@@ -27,7 +27,7 @@ Filter by the top 1k websites. This is the smallest rank bucket and will result
SELECT
page
FROM
`httparchive.all.pages`
`httparchive.crawl.pages`
WHERE
date = '2023-05-01' AND
client = 'desktop' AND
@@ -44,9 +44,9 @@ For example, without `TABLESAMPLE`:

```sql
SELECT
JSON_VALUE(custom_metrics, '$.avg_dom_depth') AS dom_depth
custom_metrics.other.avg_dom_depth
FROM
`httparchive.all.pages`
`httparchive.crawl.pages`
WHERE
date = '2023-05-01' AND
client = 'desktop'
@@ -58,9 +58,9 @@ However, the same query with `TABLESAMPLE` at 0.01% is much cheaper:

```sql
SELECT
JSON_VALUE(custom_metrics, '$.avg_dom_depth') AS dom_depth
custom_metrics.other.avg_dom_depth
FROM
`httparchive.all.pages` TABLESAMPLE SYSTEM (0.01 PERCENT)
`httparchive.crawl.pages` TABLESAMPLE SYSTEM (0.01 PERCENT)
WHERE
date = '2023-05-01' AND
client = 'desktop'
@@ -77,9 +77,9 @@ For example, this query still processes 6.56 TB:

```sql
SELECT
JSON_VALUE(custom_metrics, '$.avg_dom_depth') AS dom_depth
custom_metrics.other.avg_dom_depth
FROM
`httparchive.all.pages`
`httparchive.crawl.pages`
WHERE
date = '2023-05-01' AND
client = 'desktop'
@@ -91,16 +91,16 @@ LIMIT

## Use the `sample_data` dataset

The `sample_data` dataset contains 1k and 10k subsets of the full pages and requests tables. These tables are useful for testing queries before running them on the full dataset, without the risk of incurring a large query cost.
The `sample_data` dataset contains 10k subsets of the full pages and requests tables. These tables are useful for testing queries before running them on the full dataset, without the risk of incurring a large query cost.

Table names correspond to their full-size counterparts of the form `[table]_[client]_10k` for the legacy tables or `[table]_1k` for the newer `all.pages` and `all.requests` tables. For example, to query the summary data for the subset of 10k pages, you would use the `httparchive.sample_data.summary_pages_desktop_10k` table.
Table names correspond to their full-size counterparts and take the form `[table]_10k` for the `crawl.pages` and `crawl.requests` tables. For example, to query the 10k-page subset of the pages table, you would use the `httparchive.sample_data.pages_10k` table.

## Use table previews

BigQuery allows you to preview entire rows of a table without incurring a query cost. This is useful for getting a rough idea of the data in a table before running a more expensive query.

![Preview tab on BigQuery](../../../assets/bq-preview.webp)
![Preview tab on BigQuery](./bq-preview.webp)

To access the preview, click on a table name from the workspace explorer and select the **Preview** tab.

Note that generating the preview may be slow for tables with large payloads, like `response_bodies` or `pages`. Also note that the text values are truncated by default, so you will need to expand the field to get the full value.
Note that generating the preview may be slow for these tables as they include large payloads. Also note that the text values are truncated by default, so you will need to expand the field to get the full value.
Binary file not shown.
@@ -5,7 +5,8 @@ description: Learn about the process of testing millions of web pages each month

The HTTP Archive dataset is updated each month with data from millions of web pages. This guide explores the end-to-end release cycle from sourcing URLs to publishing results to BigQuery.

_TODO: Add a diagram_
[](https://www.plantuml.com/plantuml/uml/RL5DRnCn4BtxLppr60aaEFQ0Aie9eHAQ5BXExEckXMCRUvme5tuxpdgtkrNNKYpvPTxi-xZBGadAqIb3GWVAZFlqz5l5Ybfj8tcf09qTfrVOraPsrZCeu-PBfRuWDujDBXIpav2eQuC3W8RKGHsSOoqs-8n7ZY59dicVRVUZSBeezH244KwS1cctAB4EiK7m-EWDzeMpeGl2CwHd78ENNgbHzBjFZRCB9Udc3K-Ftx8YBVP4mY_kK8-QJApHZYp9wWLp6bOBscZZ5jjoS3Rtk0-9yOiF-6c5NCQUTU-32zq5QPXLXjziR6Aw54fi-l2tS66OakWQ5_xX0yxCVuP15q94v8G3YUwlEKJgE2kCPuvYqKVrHgT1sRQ-zfm5ShqIv-8au-lk-yFR3LCfZVsQORq4PA7E-WwrH3TAO6zK_IqgcMpYTdJtR7tDYatpRNYzd9R7sKflFGYrym5VJszwp9hdJWm9GG8scruaKjAzFV5xVVtMPZFycrdcBUl5jlQQnPKA1yjtzIf7zny0)
![Release cycle diagram](./release_cycle_diagram.svg)

## Sourcing URLs

@@ -18,17 +19,13 @@ CrUX also includes origins without any distinct form factor data. HTTP Archive c
Previously, HTTP Archive would start testing each web page (the crawl) on the first of the month. Now, to be in closer alignment with the upstream CrUX dataset, HTTP Archive starts testing pages as soon as the CrUX dataset is available on the second Tuesday of each month. Crawl dates are always rounded down to the first of the month, regardless of which day they actually started. For example, the June 2023 crawl kicks off on the 13th of the month, but the dataset would be accessible on BigQuery under the date `2023-06-01`.
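
The date arithmetic in the paragraph above can be checked with a short Python sketch. It is illustrative only: the function names `second_tuesday` and `crawl_label` are hypothetical, and the sketch assumes the crawl starts exactly on the second Tuesday, which the text describes as the day the CrUX dataset becomes available.

```python
from datetime import date, timedelta


def second_tuesday(year, month):
    """Return the second Tuesday of the given month."""
    first = date(year, month, 1)
    # date.weekday(): Monday is 0, so Tuesday is 1.
    days_to_first_tuesday = (1 - first.weekday()) % 7
    return first + timedelta(days=days_to_first_tuesday + 7)


def crawl_label(start):
    """Round a crawl start date down to the first of its month."""
    return start.replace(day=1)


start = second_tuesday(2023, 6)
print(start.isoformat())               # 2023-06-13
print(crawl_label(start).isoformat())  # 2023-06-01
```

This reproduces the June 2023 example: the crawl kicks off on the 13th but is queried in BigQuery under `2023-06-01`.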

:::note
As of [May 2023](https://httparchive.org/reports/state-of-the-web?start=2023_04_01&end=2023_05_01&view=list#numUrls) there are 16.6 million mobile pages and 12.8 million desktop pages. It takes 1–2 weeks to test all of these pages, so the crawl is usually complete by the end of the month.
As of [May 2023](https://httparchive.org/reports/state-of-the-web?start=2023_04_01&end=2023_05_01&view=list#numUrls) there are 16.6 million mobile pages and 12.8 million desktop pages. It takes 1–2 weeks to test all of these pages, so the crawl is usually complete in the second half of the month.
:::

## Publishing the raw data

As each page's test results are completed, the raw data is saved to a public Google Cloud Storage bucket. Once the crawl is complete, the data is processed and published to BigQuery. The BigQuery dataset is available to the public for analysis.

There isn't currently a way to be notified when a new crawl is available to query.
As each page's test results are completed, the raw data is saved to a public Google Cloud Storage bucket. Once the crawl is complete, the data is processed and published to BigQuery. The `httparchive.crawl` dataset is available to the public for analysis.

## Generating reports

The reports on the HTTP Archive website are automatically generated as soon as the BigQuery data is available.

Auxilliary reports like the [Core Web Vitals Technology Report](https://cwvtech.report/) are generated manually soon after the data becomes available.
The reports on the [HTTP Archive website](https://httparchive.org/reports) and auxiliary ones like the [Core Web Vitals Technology Report](https://httparchive.org/reports/techreport/landing) are automatically generated as soon as the data is available in BigQuery.