Technology example queries (#3)
example queries
rviscomi authored Oct 21, 2023
1 parent 11e8574 commit 53131cf
Showing 1 changed file with 67 additions and 2 deletions.
69 changes: 67 additions & 2 deletions src/content/docs/reference/structs/technology.mdx
@@ -5,7 +5,72 @@ description: Reference docs for the technology struct

_Appears in: [`pages` table](/reference/tables/pages/)_

Technologies are detected by [Wappalyzer](https://www.wappalyzer.com/). Refer to the [Wappalyzer repository](https://github.com/wappalyzer/wappalyzer) on GitHub to request a new technology detection or to browse the source code of existing detections.
Technologies are detected by [Wappalyzer](https://www.wappalyzer.com/). Refer to HTTP Archive's fork of the [Wappalyzer repository](https://github.com/HTTPArchive/wappalyzer) on GitHub to request a new technology detection or to browse the source code of existing detections.

## Example queries

### Pages using WordPress in the top 5k

As the `technologies` field is a repeated struct, we need to use `UNNEST` to query it.

```sql
SELECT DISTINCT
  root_page
FROM
  `httparchive.all.pages`,
  UNNEST(technologies) AS t
WHERE
  date = '2023-09-01' AND
  -- rank is a bucket (1000, 5000, 10000, ...), so <= 5000 covers the top 5k
  rank <= 5000 AND
  t.technology = 'WordPress'
```
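
To try out the `UNNEST` pattern without scanning the full table, here is a minimal sketch that runs against inline sample data. The `sample_pages` CTE and its URLs are made up for illustration, and the struct is reduced to a single `technology` field.

```sql
-- Hypothetical sample data standing in for the pages table.
WITH sample_pages AS (
  SELECT
    'https://example.com/' AS root_page,
    [STRUCT('WordPress' AS technology), STRUCT('jQuery' AS technology)] AS technologies
  UNION ALL
  SELECT
    'https://example.org/' AS root_page,
    [STRUCT('Drupal' AS technology)] AS technologies
)

-- UNNEST produces one row per (page, technology) pair,
-- which lets the WHERE clause filter on individual technologies.
SELECT DISTINCT
  root_page
FROM
  sample_pages,
  UNNEST(technologies) AS t
WHERE
  t.technology = 'WordPress'
```

Only the `https://example.com/` row matches, because it is the only sample page whose `technologies` array contains a WordPress entry.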

### Top 10 CMSs

Within the `technologies` field, the `categories` field is also repeated. We can use `UNNEST` to query it as well.

It's straightforward to detect whether a single page uses a technology. To generalize that to an entire website (or origin), we consider the site to use the technology if either its root page or a secondary page uses it. To handle this in the query, we count the number of distinct `root_page` values.

```sql
SELECT
  t.technology AS cms,
  COUNT(DISTINCT root_page) AS sites
FROM
  `httparchive.all.pages`,
  UNNEST(technologies) AS t,
  UNNEST(t.categories) AS category
WHERE
  date = '2023-09-01' AND
  category = 'CMS'
GROUP BY
  cms
ORDER BY
  sites DESC
LIMIT
  10
```
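
To see how page-level and site-level counts differ, the query above could be extended along these lines. This is only a sketch: the `page` and `client` columns are assumptions that are not described in this document, so check the `pages` table reference before relying on them.

```sql
SELECT
  t.technology AS cms,
  -- Assumed column: `page` is the URL of each tested page.
  COUNT(DISTINCT page) AS pages,
  -- Each site is counted once, whichever of its pages uses the CMS.
  COUNT(DISTINCT root_page) AS sites
FROM
  `httparchive.all.pages`,
  UNNEST(technologies) AS t,
  UNNEST(t.categories) AS category
WHERE
  date = '2023-09-01' AND
  -- Assumed column: restricts the results to a single client.
  client = 'mobile' AND
  category = 'CMS'
GROUP BY
  cms
ORDER BY
  sites DESC
LIMIT
  10
```

Since every page has exactly one `root_page`, the `pages` count for a CMS should never be lower than its `sites` count.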

### Top 5 WordPress versions

There is usually only one version of a technology on a given page, but in some cases a page loads the same technology more than once. For example, multiple widgets may each load a different version of jQuery.

To account for these edge cases, the `info` field is also repeated, so we need to use `UNNEST` to query it as well.

Also note that some pages omit version numbers, so you may see empty or null values in the results.

Regular expressions can be used to parse major version numbers, for example `REGEXP_EXTRACT(version, r'^(\d+)')`. Beware of garbage values, as the version info is extracted from the source HTML. For example, you may encounter a subset of pages with a version number that hasn't even been released yet.

```sql
SELECT
  -- Return the 5 most frequent version strings with approximate counts
  APPROX_TOP_COUNT(version, 5) AS top_versions
FROM
  `httparchive.all.pages`,
  UNNEST(technologies) AS t,
  UNNEST(t.info) AS version
WHERE
  date = '2023-09-01' AND
  t.technology = 'WordPress'
```
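
The major-version parsing mentioned above could be combined with the site-counting approach from the CMS query. The following is a sketch rather than a vetted query: it assumes every `info` value for WordPress is a version string, and it drops values that do not start with a digit.

```sql
SELECT
  -- Keep only the leading digits, e.g. '6' from '6.3.1'
  REGEXP_EXTRACT(version, r'^(\d+)') AS major_version,
  COUNT(DISTINCT root_page) AS sites
FROM
  `httparchive.all.pages`,
  UNNEST(technologies) AS t,
  UNNEST(t.info) AS version
WHERE
  date = '2023-09-01' AND
  t.technology = 'WordPress' AND
  -- Skip empty, null, or non-numeric version values
  REGEXP_EXTRACT(version, r'^(\d+)') IS NOT NULL
GROUP BY
  major_version
ORDER BY
  sites DESC
LIMIT
  5
```

A site that reports more than one major version is counted once under each of them, so the `sites` column can sum to more than the total number of WordPress sites. Implausibly high major versions may still appear in the results, as the filter only checks that the value starts with a digit.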

## Schema

@@ -25,4 +90,4 @@ List of categories to which this technology belongs

### `info`

Additional metadata about the detected technology, ie version number
Additional metadata about the detected technology, ie version number
