Skip to content

Commit

Permalink
Added basic readme based on Pull Request details
Browse files Browse the repository at this point in the history
See GSA#94 for the full details of the
pull request.
  • Loading branch information
IanLee1521 committed Nov 29, 2017
1 parent 7e1e3df commit f57212c
Showing 1 changed file with 11 additions and 0 deletions.
11 changes: 11 additions & 0 deletions dotgov-websites/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
# .gov websites

This directory contains already trivially publicly discoverable `.gov` hostname data.

There are two new snapshots of federal `.gov` websites:

- A snapshot of a set of ~20,000 hostnames from a search in Censys, performed in late November 2017, filtered to `.gov` and `.fed.us` hostnames that are subdomains of federal `.gov` domains, and filtered to only hosts that responded to HTTP/HTTPS over the public internet. This is snapshotted here ahead of Censys changing their technical and business model on December 1st. We anticipate continuing to get data from Censys after December 1st, but also expect there to be a gap, during which we'll use this snapshot.

- A snapshot of a set of ~9,000 hostnames from Rapid7's Reverse DNS v2 dataset, fileted to `.gov` and .fed.us hostnames that are subdomains of federal `.gov` domains, and filtered to only hosts that responded to HTTP/HTTPS over the public internet. This is snapshotted here because getting this data in an automated, dynamic way is currently pretty difficult (due to Rapid7's full dataset being a single ~130GB file when unzipped). Until we've invested in a way to automate this effectively and cheaply, this snapshot can be updated whenever it's convenient and useful.

We are already providing a snapshot of ~194,000 hostnames in the End-of-Term Archive 2016 dataset, converted to CSV. These hostnames are not filtered from what the EOT provided, though since they principally derive domains from a web crawl, they should generally refer to web services and not other services. (Many are defunct, however.) This moves the EOT data into the same directory as the above snapshots, for consistency. (A copy is left in the old directory to provide some transition help for those using the old URL.)

0 comments on commit f57212c

Please sign in to comment.