Added basic readme based on Pull Request details
See GSA#94 for the full details of the pull request.
1 parent 7e1e3df, commit f57212c
Showing 1 changed file with 11 additions and 0 deletions.
@@ -0,0 +1,11 @@
# .gov websites

This directory contains `.gov` hostname data that is already trivially discoverable by the public.

There are two new snapshots of federal `.gov` websites:

- A snapshot of ~20,000 hostnames from a Censys search performed in late November 2017, filtered to `.gov` and `.fed.us` hostnames that are subdomains of federal `.gov` domains, and further filtered to only hosts that responded to HTTP/HTTPS over the public internet. It is snapshotted here ahead of Censys changing its technical and business model on December 1st. We anticipate continuing to get data from Censys after December 1st, but also expect a gap, during which we'll use this snapshot.

- A snapshot of ~9,000 hostnames from Rapid7's Reverse DNS v2 dataset, filtered to `.gov` and `.fed.us` hostnames that are subdomains of federal `.gov` domains, and further filtered to only hosts that responded to HTTP/HTTPS over the public internet. It is snapshotted here because getting this data in an automated, dynamic way is currently difficult: Rapid7's full dataset is a single ~130GB file when unzipped. Until we've invested in a way to automate this effectively and cheaply, this snapshot can be updated whenever it's convenient and useful.

We already provide a snapshot of ~194,000 hostnames from the End-of-Term (EOT) Archive 2016 dataset, converted to CSV. These hostnames are not filtered from what the EOT provided, though since the EOT principally derives domains from a web crawl, they should generally refer to web services rather than other services. (Many are defunct, however.) This change moves the EOT data into the same directory as the above snapshots, for consistency. (A copy is left in the old directory to provide some transition help for those using the old URL.)
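
The hostname filtering described for the Censys and Rapid7 snapshots could look roughly like the sketch below. It is not the code that produced the snapshots. The function names, the `federal_domains` input set, and the sample domains are assumptions made for illustration. The sketch covers only the suffix and parent-domain checks and leaves out the step that tests whether a host responds to HTTP/HTTPS. It consumes hostnames lazily, one at a time, so it would also suit a source too large to hold in memory.

```python
from typing import Iterable, Iterator, Set

# Suffixes named in the README; everything else is dropped early.
ALLOWED_SUFFIXES = (".gov", ".fed.us")


def normalize(hostname: str) -> str:
    """Lowercase the hostname and strip whitespace and any trailing dot."""
    return hostname.strip().lower().rstrip(".")


def parent_domains(hostname: str) -> Iterator[str]:
    """Yield the hostname itself, then each successively shorter parent."""
    labels = hostname.split(".")
    for i in range(len(labels)):
        yield ".".join(labels[i:])


def filter_federal(hostnames: Iterable[str],
                   federal_domains: Set[str]) -> Iterator[str]:
    """Yield unique hostnames that sit at or under a known federal domain.

    `federal_domains` is a hypothetical, caller-supplied set of
    lowercase domain names.
    """
    seen = set()
    for raw in hostnames:
        host = normalize(raw)
        if not host or host in seen:
            continue
        if not host.endswith(ALLOWED_SUFFIXES):
            continue
        if any(parent in federal_domains for parent in parent_domains(host)):
            seen.add(host)
            yield host


# Illustrative input only; these values are not taken from the snapshots.
federal = {"gsa.gov", "fs.fed.us"}
sample = ["WWW.GSA.gov.", "example.com", "data.fs.fed.us",
          "notgsa.gov", "www.gsa.gov"]
print(list(filter_federal(sample, federal)))
# prints ['www.gsa.gov', 'data.fs.fed.us']
```

Matching on whole labels rather than a plain string suffix keeps a name such as `notgsa.gov` from being accepted as if it were under `gsa.gov`.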