Welcome to Domains Project!

World's single largest Internet domains dataset

This public dataset contains a freely available, sorted list of Internet domains.

Dataset statistics

Milestones:

Domains

  • 10 Million
  • 20 Million
  • 30 Million
  • 50 Million
  • 70 Million
  • 100 Million
  • 150 Million
  • 200 Million
  • 250 Million
  • 300 Million
  • 500 Million
  • 750 Million
  • 1 Billion
  • 1.2 Billion
  • 1.5 Billion
  • 1.7 Billion

(Wasted) Internet traffic:

  • 500TB
  • 925TB
  • 1PB

Random facts:

  • More than 1 TB of Internet traffic boils down to just 3 MB of compressed data
  • 1 million domains take up just about 5 MB compressed
  • More than 1 TB of Internet traffic is needed to collect 316 million domains (roughly 3.4 GB per 1 million domains)
  • Only 1.2 GB of disk space is required to store 316 million domains in compressed form
  • A fully saturated 1 Gbit link is good for about 2 million new domains every day
  • A machine with 8 cores/16 threads and 64 GB of RAM is good for about 2 million new domains every day
  • 2 ISC BIND 9 instances (>400 MB RSS each) are required to get 2 million new domains every day
  • After reaching 9 million domains, the repository was switched to compressed files. Please use the freely available XZ tools to unpack them; a short unpacking sketch follows this list.
  • After reaching 30 million records, files were moved to /data so the repository page doesn't show its README at the very bottom.
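
A rough sketch of that unpacking step, assuming the compressed lists live under data/ and use the .xz suffix as in the current layout (the repository's own unpack.sh covers the same ground):

# Decompress every .xz file under data/, keeping the compressed originals (-k)
find data -name '*.xz' -exec xz -dkv {} \;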

Used by

CloudSEK

Using dataset

This repository employs Git LFS, so you need both git-lfs and xz to retrieve the data. The cloning procedure is as follows:

git clone https://github.com/tb0hdan/domains.git
cd domains
./unpack.sh
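
If the data files come down as small Git LFS pointer stubs instead of the real archives, explicitly fetching the LFS objects before unpacking is a reasonable fallback; this sketch assumes git-lfs and xz are already installed:

git lfs install
git lfs pull
./unpack.sh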

Getting unfiltered dataset

Raw data may be available at https://dataset.domainsproject.org, though it is recommended to use the GitHub repo.

wget -m https://dataset.domainsproject.org
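
When mirroring, it is polite to cap the download rate; --limit-rate is a standard wget option, and the value below is only an example:

wget -m --limit-rate=10m https://dataset.domainsproject.org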

Data format

After unpacking, the domain lists are plain text files (~6.8 GB at 316 million domains) with one domain per line. Sample from data/afghanistan/domain2multi-af.txt:

1tv.af
1tvnews.af
3rdeye.af
8am.af
aan.af
acaa.gov.af
acb.af
acbr.gov.af
acci.org.af
ach.af
acku.edu.af
acsf.af
adras.af
aeiti.af
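
Because each file is just one domain per line, ordinary text tools are enough to work with the data. A minimal sketch, assuming the repository has been cloned and unpacked and that all lists follow the domain2multi-*.txt naming shown above (.gov.af is used purely as an example filter):

# Count domains in a single country file
wc -l data/afghanistan/domain2multi-af.txt

# Count domains across the whole dataset
cat data/*/domain2multi-*.txt | wc -l

# Extract only Afghan government domains
grep '\.gov\.af$' data/afghanistan/domain2multi-af.txt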

Search engines and crawlers

Crawlers

Domains Project bot

Domains Project uses a crawler and DNS checks to discover new domains.

A typical user agent for the Domains Project bot looks like this:

Mozilla/5.0 (compatible; Domains Project/1.0.8; +https://domainsproject.org)

Some older versions have the URL set to the GitHub repo:

Mozilla/5.0 (compatible; Domains Project/1.0.4; +https://github.com/tb0hdan/domains)

All data in this dataset is gathered using the Scrapy and Colly frameworks.

The crawler code for this project is available at: Domains Crawler

Starting with version 1.0.7, Domains Crawler supports robots.txt and rate limiting. Please open an issue if you experience any problems, and don't forget to include your domain.
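
If you prefer to steer the bot away from parts of your site, a plain robots.txt rule should be honoured from 1.0.7 onward. A minimal sketch; the user-agent token below is an assumption based on the UA string shown above, so verify it before relying on it:

User-agent: Domains Project
Disallow: /private/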

Others

Yacy

Yacy is a great open-source search engine. Here's my post on the Yacy forum: https://searchlab.eu/t/domain-list-for-easier-search-bootstrapping/231

Additional sources

List of .FR domains from AfNIC.fr

Majestic Million

Internetstiftelsen Zone Data

DNS Census 2013

bigdatanews extract from Common Crawl (circa 2012)

Common Crawl - March/April 2020

The CAIDA UCSD IPv4 Routed /24 DNS Names Dataset - January/July 2019

Research

This dataset can be used for research. Several papers cover different related topics; links are left here for reference.

Re-registration and general statistics

Analysis of the Internet Domain Names Re-registration Market

Lexical analysis of malicious domains

Detection of malicious domains through lexical analysis

Malicious Domain Names Detection Algorithm Based on Lexical Analysis and Feature Quantification

Detecting Malicious URLs Using Lexical Analysis
