@commoncrawl

Common Crawl Foundation

Common Crawl provides an archive of webpages going back to 2007.

Pinned

  1. cc-pyspark Public

    Process Common Crawl data with Python and Spark

    Python · 428 stars · 89 forks

  2. cc-crawl-statistics Public

    Statistics of Common Crawl monthly archives mined from URL index files

    Python · 177 stars · 11 forks

  3. cc-index-table Public

    Index Common Crawl archives in tabular format

    Java · 118 stars · 10 forks

  4. cc-warc-examples Public

    Forked from Smerity/cc-warc-examples

    Common Crawl WARC/WET/WAT examples and processing code for Java + Hadoop

    Java · 38 stars · 18 forks

  5. cc-citations Public

    Scientific articles using or citing Common Crawl data

    Jupyter Notebook · 20 stars · 3 forks

  6. cc-notebooks Public

    Various Jupyter notebooks about Common Crawl data

    Jupyter Notebook · 52 stars · 10 forks

Repositories

Showing 10 of 70 repositories
  • cc-host-index Public

    Tools for working with the host index

    Python · 1 star · 0 forks · 0 issues · 0 pull requests · Updated Apr 27, 2025
  • cc-downloader Public

    A polite and user-friendly downloader for Common Crawl data

    Rust · 41 stars · Apache-2.0 · 1 fork · 2 issues (1 issue needs help) · 0 pull requests · Updated Apr 22, 2025
  • web-languages Public

    Crowd-sourced lists of URLs to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ for the code.

    39 stars · 43 forks · 0 issues · 0 pull requests · Updated Apr 21, 2025
  • cc-host-index-media Public

    Media files used in the README.md of cc-host-index

    HTML · 0 stars · 0 forks · 0 issues · 0 pull requests · Updated Apr 20, 2025
  • cc-crawl-statistics Public

    Statistics of Common Crawl monthly archives mined from URL index files

    Python · 177 stars · Apache-2.0 · 11 forks · 0 issues · 0 pull requests · Updated Apr 14, 2025
  • wac2025-cc-annotator-poster Public

    A proof-of-concept pipeline for WARC annotation

    Rust · 1 star · Apache-2.0 · 0 forks · 0 issues · 0 pull requests · Updated Apr 10, 2025
  • cc-webgraph-statistics Public

    Statistics of Common Crawl monthly Web Graphs

    Python · 4 stars · Apache-2.0 · 0 forks · 0 issues · 0 pull requests · Updated Apr 10, 2025
  • wac2025-webgraph-workshop Public

    Introduction to WebGraphs - Workshop at the IIPC Web Archiving Conference 2025

    Shell · 3 stars · MIT · 0 forks · 0 issues · 0 pull requests · Updated Apr 10, 2025
  • cc-webgraph Public

    Tools to construct and process Common Crawl webgraphs

    Java · 90 stars · Apache-2.0 · 5 forks · 2 issues (1 issue needs help) · 0 pull requests · Updated Apr 4, 2025
  • arc2warc-conversion Public

    Experiences converting Common Crawl's ARC files from the 2008-2012 crawls to the WARC format

    0 stars · 0 forks · 0 issues · 0 pull requests · Updated Apr 3, 2025