Open-source developers all over the world are working on millions of projects: writing code & documentation, fixing & submitting bugs, and so forth. GitHub Archive is a project to record the public GitHub timeline, archive it, and make it easily accessible for further analysis.
GitHub provides 18 event types, which range from new commits and fork events, to opening new tickets, commenting, and adding members to a project. The activity is aggregated in hourly archives, which you can access with any HTTP client:
Query | Command |
---|---|
Activity for March 11, 2012 at 3PM PST | wget http://data.githubarchive.org/2012-03-11-15.json.gz |
Activity for March 11, 2012 | wget http://data.githubarchive.org/2012-03-11-{0..23}.json.gz |
Activity for March 2012 | wget http://data.githubarchive.org/2012-03-{01..31}-{0..23}.json.gz |
Note: timeline data is available starting March 11, 2012.
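The brace patterns in the table rely on shell expansion. As a rough sketch, the same list of hourly URLs can be built in Ruby. The helper names `archive_urls` and `archive_urls_for_range` are hypothetical and not part of the project; the base URL and the unpadded hour are taken from the wget examples above.

```ruby
require 'date'

# Base URL taken from the wget examples above.
BASE_URL = 'http://data.githubarchive.org'

# Hypothetical helper: returns the 24 hourly archive URLs for one day.
# Hours are not zero-padded, matching the {0..23} pattern above.
def archive_urls(date)
  (0..23).map { |hour| "#{BASE_URL}/#{date.strftime('%Y-%m-%d')}-#{hour}.json.gz" }
end

# Hypothetical helper: URLs for every hour in an inclusive date range.
def archive_urls_for_range(first, last)
  (first..last).flat_map { |date| archive_urls(date) }
end

urls = archive_urls(Date.new(2012, 3, 11))
puts urls.first
puts urls.last
```

Building the list from a start date of March 11, 2012 avoids requesting hours from before the timeline data begins.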
Each archive contains a stream of JSON-encoded GitHub events (sample), which you can process in any language. Ruby example:
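require 'open-uri'
require 'zlib'
require 'yajl'
# Download one hourly archive (URI.open is provided by open-uri).
gz = URI.open('http://data.githubarchive.org/2012-03-11-12.json.gz')
js = Zlib::GzipReader.new(gz).read
# Yajl yields each JSON object in the stream to the block.
Yajl::Parser.parse(js) do |event|
  puts event
end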
Note: an example script for importing the data into a SQLite database is also available.
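Once an archive is decompressed, a common first step is to tally events by type. The sketch below uses only the standard library and runs against a small in-memory archive of made-up events, so it works offline. It assumes each event sits on its own line and carries a `type` field; if an archive packs events back to back without newlines, a streaming parser such as Yajl (as in the example above) is the safer choice. The helper name `count_event_types` is hypothetical.

```ruby
require 'json'
require 'stringio'
require 'zlib'

# Hypothetical helper: tallies events by their "type" field.
# Assumes `io` is a gzip-compressed stream with one JSON event per line.
def count_event_types(io)
  counts = Hash.new(0)
  Zlib::GzipReader.new(io).each_line do |line|
    next if line.strip.empty?
    event = JSON.parse(line)
    counts[event['type']] += 1
  end
  counts
end

# Tiny in-memory archive with made-up events, so the sketch runs offline.
sample = [
  { 'type' => 'PushEvent' },
  { 'type' => 'WatchEvent' },
  { 'type' => 'PushEvent' }
].map(&:to_json).join("\n")

archive = StringIO.new(Zlib.gzip(sample))
p count_event_types(archive)
```

To run it against real data, pass the IO object returned by `URI.open` for an archive URL in place of the `StringIO`.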
The GitHub Archive dataset is also available via Google BigQuery. The JSON data is normalized and updated every hour, allowing you to run arbitrary queries and analysis over the entire dataset in seconds. To get started, log in to the BigQuery console (bigquery.cloud.google.com) and add the project (name: "githubarchive").
An example query follows; for more, check the repository readme:
/* top 100 repos for Ruby by number of pushes */
SELECT repository_name, count(repository_name) as pushes, repository_description, repository_url
FROM [githubarchive:github.timeline]
WHERE type="PushEvent"
AND repository_language="Ruby"
AND PARSE_UTC_USEC(created_at) >= PARSE_UTC_USEC('2012-04-01 00:00:00')
GROUP BY repository_name, repository_description, repository_url
ORDER BY pushes DESC
LIMIT 100
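For readers less familiar with SQL, the filter, group, and sort steps of the query above can be sketched in Ruby over rows that have already been loaded. This is only an illustration: `top_pushed_repos` is a hypothetical helper, the rows are made-up hashes keyed like the columns in the query, the date filter is left out, and rows are grouped by repository name alone rather than by name, description, and URL. Ties are broken alphabetically, which the query does not specify.

```ruby
# Hypothetical helper: ranks repositories by number of push events.
# `rows` is an array of hashes keyed like the BigQuery columns in the query.
def top_pushed_repos(rows, language: 'Ruby', limit: 100)
  rows
    .select { |row| row['type'] == 'PushEvent' && row['repository_language'] == language }
    .group_by { |row| row['repository_name'] }
    .map { |name, events| [name, events.size] }
    .sort_by { |name, pushes| [-pushes, name] }
    .first(limit)
end

# Made-up rows for illustration only.
rows = [
  { 'type' => 'PushEvent',  'repository_language' => 'Ruby',   'repository_name' => 'alpha' },
  { 'type' => 'PushEvent',  'repository_language' => 'Ruby',   'repository_name' => 'beta' },
  { 'type' => 'PushEvent',  'repository_language' => 'Ruby',   'repository_name' => 'beta' },
  { 'type' => 'WatchEvent', 'repository_language' => 'Ruby',   'repository_name' => 'alpha' },
  { 'type' => 'PushEvent',  'repository_language' => 'Python', 'repository_name' => 'gamma' }
]

top_pushed_repos(rows).each { |name, pushes| puts "#{name}: #{pushes}" }
```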
(MIT License) - Copyright (c) 2012 Ilya Grigorik