github-enricher enriches GitHub data.
It accepts CSV-formatted input and emits CSV-formatted output. It's a good sidekick to GitHub's official BigQuery dataset, which redacts email addresses.
github-enricher is designed for fast incremental enrichment, so it relies on Redis and the filesystem for caching.
The REDIS_ADDR and REDIS_PASSWORD environment variables are used to configure the cache client.
Make sure your Redis instance is persisting to disk (e.g. save 60 1 in your redis.conf).
Even though repos are shallow cloned, it can take minutes to retrieve a commit from a large repo. All repos are cloned to the github-enricher folder in your OS tempdir.
Name | Description | Dependencies |
---|---|---|
repo_name | Repository name, e.g. torvalds/linux | Cannot be enriched |
ref | Git ref, e.g. master | Cannot be enriched |
email | User's email as captured from commit | repo_name, ref |
name | User's full name as captured from commit | repo_name, ref |
username | GitHub login name | repo_name, ref |
gender | Probable gender from first name (via hstove/gender) | name |
firstname | First word in name | name |
lastname | Last word in name | name |
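The firstname and lastname rules in the table are simple enough to sketch. The following is a minimal Go illustration of "first word" and "last word", assuming words are separated by whitespace; splitName is a hypothetical helper, and the handling of empty or single-word names is an assumption rather than documented behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// splitName returns the first and last whitespace-separated words of name.
// Assumptions: a single-word name yields that word for both results, and
// an empty name yields two empty strings.
func splitName(name string) (first, last string) {
	words := strings.Fields(name)
	if len(words) == 0 {
		return "", ""
	}
	return words[0], words[len(words)-1]
}

func main() {
	first, last := splitName("Chromium LUCI CQ")
	fmt.Println(first, last)
}
```

Middle words are dropped under this rule, so a name like "Chromium LUCI CQ" keeps only its outer two words.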
All unrecognized columns are passed through verbatim.
The first line of the input and output is always a header.
This example enriches email addresses from a list of commits. The name column is passed through untouched.
input.csv:
name,repo_name,commit
TensorFlower Gardener,keras-team/keras,9b14e16b8cc93abcc21355115a7a18c34d385281
Chromium LUCI CQ,chromium/chromium,c33d4dbfd275d5659cc2c79cbec75810ae4bdd37
TypeScript Bot,kitsonk/TypeScript,2d80473c781818b1712c6106fd8b1faea59d25ae
GitHub,Azure/azure-sdk-for-python,23decbe4b61626b6a37f1f23dcf18514a2f445a5
shell invocation:
$ go run github.com/ammario/github-enricher < input.csv
name,repo_name,commit,email
TensorFlower Gardener,keras-team/keras,9b14e16b8cc93abcc21355115a7a18c34d385281,[email protected]
Chromium LUCI CQ,chromium/chromium,c33d4dbfd275d5659cc2c79cbec75810ae4bdd37,[email protected]
TypeScript Bot,kitsonk/TypeScript,2d80473c781818b1712c6106fd8b1faea59d25ae,[email protected]