Checkup is distributed, lock-free, self-hosted health checks and status pages, written in Go.
It features an elegant, minimalistic CLI and an idiomatic Go library. They are completely interoperable and their configuration is beautifully symmetric.
This tool is a work in progress. Please use it with discretion and report any bugs!
Checkup can be customized to check up on any of your sites or services at any time, from any infrastructure, using any storage provider of your choice. The status page can be customized to your liking since you can do your checks however you want.
Out of the box, Checkup currently supports:
- Checking HTTP endpoints
- Checking TCP endpoints (TLS supported)
- Storing results on S3
- Viewing results on a status page that is mobile-responsive and 100% static
There are 3 components:
- Storage: You set up storage space for the results of the checks.
- Checks: You run checks on whatever endpoints you have as often as you want.
- Status Page: You host the status page. Caddy makes this super easy. The status page downloads recent check files from storage and renders the results client-side.
$ checkup --help
Follow these instructions to get started quickly with Checkup.
You can configure Checkup entirely with a simple JSON document. We recommend you configure storage and at least one checker:
{
    "checkers": [{
        "type": "http",
        "endpoint_name": "Example HTTP",
        "endpoint_url": "http://www.example.com",
        "attempts": 5
    },
    {
        "type": "tcp",
        "endpoint_name": "Example TCP",
        "endpoint_url": "example.com:80",
        "attempts": 5
    },
    {
        "type": "tcp",
        "endpoint_name": "Example TCP with TLS enabled and a valid certificate chain",
        "endpoint_url": "example.com:443",
        "attempts": 5,
        "tls": true
    },
    {
        "type": "tcp",
        "endpoint_name": "Example TCP with TLS enabled and a self-signed certificate chain",
        "endpoint_url": "example.com:8443",
        "attempts": 5,
        "timeout": "2s",
        "tls": true,
        "tls_cafile": "certs/ca.pem"
    },
    {
        "type": "tcp",
        "endpoint_name": "Example TCP with TLS enabled and verification disabled",
        "endpoint_url": "example.com:8443",
        "attempts": 5,
        "timeout": "2s",
        "tls": true,
        "tls_skip_verify": true
    }],
    "storage": {
        "provider": "s3",
        "access_key_id": "<yours>",
        "secret_access_key": "<yours>",
        "bucket": "<yours>"
    }
}
For the complete structure definition, please see the godoc. There are many elements of checkers and storage you may wish to customize!
Save this file as config.json in your working directory.
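Before running checks, it can help to sanity-check the file. The following is a minimal sketch, not part of Checkup: the checkerConfig and config types and the lintConfig function are hypothetical stand-ins that read only the JSON keys shown above, and the sketch assumes that "http" and "tcp" are the only checker types in use.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// checkerConfig and config are hypothetical stand-ins that mirror only the
// JSON keys shown in the sample; they are not Checkup's own types.
type checkerConfig struct {
	Type string `json:"type"`
	Name string `json:"endpoint_name"`
	URL  string `json:"endpoint_url"`
}

type config struct {
	Checkers []checkerConfig `json:"checkers"`
	Storage  struct {
		Provider string `json:"provider"`
	} `json:"storage"`
}

// lintConfig parses the JSON document and returns a list of obvious problems.
// It assumes "http" and "tcp" are the only checker types in use.
func lintConfig(data []byte) ([]string, error) {
	var cfg config
	if err := json.Unmarshal(data, &cfg); err != nil {
		return nil, fmt.Errorf("parsing config: %w", err)
	}
	var problems []string
	if len(cfg.Checkers) == 0 {
		problems = append(problems, "no checkers configured")
	}
	for i, c := range cfg.Checkers {
		if c.Type != "http" && c.Type != "tcp" {
			problems = append(problems, fmt.Sprintf("checker %d (%s): unknown type %q", i, c.Name, c.Type))
		}
		if c.URL == "" {
			problems = append(problems, fmt.Sprintf("checker %d (%s): missing endpoint_url", i, c.Name))
		}
	}
	if cfg.Storage.Provider == "" {
		problems = append(problems, "no storage provider configured")
	}
	return problems, nil
}

func main() {
	sample := []byte(`{
		"checkers": [{"type": "http", "endpoint_name": "Example HTTP", "endpoint_url": "http://www.example.com"}],
		"storage": {"provider": "s3"}
	}`)
	problems, err := lintConfig(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("problems found:", len(problems))
}
```

For the authoritative list of fields, rely on the godoc rather than this sketch.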
The easiest way to provision S3 storage is to give an IAM user these two policies (keep the credentials secret):
- arn:aws:iam::aws:policy/IAMFullAccess
- arn:aws:iam::aws:policy/AmazonS3FullAccess
If you give these permissions to the same user whose credentials are in your JSON config above, then you can simply run:
$ checkup provision
and checkup will read the config file and provision S3 for you. If the user is different, you may want to use explicit provisioning instead.
This command creates a new IAM user with read-only permission to S3 and also creates a new bucket just for your check files. The credentials of the new user are printed to your screen. Make note of the Public Access Key ID and Public Access Key! You won't be able to see them again.
If you would rather not use implicit provisioning through your config.json file, use explicit provisioning instead. Export the information to environment variables and run the provisioning command:
$ export AWS_ACCESS_KEY_ID=...
$ export AWS_SECRET_ACCESS_KEY=...
$ export AWS_BUCKET_NAME=...
$ checkup provision s3
In statuspage/js, use the contents of config_template.js to fill out config.js, which is used by the status page. This is where you put the read-only S3 credentials you just generated.
Then, the status page can be served over HTTPS by running caddy -host status.mysite.com on the command line. (You can use getcaddy.com to install Caddy.)
As you perform checks, the status page will update every so often with the latest results. Only checks that are stored will appear on the status page.
You can run checks in many different ways: cron, AWS Lambda, or a time.Ticker in your own Go program, to name a few. Checks should be run on a regular basis. How often you run them depends on your requirements and on how much history you show on the status page.
For example, if you run checks every 10 minutes, showing the last 24 hours on the status page will require 144 check files to be downloaded on each page load. You can distribute your checks to help avoid localized network problems, but this multiplies the number of files by the number of nodes you run checks on, so keep that in mind.
Performing checks with the checkup command is very easy.
Just cd to the folder with your config.json from earlier, and checkup will automatically use it:
$ checkup
The vanilla checkup command runs a single check and prints the results to your screen, but does not save them to storage for your status page.
To store the results instead, use --store:
$ checkup --store
If you want Checkup to loop forever and perform checks and store them on a regular interval, use this:
$ checkup every 10m
And replace the duration with your own preference. In addition to the regular time.ParseDuration() formats, you can use shortcuts like second, minute, hour, day, or week.
You can also get some help using the -h option for any command or subcommand.
Site reliability engineers should post messages when there are incidents or other news relevant for a status page. This is also very easy:
$ checkup message --about=Example "Oops. We're trying to fix the problem. Stay tuned."
This stores a check file with your message attached to the result for "Example" which you configured in config.json earlier.
Checkup is as easy to use in a Go program as it is on the command line.
(If you'd rather do this manually, see the instructions on the wiki.)
First, create an IAM user with credentials as described in the section above.
Then go get github.com/sourcegraph/checkup and import it.
Then replace ACCESS_KEY_ID and SECRET_ACCESS_KEY below with the actual values for that user. Keep those secret. You'll also replace BUCKET_NAME with the unique bucket name to store your check files:
storage := checkup.S3{
	AccessKeyID:     "ACCESS_KEY_ID",
	SecretAccessKey: "SECRET_ACCESS_KEY",
	Bucket:          "BUCKET_NAME",
}
info, err := storage.Provision()
if err != nil {
	log.Fatal(err)
}
fmt.Println(info) // don't lose this output!
This method creates a new IAM user with read-only permission to S3 and also creates a new bucket just for your check files. The credentials of the new user are printed to your screen. Make note of the PublicAccessKeyID and PublicAccessKey! You won't be able to see them again.
First, go get github.com/sourcegraph/checkup and import it. Then configure it:
c := checkup.Checkup{
	Checkers: []checkup.Checker{
		checkup.HTTPChecker{Name: "Example (HTTP)", URL: "http://www.example.com", Attempts: 5},
		checkup.HTTPChecker{Name: "Example (HTTPS)", URL: "https://www.example.com", Attempts: 5},
		checkup.TCPChecker{Name: "Example (TCP)", URL: "www.example.com:80", Attempts: 5},
		checkup.TCPChecker{Name: "Example (TCP SSL)", URL: "www.example.com:443", Attempts: 5, TLSEnabled: true},
		checkup.TCPChecker{Name: "Example (TCP SSL, validation disabled)", URL: "www.example.com:8443", Attempts: 5, TLSEnabled: true, TLSSkipVerify: true},
	},
	Storage: checkup.S3{
		AccessKeyID:     "<yours>",
		SecretAccessKey: "<yours>",
		Bucket:          "<yours>",
		Region:          "us-east-1",
		CheckExpiry:     24 * time.Hour * 7,
	},
}
This sample configures five checks: two HTTP checks (HTTP and HTTPS) and three TCP checks (plain, TLS, and TLS with verification disabled). Each check consists of 5 attempts so as to smooth out the final results a bit. We will store results on S3. Notice the CheckExpiry value. The checkup.S3 type is also a checkup.Maintainer, which means it can maintain itself and purge any status checks older than CheckExpiry. We chose 7 days.
Then, to run checks every 10 minutes:
c.CheckAndStoreEvery(10 * time.Minute)
select {}
CheckAndStoreEvery() returns a time.Ticker that you can stop, but in this case we just want it to run forever, so we block forever using an empty select.
Simply perform a check, add the message to the corresponding result, and then store it:
results, err := c.Check()
if err != nil {
	// handle err
}
results[0].Message = "We're investigating connectivity issues."
err = c.Storage.Store(results)
if err != nil {
	// handle err
}
Of course, real status messages should be as descriptive as possible. You can use HTML in them.
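Indexing results[0] works when you know the order of your checkers. A slightly more defensive variant looks the result up by name, as in this sketch. The Result type here is a local stand-in with assumed field names (Title, Message), and attachMessage is a hypothetical helper rather than part of the library.

```go
package main

import "fmt"

// Result is a local stand-in for checkup.Result with only the fields this
// sketch needs; the field names are assumptions.
type Result struct {
	Title   string
	Message string
}

// attachMessage sets the message on the first result whose title matches.
// It is a hypothetical helper, not part of Checkup.
func attachMessage(results []Result, title, message string) error {
	for i := range results {
		if results[i].Title == title {
			results[i].Message = message
			return nil
		}
	}
	return fmt.Errorf("no result titled %q", title)
}

func main() {
	results := []Result{{Title: "Example (HTTP)"}, {Title: "Example (TCP)"}}
	if err := attachMessage(results, "Example (TCP)", "We're investigating connectivity issues."); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", results[1])
}
```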
Uh oh, having some fires? 🔥 You can create a type that implements checkup.Notifier. Checkup will invoke Notify() after every check, where you can evaluate the results and decide if and how you want to send a notification or trigger some event.
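As a rough sketch of what such a type might look like: the Result and Notifier declarations below are local stand-ins, and the Notify signature and the Down field are assumptions, so check the godoc for the real definitions. DownNotifier and its Send hook are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// Result is a local stand-in for checkup.Result; the field names are assumptions.
type Result struct {
	Title   string
	Healthy bool
	Down    bool
}

// Notifier mirrors the assumed shape of checkup.Notifier.
type Notifier interface {
	Notify(results []Result) error
}

// DownNotifier is a hypothetical notifier that reports every endpoint that
// is down through Send (for example, a function that posts to chat or email).
type DownNotifier struct {
	Send func(message string) error
}

// Notify collects the titles of all down endpoints and sends one message.
// It sends nothing when every endpoint is up.
func (n DownNotifier) Notify(results []Result) error {
	var down []string
	for _, r := range results {
		if r.Down {
			down = append(down, r.Title)
		}
	}
	if len(down) == 0 {
		return nil
	}
	return n.Send("DOWN: " + strings.Join(down, ", "))
}

func main() {
	var n Notifier = DownNotifier{Send: func(msg string) error {
		fmt.Println(msg)
		return nil
	}}
	err := n.Notify([]Result{
		{Title: "Example HTTP", Healthy: true},
		{Title: "Example TCP", Down: true},
	})
	if err != nil {
		fmt.Println("notify failed:", err)
	}
}
```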
Need to check more than HTTP? S3 too Amazony for you? You can implement your own Checker and Storage types. If it's general enough, feel free to submit a pull request so others can use it too!