Design Doc
The DEFense readiness CONdition (DEFCON) is an alert state used by the United States Armed Forces. It is in some ways similar to the French national security alert system, VigiPirate.
At Criteo, the alert state of production was stored in shared human memory (which offers neither ECC, transactions, nor a reliable remotely accessible API). This leads to scalability issues (a single team is supposed to be in charge of all changes in production because there is no automatic way to block them), and there is often a noticeable delay in propagating this state. Human memory is also hard for machines to access without invasive procedures.
This design doc attempts to propose a system that will:
- Have a single, reliable and clear source of truth that can be easily comprehended by both humans and machines.
- 80% of the state should be automatically derived from arbitrary data sources, but it should be easy for people to override it manually for a given period of time.
- For the project to be considered a success, part of the deployment pipeline should use the new API.
- The proposed solution should be easy to apply to shards of Criteo (teams/perimeters).
The basic idea would be to provide a highly available API and UI and to integrate them into our deployment tools and procedures. The API and UI would give the caller the current alert level for a component (for instance, the HDFS clusters, the Rivers scope, the whole production, ...) and a list of rationales and links explaining this level.
While the level would usually be set automatically based on multiple data sources, a human or machine with appropriate credentials could manually override it using the same API / UI.
It is not (yet) part of the scope of defcon to keep a history of status changes.
Defcon doesn't provide a mechanism to evaluate complex alerting rules: it only exposes, as clearly as possible, the current alert level of a data source. This alert level is the lowest (i.e. most critical) level retrieved from the data sources.
The following table is not supposed to be definitive; it is here to give an example of what could be done.
Level | Description | Effects | Triggers (Examples) |
---|---|---|---|
1 | Company-wide Code Red / Freeze | All changes frozen apart from carefully reviewed hotfixes. | Manual |
2 | Critical subsystems are degraded / company-wide Code Yellow. Business is affected. | Changes to critical subsystems frozen apart from carefully reviewed hotfixes. | Manual, business indicators |
3 | Critical subsystems are degraded, business is not affected. | Risky/breaking changes are frozen. | Number of critical incidents, perimeter SLAs |
4 | Some important subsystems are degraded. | None | Perimeter SLAs |
5 | Everything is OK. | None | |
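As an illustration, the status API could return a payload like the following for the root component (field names and values are examples only, not a final format):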
```json
{
  "status": {
    "component": "production",
    "code": "DC2",
    "start_date": "2017-04-06T10:27:00+0000"
  },
  "DC5": [],
  "DC4": [
    {
      "title": "Data imports from NY are late",
      "description": "Since 4h UTC we notice some lag on the import of DATA.",
      "link": "https://jira/browse/INCIDENT-1337",
      "source": "jira-incident",
      "component": "Rivers",
      "start_date": "2017-04-06T06:04:21+0000"
    }
  ],
  "DC3": [],
  "DC2": [
    {
      "title": "Blocker: Abnormal rate of error 500 from TLAs",
      "description": "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Pellentesque faucibus imperdiet dolor.",
      "link": "https://jira/browse/INCIDENT-4321",
      "source": "jira-incident",
      "component": "RTB",
      "start_date": "2017-04-06T10:27:00+0000"
    },
    {
      "title": "ImportDB NY is late by 6h",
      "link": "http://app.marathon.preprod/bdp-legacy",
      "source": "slab",
      "component": "bdp",
      "start_date": "2017-04-06T11:03:26+0000"
    },
    {
      "title": "Too much sun: everybody is on the rooftop",
      "source": "p.crastin",
      "component": "production",
      "start_date": "2017-04-06T14:22:26+0000"
    }
  ],
  "DC1": []
}
```
This section describes the tools that could be used as data sources to compute the current status. It also describes the potential users of the status (both humans and machines). This list is not supposed to be exhaustive; it just lists a few integrations that could be done in the first few iterations of the project.
- Humans (with overrides in case automated data sources did not pick up the signal)
- JIRA (for example, counting the number of blocker/critical incidents)
- Graphite / Prometheus (which contain SLA timeseries)
- Freeze calendar (which might end up being JIRA or Outlook ...)
- Humans: dashboards (including for ICCs, MRMs, Interrupts, ...)
- Jenkins: jobs pushing to production could be blocked while the defcon status is below a threshold (e.g. < 4), as sketched after this list
- Rundeck: same as Jenkins
- Chef: we could imagine stopping Chef when defcon is <= 2, unless it is started manually
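As a sketch of the Jenkins/Rundeck integration mentioned above (the URL, the /api/status endpoint and its payload are assumptions based on the example response shown earlier, not a final API):

```python
import sys

import requests

DEFCON_URL = "https://defcon.example.internal/api/status"  # hypothetical endpoint
THRESHOLD = 4  # block pushes while the alert level is below this value


def production_push_allowed(component="production"):
    """Return True if the current defcon level allows pushing to production."""
    try:
        response = requests.get(DEFCON_URL, params={"component": component}, timeout=5)
        response.raise_for_status()
        # The example payload encodes the level as "DC1" ... "DC5".
        level = int(response.json()["status"]["code"].lstrip("DC"))
    except Exception:
        # Fail open: defcon being unreachable must not block operators
        # (see the risk note at the end of this document).
        return True
    return level >= THRESHOLD


if __name__ == "__main__":
    # Exit non-zero so a Jenkins/Rundeck step can abort the deployment.
    sys.exit(0 if production_push_allowed() else 1)
```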
The implementation would probably use Django and Swagger, which should be easily deployable on Mesos. The project will need to support plugins in order to make it easy to open-source the core part of the code.
Plugins will run periodically and will be used to compute an alert level (and the associated rationales) from various data sources such as JIRA, Graphite, Alertmanager and others. One plugin can expose an API endpoint that allows overrides to be pushed from the UI.
While initially supporting a single component, defcon should eventually support a hierarchy of components so that alert states can bubble up to the root level.
A plugin will expose a list of rationales and the associated alert level:
```
[
  {
    "level": varchar(3),        // DC1, DC2, ..., DC5
    "id": varchar(255),         // ID allowing the plugin to deduplicate reports (?)
    "title": varchar(255),
    "source": varchar(255),     // most of the time, the plugin name
    // Optional fields:
    "component": varchar(255),  // by default the root component
    "start_date": datetime,
    "link": url,
    "description": text,
  }
]
```
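For illustration only (the plugin API is not specified in this document; the class and method names below are assumptions), a plugin could be a small class whose periodic run returns rationales in the format above:

```python
import datetime


class StaticExamplePlugin:
    """Toy plugin that always reports one DC4 rationale for its component.

    A real plugin would query JIRA, Graphite, Alertmanager, etc. instead.
    """

    name = "static-example"

    def run(self):
        # Called periodically by the defcon core; returns a list of rationales.
        return [
            {
                "level": "DC4",
                "id": "static-example-1",
                "title": "Example degradation",
                "source": self.name,
                "component": "production",
                "start_date": datetime.datetime.utcnow().isoformat(),
            }
        ]
```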
Defcon resolves the outputs of the plugins by exposing the most critical level (see the sketch after the examples below).
For instance:
- the Prometheus plugin retrieves the firing alerts and sets the defcon level according to the severity of each alert;
- the JIRA plugin retrieves the currently open BES: a Code Red sets the level to DC1, a Code Yellow to DC2, etc.
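A minimal sketch of that resolution step (names are illustrative): the overall status is the most critical, i.e. numerically lowest, level reported by any plugin, defaulting to DC5 when nothing is reported.

```python
def resolve_level(rationales):
    """Return the overall defcon code ("DC1"..."DC5") from plugin rationales."""
    # Lower numbers are more critical, so take the minimum reported level.
    levels = [int(r["level"].lstrip("DC")) for r in rationales]
    return "DC%d" % min(levels, default=5)


# Example: one DC4 and one DC2 rationale resolve to DC2.
print(resolve_level([{"level": "DC4"}, {"level": "DC2"}]))  # DC2
```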
The goal is to allow anybody to start their own instance of defcon, but also to host a default instance shared by various teams. The default instance would be hosted on Marathon in at least two datacenters.
The core of defcon will be open source and installable via pip, as will the plugins that can be open-sourced.
defcon will use the standard configuration file of its framework to configure itself. If we go with Django it will simply be a local_settings.py file (which will be static and packaged with each new version). Some variables may end up being overridable using environment variables.
Since defcon is multi-tenant by nature (you can configure a set of plugins for each component), we will probably add subdirectories to let teams configure their own components without reviews from the repository owners. We will start with automatic weekly releases (after a successful integration test). If this proves frustrating to users, we might look into a more dynamic way to deploy plugin configuration.
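As a purely hypothetical sketch of what such a local_settings.py could look like (the DEFCON_PLUGINS setting name and the plugin module paths are assumptions, not an agreed format):

```python
# local_settings.py -- static, packaged with each release; hypothetical layout.
import os

# Components and the plugins that feed their alert level.
DEFCON_PLUGINS = {
    "production": [
        {"plugin": "defcon.plugins.jira", "settings": {"project": "INCIDENT"}},
        {"plugin": "defcon.plugins.prometheus", "settings": {"url": "http://prometheus:9090"}},
    ],
    "bdp": [
        {"plugin": "defcon.plugins.static", "settings": {}},
    ],
}

# Example of a variable that could be overridden by an environment variable.
DEBUG = os.environ.get("DEFCON_DEBUG", "false") == "true"
```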
- Use https://www.statuspage.io, https://www.status.io/, or similar: we haven't found any tool that would be easily hostable in-house and would support pluggable data sources and more than two status levels.
Prometheus would be used to check that at least one healthy node is currently available and successfully answering requests.
An outage of defcon may result in delays in production releases or block some systems. The behavior of systems affected by the defcon status must be carefully thought through in order to let humans operate effectively under exceptional circumstances (i.e. Chef must not fail to start if defcon is not reachable).