This fakes the injection while still going through the process of selecting targets, creating chaos pods, and simulating the disruption as much as possible. Put another way, all "read" operations (like determining which network interface to disrupt) will be executed, while all "write" operations (like creating what's needed to drop packets) won't be. Check out this example.
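For a feel of what this looks like in a spec, here is a minimal sketch; the `dryRun` field name is an assumption made for this illustration, so refer to the linked example for the authoritative format:

```yaml
spec:
  dryRun: true   # assumed field: go through targeting and chaos pod creation, but skip the actual injection
```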
A disruption can be applied either at the `pod` level or at the `node` level:

- When applied at the `pod` level, the controller will target pods and will affect only the targeted pods. Other pods running on the same node as those targeted may still be affected depending on the injected disruption.
- When applied at the `node` level, the controller will target nodes and will potentially affect everything running on the node (other processes).
Let's imagine a node with two pods running, `foo` and `bar`, and a disruption dropping all outgoing network packets:

- Applying this disruption at the `pod` level with a selector targeting the `foo` pod will result in the `foo` pod not being able to send any packets, while the `bar` pod and the other processes on the node can still send packets.
- Applying this disruption at the `node` level with a selector targeting the node itself will result in both the `foo` and `bar` pods being unable to send network packets, as well as all the other processes running on the node.
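To make the difference concrete, here is a rough sketch of the two variants; field names such as `level`, `selector`, `count` and `network.drop`, as well as the `apiVersion` and label values, are assumptions made for this illustration rather than excerpts from the text above:

```yaml
# Hypothetical pod-level variant: only the selected foo pod loses outgoing packets.
apiVersion: chaos.datadoghq.com/v1beta1   # assumed API version
kind: Disruption
metadata:
  name: drop-foo-egress
spec:
  level: pod                # target pods matching the selector
  selector:
    app: foo                # illustrative label carried by the foo pod
  count: 1
  network:
    drop: 100               # drop all outgoing packets
---
# Hypothetical node-level variant: every process on the selected node is affected.
apiVersion: chaos.datadoghq.com/v1beta1
kind: Disruption
metadata:
  name: drop-node-egress
spec:
  level: node               # target nodes matching the selector
  selector:
    kubernetes.io/hostname: node-1   # illustrative node label
  count: 1
  network:
    drop: 100
```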
The `Disruption` spec takes a `duration` field. This field represents the amount of time after the disruption's creation before all chaos pods automatically terminate and the disruption stops injecting new ones. It takes a string which must conform to golang's time.Duration string format, e.g., "45s", "15m30s", "4h30m". This time is measured from the moment the Disruption resource is created, not from when the actual failure is injected. It acts as a strict maximum for the lifetime of the Disruption, not a guarantee of how long the failure will persist.
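As a small illustration (only the `duration` field is taken from the text above; surrounding fields are omitted):

```yaml
spec:
  duration: 15m30s   # golang time.Duration string; strict maximum lifetime of the disruption
```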
If a `duration` is not specified, the disruption receives the default duration, which is configured at the controller level by setting `controller.defaultDuration` in the controller's config map; this value defaults to 1 hour.
After a disruption's duration expires, the disruption resource will live in k8s for 10 minutes by default. This can be configured by altering `controller.expiredDisruptionGCDelay` in the controller's config map.
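Assuming both settings live under the `controller` key of the config map's `config.yaml` (as their dotted names suggest), the defaults described above would look roughly like this:

```yaml
controller:
  defaultDuration: 1h              # used when a disruption does not specify spec.duration
  expiredDisruptionGCDelay: 10m    # how long an expired disruption lingers before being garbage collected
```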
The `Disruption` spec takes a `pulse` field. It activates the pulsing mode of disruptions of type `cpu_pressure`, `disk_pressure`, `dns_disruption`, `grpc_disruption` or `network_disruption`. A "pulsing" disruption is one that alternates between an active injected state and an inactive dormant state. Previously, one would need to manage the Disruption lifecycle by continually re-creating and deleting a Disruption to achieve the same effect.
It is composed of two subfields: `dormantDuration` and `activeDuration`. Both take a string which must conform to golang's time.Duration string format, e.g., "45s", "15m30s", "4h30m", and both have to be greater than 500 milliseconds.
`dormantDuration` specifies the duration during which the disruption is dormant, meaning that the disruption will not be injected during that time. `activeDuration` specifies the duration during which the disruption is active, meaning that the disruption will be injected during that time. The pulsing disruption will be injected for a duration of `activeDuration`, then be cleaned up and dormant for a duration of `dormantDuration`, and so on until the end of the disruption.
If a `pulse` is not specified, the disruption will not be pulsing.
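For instance, a sketch of a pulsing configuration reusing the duration formats quoted above (surrounding fields omitted):

```yaml
spec:
  pulse:
    activeDuration: 45s       # inject the failure for 45 seconds...
    dormantDuration: 15m30s   # ...then clean it up and stay dormant for 15m30s, repeating until the disruption ends
```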
NEW: `StaticTargeting` currently defaults to false. Please set it explicitly if you wish to activate it. Read the StaticTargeting section below for details.
The `Disruption` resource uses label selectors to target pods and nodes. The controller will retrieve all pods or nodes matching the given label selector and will randomly select a number (defined in the `count` field) of matching targets. It's possible to specify multiple label selectors, in which case the controller will select from targets that match all of them. Once applied, you can see the targeted pods/nodes by describing the `Disruption` resource.
NOTE: If you are targeting pods, the disruption must be created in the same namespace as the targeted pods.
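For instance, a sketch of the targeting portion of a spec (label keys/values are illustrative, and the exact shape of `count` should be checked against the spec reference):

```yaml
metadata:
  name: target-demo
  namespace: demo-app        # must match the namespace of the targeted pods (see note above)
spec:
  selector:                  # a target must match ALL of these labels
    app: demo-curl
    team: chaos-engineering
  count: 2                   # randomly pick 2 of the matching targets
```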
By default, disruptions constantly re-target. This means that at any given time, any target entering the selector's scope will be added to the target list and disrupted. Although it is the default, this should be used with care, as a disruption gone wrong can quickly get out of control: for example, a disruption targeting 100% of an application's pods will affect all existing pods as well as any future pods that appear after the disruption starts. As long as this 100% disruption exists, no pod will be spared.
`DynamicTargeting` behaviour design choices:

- the controller will consider as a still-alive target any pod that exists, regardless of its state
- the controller will reconcile/update its targets list on any chaos or selector pod movement (create, update, delete)
Activate `StaticTargeting` to limit the disruption to a single target selection step at the disruption's creation. It allows for more controlled disruption impact and propagation, as the targets will never change and can be compensated for if they become unusable. Its major limitation is that it cannot follow targets through deployments/rollouts.
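A minimal sketch, assuming the spec field is spelled `staticTargeting`:

```yaml
spec:
  staticTargeting: true   # assumed field: select targets once at creation time, never re-target
```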
When enabled in the configuration (`controller.enableSafeguards` field), safeguards will exclude some targets from the selection to avoid unexpected issues:

- if the disruption is applied at the node level, the node the controller is running on can't be selected
- if the disruption is applied at the pod level with a node disruption, the node the controller is running on can't be selected
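Given the field name mentioned above, enabling safeguards in the controller's config map would look roughly like this:

```yaml
controller:
  enableSafeguards: true   # exclude the controller's own node from target selection
```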
In addition to the simple `selector` field matching an exact key/value label, one can do some more advanced targeting with the `advancedSelector` field. It uses the label selector requirements mechanism, allowing you to match labels with the following operators:

- `Exists`: the label with the specified key is present, no matter the value
- `DoesNotExist`: the label with the specified key is not present
- `In`: the label with the specified key has a value strictly equal to one of the given values
- `NotIn`: the label with the specified key has a value not matching any of the given values

You can look at an example of the expected format to know how to use it.
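As an informal sketch only (the exact layout follows the standard Kubernetes label selector requirement structure, which is an assumption here, so rely on the linked example for the real format):

```yaml
spec:
  advancedSelector:
    - key: app                  # label key to evaluate
      operator: In              # match targets whose app label equals one of the values
      values:
        - demo-curl
        - demo-nginx
    - key: chaos-opt-out        # illustrative key that must NOT be present on targets
      operator: DoesNotExist
```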
How can you target a specific pod by name if it doesn't have a unique label selector you can use? The `Disruption` spec doesn't support field selectors at this time, so selecting by name isn't possible. However, you can use the `kubectl label pods` command, e.g., `kubectl label pods $podname unique-label-for-this-disruption=target-me`, to dynamically add a unique label to the pod, which you can use as your label selector in the `Disruption` spec.
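The resulting selector would then simply reuse that label:

```yaml
spec:
  selector:
    unique-label-for-this-disruption: target-me   # label added with the kubectl command above
  count: 1
```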
By default, a disruption affects all containers within the pod. You can restrict the scope of the disruption to a single container or to only some containers like this.
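A hedged sketch of such a restriction, assuming the spec exposes a `containers` list (the container names are placeholders):

```yaml
spec:
  containers:           # assumed field: limit the injection to these containers only
    - nginx
    - sidecar-proxy     # illustrative container names within the targeted pod
```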
📝 This mode has some restrictions:

- it requires a 1.15+ Kubernetes cluster
- it requires the `--handler-enabled` flag on the controller container
- it only works for network related (network and dns) disruptions
- it only works at the pod level
- it does not support containers scoping (applying a disruption to only some containers)
It can be handy to disrupt packets on pod initialization, meaning before containers are actually created and started, to test startup dependencies or init containers. You can do this in only two steps:
- redeploy your pod with the specific label `chaos.datadoghq.com/disrupt-on-init` to hold it in the initialization state
  - the chaos-controller will inject an init container named `chaos-handler` as the first init container in your pod
  - this init container is lightweight and does nothing but wait for a `SIGUSR1` signal to complete successfully
- apply your disruption with the init mode on
  - the chaos pod will inject the disruption and unstick your pod from the pending state
Note that in this mode, only pending pods with a running `chaos-handler` init container and matching your labels + the special label specified above will be targeted. The `chaos-handler` init container will automatically exit and fail if no signal is received within the specified timeout (default is 1 minute).
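Putting the two steps together, a hedged sketch could look like this; the `onInit` field name and the label value are assumptions made for this illustration:

```yaml
# Step 1: the pod template carries the special label so the pod is held in its init phase.
metadata:
  labels:
    app: demo-curl                               # illustrative selector label
    chaos.datadoghq.com/disrupt-on-init: "true"  # special label; value shown here is illustrative
---
# Step 2: the disruption is applied with the init mode on (onInit is an assumed field name).
spec:
  level: pod
  selector:
    app: demo-curl
  count: 1
  onInit: true          # assumed field enabling the init mode
  network:
    drop: 100           # illustrative network disruption
```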
When creating a disruption, you may wish to be alerted of important lifecycle warnings (disruption found no target, chaos pod is stuck on removal, target is failing, target is recovering, etc.) through the Notifier module of the chaos-controller. On each occurrence, these events will be propagated through the configured notifiers (currently `noop/console`, `slack`, `http` and `datadog` are implemented).
You can find the complete list of the events sent out by the controller here.
Any setup/config error will be logged at controller startup.
The `slack` notifier requires a Slack API token to connect to your org's Slack workspace. It will use the disruption creator's username in Kubernetes (based on your authentication method) as an email address to send a DM on Slack as 'Disruption Status Bot'. The email address used to authenticate on the Kubernetes cluster and create the disruption needs to be the same as the one used on the Slack workspace, or the notification will be ignored.
The `datadog` notifier requires the `STATSD_URL` environment variable to be set. It will send either a `Warn` event for warning Kubernetes events or a `Success` event for normal recovered Kubernetes events sent out by the controller.
The `http` notifier requires a `URL` to send the POST request to, and optionally takes either a list of headers in the configmap or the filepath of a file containing the list of headers to add to the request if needed. It will send a JSON body containing the notification information.
Note that the list of headers from the configmap takes precedence over the list of headers found in the file: if there are conflicting headers in both lists, the one from the configmap will be kept.
The list uses the following format:

```
key1:value1
key2:value2
```
Before installing the controller, please set up the following fields in chart/templates/configmap.yaml under data > config.yaml > controller:
```yaml
notifiers:
  common:
    clusterName: <cluster name> # will be n/a otherwise
  noop:
    enabled: true/false # enables the noop notifier
  slack:
    enabled: true/false # enables the slack notifier
    tokenFilepath: <slack token file path> # path to a file containing an API token for your slack workspace
  datadog:
    enabled: true/false # enables the datadog notifier
  http:
    enabled: true/false # enables the http notifier
    url: <url>
    headers: # optional, list of headers to add to the http POST request we send
      - "Authorization:Bearer token"
    headersFilepath: <headers file path> # optional, path to a file containing the list of headers to add to the http POST request we send for the http notifier
```
On top of the global notifier configuration, where notifications are sent to users privately and potentially mirrored to a common slack channel, you may want to send notifications to a dedicated channel on a per-disruption basis, for example to let your team know about an ongoing disruption.
In order to activate this capability, you will need to:

- provide the `reporting` field on a disruption spec
- add the slack bot to your slack workspace
- add the slack bot to the expected channel(s)
- configure the `chaos-controller` slack notifier with a slack token and enable it
```yaml
reporting: # optional, add custom notification for this disruption
  slackChannel: team-slack-channel # required, custom slack channel to send notifications to (can be a name or slack channel ID)
  purpose: | # required, purpose/contextual information explaining the reasons for launching the disruption, can contain markdown formatting
    *full network drop*: _aims to validate retry capabilities of demo-curl_. Contact #team-test for more information.
  notificationTypeLevel: Info # optional, minimal notification type to be notified of, default is Success, available options are Info, Success, Warning, Error
```
Please take a look at the documentation for the different disruptions linked in the table of contents for more information about what they can do and how to use them.
Here is a full example of the disruption resource with comments. You can also have a look at the following use cases with examples of disruptions you can adapt and apply as you wish:
- Node disruptions
- Pod disruptions
- Network disruptions
  - I want to drop packets going out from my pods
  - I want to corrupt packets going out from my pods
  - I want to add network latency to packets going out from my pods
  - I want to restrict the outgoing bandwidth of my pods
  - I want to disrupt packets going to a specific host, port or Kubernetes service
  - I want to disrupt packets going to a specific cloud managed service
- CPU pressure
- Disk pressure
- DNS resolution mocking
- Network and DNS disruptions