Merge pull request kata-containers#543 from jcvenegas/SandboxCgroupOnly-docs

docs: Add documentation about host cgroup management
jodh-intel authored Sep 9, 2019
2 parents 44f67f7 + 2255b36 commit 89120e8
Showing 5 changed files with 220 additions and 92 deletions.
2 changes: 1 addition & 1 deletion Limitations.md
@@ -138,7 +138,7 @@ these commands is potentially challenging.
See issue https://github.com/clearcontainers/runtime/issues/341 and [the constraints challenge](#the-constraints-challenge) for more information.

For CPUs resource management see
-[CPU constraints](design/cpu-constraints.md).
+[CPU constraints](design/vcpu-handling.md).

### docker run and shared memory

2 changes: 2 additions & 0 deletions design/README.md
@@ -6,3 +6,5 @@ Kata Containers design documents:
- [API Design of Kata Containers](kata-api-design.md)
- [Design requirements for Kata Containers](kata-design-requirements.md)
- [VSocks](VSocks.md)
+- [VCPU handling](vcpu-handling.md)
+- [Host cgroups](host-cgroups.md)
2 changes: 1 addition & 1 deletion design/VSocks.md
@@ -130,5 +130,5 @@ the containers are removed automatically.
[2]: https://github.com/kata-containers/proxy
[3]: https://github.com/hashicorp/yamux
[4]: https://wiki.qemu.org/Features/VirtioVsock
-[5]: ./cpu-constraints.md#virtual-cpus-and-kubernetes-pods
+[5]: ./vcpu-handling.md#virtual-cpus-and-kubernetes-pods
[6]: https://github.com/kata-containers/shim
208 changes: 208 additions & 0 deletions design/host-cgroups.md
@@ -0,0 +1,208 @@
- [Host cgroup management](#host-cgroup-management)
  - [Introduction](#introduction)
  - [`SandboxCgroupOnly` enabled](#sandboxcgrouponly-enabled)
    - [What does Kata do in this configuration?](#what-does-kata-do-in-this-configuration)
    - [Why create a Kata-cgroup under the parent cgroup?](#why-create-a-kata-cgroup-under-the-parent-cgroup)
    - [Improvements](#improvements)
  - [`SandboxCgroupOnly` disabled (default, legacy)](#sandboxcgrouponly-disabled-default-legacy)
    - [What does this method do?](#what-does-this-method-do)
      - [Impact](#impact)
  - [Summary](#summary)

# Host cgroup management

## Introduction

In Kata Containers, workloads run in a virtual machine that is managed by a virtual
machine monitor (VMM) running on the host. As a result, Kata Containers run over two layers of cgroups. The
first layer is in the guest where the workload is placed, while the second layer is on the host where the
VMM and associated threads are running.

The OCI [runtime specification][linux-config] provides guidance on where the container cgroups should be placed:

> [`cgroupsPath`][cgroupspath]: (string, OPTIONAL) path to the cgroups. It can be used to either control the cgroups
> hierarchy for containers or to run a new process in an existing container.

cgroups are hierarchical, and this can be seen with the following pod example:

- Pod 1: `cgroupsPath=/kubepods/pod1`
  - Container 1: `cgroupsPath=/kubepods/pod1/container1`
  - Container 2: `cgroupsPath=/kubepods/pod1/container2`

- Pod 2: `cgroupsPath=/kubepods/pod2`
  - Container 1: `cgroupsPath=/kubepods/pod2/container1`
  - Container 2: `cgroupsPath=/kubepods/pod2/container2`

The cgroup under which the pod is placed is managed by the upper-level
orchestrator. In the case of Kubernetes, the pod-cgroup is created by the
Kubelet, while the container cgroups are handled by the runtime. The Kubelet
sizes the pod-cgroup based on the container resource requirements.

Kata Containers introduces a non-negligible overhead for running a sandbox (pod). Given this, two scenarios are possible:
1. The upper-layer orchestrator takes the overhead of running a sandbox into account when sizing the pod-cgroup, or
2. Kata Containers does not fully constrain the VMM and associated processes, instead placing a subset of them outside of the pod-cgroup.

Kata Containers provides two options for how cgroups are handled on the host. Selection of these options is done through
the `SandboxCgroupOnly` flag within the Kata Containers [configuration](https://github.com/kata-containers/runtime#configuration)
file.
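
In the runtime's TOML configuration file, this is expected to look roughly as
follows; the key name is assumed to mirror the flag, so verify it against the
`configuration.toml` shipped with your installation:

```toml
# Hypothetical excerpt from a Kata Containers configuration.toml.
[runtime]
# When enabled, the runtime only creates and joins one sandbox cgroup under
# the parent cgroup provided by the caller (see below).
sandbox_cgroup_only=true
```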

## `SandboxCgroupOnly` enabled

With `SandboxCgroupOnly` enabled, it is expected that the parent cgroup is sized to take the overhead of running
a sandbox into account. This is ideal, as all the applicable Kata Containers components can be placed within the
given cgroup-path.

In the context of Kubernetes, Kubelet will size the pod-cgroup to take the overhead of running a Kata-based sandbox
into account. This will be feasible in the 1.16 Kubernetes release through the `PodOverhead` feature.
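
A sketch of how this can be expressed with the Kubernetes `RuntimeClass`
`PodOverhead` support; the handler name and overhead sizes are illustrative,
and the `PodOverhead` feature gate must be enabled:

```yaml
apiVersion: node.k8s.io/v1beta1
kind: RuntimeClass
metadata:
  name: kata-containers   # illustrative name
handler: kata             # illustrative handler configured for Kata
overhead:
  podFixed:
    memory: "160Mi"       # added to the pod-cgroup size for VMM/agent overhead
    cpu: "250m"
```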

```
+----------------------------------------------------------+
| +---------------------------------------------------+ |
| | +---------------------------------------------+ | |
| | | +--------------------------------------+ | | |
| | | | kata-shimv2, VMM and threads: | | | |
| | | | (VMM, IO-threads, vCPU threads, etc)| | | |
| | | | | | | |
| | | | kata-sandbox-<id> | | | |
| | | +--------------------------------------+ | | |
| | | | | |
| | |Pod 1 | | |
| | +---------------------------------------------+ | |
| | | |
| | +---------------------------------------------+ | |
| | | +--------------------------------------+ | | |
| | | | kata-shimv2, VMM and threads: | | | |
| | | | (VMM, IO-threads, vCPU threads, etc)| | | |
| | | | | | | |
| | | | kata-sandbox-<id> | | | |
| | | +--------------------------------------+ | | |
| | |Pod 2 | | |
| | +---------------------------------------------+ | |
| |kubepods | |
| +---------------------------------------------------+ |
| |
|Node |
+----------------------------------------------------------+
```

### What does Kata do in this configuration?
1. Given a `PodSandbox` container creation, let:

```
podCgroup=Parent(container.CgroupsPath)
KataSandboxCgroup=<podCgroup>/kata-sandbox-<PodSandboxID>
```

2. Create the cgroup, `KataSandboxCgroup`

3. Join the `KataSandboxCgroup`

Any process created by the runtime is placed in `KataSandboxCgroup`.
The runtime does not limit this cgroup on the host, but the caller is free
to set the proper limits for the `podCgroup`.

In the example above, the pod cgroups are `/kubepods/pod1` and `/kubepods/pod2`.
Kata creates the unrestricted sandbox cgroup under the pod cgroup.
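
The runtime implements this internally, but a minimal shell sketch of the
three steps, assuming cgroup v1 and a hypothetical sandbox ID, looks like
this:

```sh
# 1. Derive the paths: podCgroup=Parent(container.CgroupsPath).
podCgroup=/sys/fs/cgroup/cpu/kubepods/pod1
sandboxCgroup="$podCgroup/kata-sandbox-4ae5b5"   # kata-sandbox-<PodSandboxID>

# 2. Create the sandbox cgroup.
mkdir -p "$sandboxCgroup"

# 3. Join it: processes spawned afterwards (VMM, I/O and vCPU threads)
#    inherit the cgroup automatically.
echo $$ > "$sandboxCgroup/cgroup.procs"
```

No limits are written here; per the text above, sizing is left to the caller
that owns `podCgroup`.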

### Why create a Kata-cgroup under the parent cgroup?

`Docker` does not have a notion of pods and does not create a cgroup directory
to place a particular container in (i.e., all containers sit in a path like
`/docker/container-id`). To simplify the implementation and continue to support
`Docker`, Kata Containers creates the sandbox cgroup in the case of Kubernetes,
or a container cgroup in the case of Docker.

### Improvements

- Get statistics about pod resources

If the Kata caller wants to know the resource usage on the host, it can get
statistics from the pod cgroup. All cgroup stats in the hierarchy will include
the Kata overhead. This makes it possible to gather usage statistics at the
pod level and at the container level (see the sketch after this list).

- Better host resource isolation

Because the Kata runtime places all the Kata processes in the pod cgroup,
the resource limits that the caller applies to the pod cgroup affect all
host-side processes that belong to the Kata sandbox. This improves isolation
on the host, preventing Kata from becoming a noisy neighbor.
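
As a sketch of the statistics point above, host-side usage for a whole pod
(Kata overhead included) can be read directly from the pod cgroup. The paths
below are illustrative cgroup v1 locations, not output of the runtime itself:

```sh
# Total CPU time consumed by the pod, including VMM and Kata threads (ns).
cat /sys/fs/cgroup/cpu,cpuacct/kubepods/pod1/cpuacct.usage
# Current memory usage of the pod, including VMM and Kata threads (bytes).
cat /sys/fs/cgroup/memory/kubepods/pod1/memory.usage_in_bytes
```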

## `SandboxCgroupOnly` disabled (default, legacy)

If the cgroup provided to Kata is not sized appropriately, fully constraining
the Kata components introduces instability, and the user workload sees only a
subset of the resources it requested. For this reason, the default handling in
Kata Containers is to not fully constrain the VMM and Kata components on the
host.

```
+----------------------------------------------------------+
| +---------------------------------------------------+ |
| | +---------------------------------------------+ | |
| | | +--------------------------------------+ | | |
| | | |Container 1 |-|Container 2 | | | |
| | | | |-| | | | |
| | | | Shim+container1 |-| Shim+container2 | | | |
| | | +--------------------------------------+ | | |
| | | | | |
| | |Pod 1 | | |
| | +---------------------------------------------+ | |
| | | |
| | +---------------------------------------------+ | |
| | | +--------------------------------------+ | | |
| | | |Container 1 |-|Container 2 | | | |
| | | | |-| | | | |
| | | | Shim+container1 |-| Shim+container2 | | | |
| | | +--------------------------------------+ | | |
| | | | | |
| | |Pod 2 | | |
| | +---------------------------------------------+ | |
| |kubepods | |
| +---------------------------------------------------+ |
| +---------------------------------------------------+ |
| | Hypervisor | |
| |Kata | |
| +---------------------------------------------------+ |
| |
|Node |
+----------------------------------------------------------+
```

### What does this method do?

1. Given a container creation, let `containerCgroupHost=container.CgroupsPath`
1. Rename the `containerCgroupHost` path to add a `kata_` prefix
1. Let `PodCgroupPath=PodSandboxContainerCgroup`, where `PodSandboxContainerCgroup` is the cgroup of the container of type `PodSandbox`
1. Limit `PodCgroupPath` with the sum of all the container limits in the sandbox
1. Move only the vCPU threads of the hypervisor to `PodCgroupPath`
1. For each container, move its `kata-shim` to its own `containerCgroupHost`
1. Move the hypervisor and applicable threads to the memory cgroup `/kata` (a shell sketch of these steps appears after the note below)

_Note_: the Kata Containers runtime will not add all the hypervisor threads to
the requested cgroup path, only the vCPU threads; the remaining threads run
unconstrained.

This mitigates the risk of the VMM and other threads being killed by an
out-of-memory (`OOM`) event.
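
A hedged shell sketch of the steps above, using cgroup v1 and hypothetical
paths, IDs, and thread IDs (in practice the runtime discovers the vCPU thread
IDs from the VMM):

```sh
# Pod-level cgroup derived from the PodSandbox container, renamed with kata_.
podCgroup=/sys/fs/cgroup/cpu/kubepods/kata_podsandbox1

# Limit it with the sum of all container limits in the sandbox (example value).
echo 400000 > "$podCgroup/cpu.cfs_quota_us"

# Move only the hypervisor vCPU threads; writing to "tasks" moves individual
# threads, so I/O and other VMM threads stay unconstrained.
for tid in 4201 4202 4203 4204; do
    echo "$tid" > "$podCgroup/tasks"
done

# Each kata-shim goes to its own (renamed) container cgroup.
echo "$SHIM_PID" > /sys/fs/cgroup/cpu/kubepods/kata_container1/cgroup.procs

# The hypervisor and applicable threads go to the /kata memory cgroup.
echo "$VMM_PID" > /sys/fs/cgroup/memory/kata/cgroup.procs
```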


#### Impact

If resources are reserved at a system level to account for the overheads of
running sandbox containers, this configuration can be used with adequate
stability. In this scenario, however, non-negligible amounts of CPU and memory
are consumed on the host without being accounted for.
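
One way to make such a reservation is through the Kubelet; this is a sketch
with illustrative sizes (the same values can also be set in the Kubelet config
file):

```sh
# Reserve host CPU and memory for system components, leaving headroom for the
# unconstrained Kata overhead that the scheduler cannot see.
kubelet --system-reserved=cpu=500m,memory=1Gi
```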

[linux-config]: https://github.com/opencontainers/runtime-spec/blob/master/config-linux.md
[cgroupspath]: https://github.com/opencontainers/runtime-spec/blob/master/config-linux.md#cgroups-path

## Summary

| cgroup option | default? | status | pros | cons |
|-|-|-|-|-|
| `SandboxCgroupOnly=false` | yes | legacy | Easiest to make Kata work | Unaccounted-for memory and resource utilization |
| `SandboxCgroupOnly=true` | no | recommended | Complete tracking of Kata memory and CPU utilization. In Kubernetes, the Kubelet can fully constrain Kata via the pod cgroup | Requires an upper-layer orchestrator that sizes the sandbox cgroup appropriately |
98 changes: 8 additions & 90 deletions design/cpu-constraints.md → design/vcpu-handling.md
@@ -1,17 +1,12 @@
-* [CPU constraints in Kata Containers](#cpu-constraints-in-kata-containers)
-* [Default number of virtual CPUs](#default-number-of-virtual-cpus)
-* [Virtual CPUs and Kubernetes pods](#virtual-cpus-and-kubernetes-pods)
-* [Container lifecycle](#container-lifecycle)
-* [Container without CPU constraint](#container-without-cpu-constraint)
-* [Container with CPU constraint](#container-with-cpu-constraint)
-* [Do not waste resources](#do-not-waste-resources)
-* [CPU cgroups](#cpu-cgroups)
-* [cgroups in the guest](#cgroups-in-the-guest)
-* [CPU pinning](#cpu-pinning)
-* [cgroups in the host](#cgroups-in-the-host)
+- [Virtual machine vCPU sizing in Kata Containers](#virtual-machine-vcpu-sizing-in-kata-containers)
+  * [Default number of virtual CPUs](#default-number-of-virtual-cpus)
+  * [Virtual CPUs and Kubernetes pods](#virtual-cpus-and-kubernetes-pods)
+  * [Container lifecycle](#container-lifecycle)
+  * [Container without CPU constraint](#container-without-cpu-constraint)
+  * [Container with CPU constraint](#container-with-cpu-constraint)
+  * [Do not waste resources](#do-not-waste-resources)


-# CPU constraints in Kata Containers
+# Virtual machine vCPU sizing in Kata Containers

## Default number of virtual CPUs

@@ -171,83 +166,6 @@ docker run --cpus 4 -ti debian bash -c "nproc; cat /sys/fs/cgroup/cpu,cpuacct/cp
```


-## CPU cgroups
-
-Kata Containers runs over two layers of cgroups, the first layer is in the guest where
-only the workload is placed, the second layer is in the host that is more complex and
-might contain more than one process and task (thread) depending of the number of
-containers per POD and vCPUs per container. The following diagram represents a Nginx container
-created with `docker` with the default number of vCPUs.
-
-
-```
-$ docker run -dt --runtime=kata-runtime nginx
-.-------.
-| Nginx |
-.--'-------'---. .------------.
-| Guest Cgroup | | Kata agent |
-.-'--------------'--'------------'. .-----------.
-| Thread: Hypervisor's vCPU 0 | | Kata Shim |
-.'---------------------------------'. .'-----------'.
-| Tasks | | Processes |
-.'-----------------------------------'--'-------------'.
-| Host Cgroup |
-'------------------------------------------------------'
-```
-
-The next sections explain the difference between processes and tasks and why only hypervisor
-vCPUs are constrained.
-
-### cgroups in the guest
-
-Only the workload process including all its threads are placed into CPU cgroups, this means
-that `kata-agent` and `systemd` run without constraints in the guest.
-
-#### CPU pinning
-
-Kata Containers tries to apply and honor the cgroups but sometimes that is not possible.
-An example of this occurs with CPU cgroups when the number of virtual CPUs (in the guest)
-does not match the actual number of physical host CPUs.
-In Kata Containers to have a good performance and small memory footprint, the resources are
-hot added when they are needed, therefore the number of virtual resources is not the same
-as the number of physical resources. The problem with this approach is that it's not possible
-to pin a process on a specific resource that is not present in the guest. To deal with this
-limitation and to not fail when the container is being created, Kata Containers does not apply
-the constraint in the first layer (guest) if the resource does not exist in the guest, but it
-is applied in the second layer (host) where the hypervisor is running. The constraint is applied
-in both layers when the resource is available in the guest and host. The next sections provide
-further details on what parts of the hypervisor are constrained.
-
-### cgroups in the host
-
-In Kata Containers the workloads run in a virtual machine that is managed and represented by a
-hypervisor running in the host. Like other processes the hypervisor might use threads to realize
-several tasks, for example IO and Network operations. One of the most important uses for the
-threads is as vCPUs. The processes running in the guest see these vCPUs as physical CPUs, while
-in the host those vCPU are just threads that are part of a process. This is the key to ensure
-workloads consumes only the amount of CPU resources that were assigned to it without impacting
-other operations. From user perspective the easier approach to implement it would be to take the
-whole hypervisor including its threads and move them into the cgroup, unfortunately this will
-impact negatively the performance, since vCPUs, IO and Network threads will be fighting for
-resources. The following table shows a random read performance comparison between a Kata Container
-with all its hypervisor threads in the cgroup and other with only its hypervisor vCPU threads
-constrained, the difference is huge.
-
-
-| Bandwidth | All threads | vCPU threads | Units |
-|:-------------:|:-------------:|:------------:|:-----:|
-| 4k | 136.2 | 294.7 | MB/s |
-| 8k | 166.6 | 579.4 | MB/s |
-| 16k | 178.3 | 1093.3 | MB/s |
-| 32k | 179.9 | 1931.5 | MB/s |
-| 64k | 213.6 | 3994.2 | MB/s |
-
-
-To have the best performance in Kata Containers only the vCPU threads are constrained.
-
-
[1]: https://docs.docker.com/config/containers/resource_constraints/#cpu
[2]: https://kubernetes.io/docs/tasks/configure-pod-container/assign-cpu-resource
[3]: https://kubernetes.io/docs/concepts/workloads/pods/pod/
