DevOps in K8s — QoS Deep Dive

DevOps in K8s bootcamp series

How Does QoS Work?

K8s has revolutionized the way we think about and manage containerized applications. While it provides numerous benefits such as scalability, resiliency, and flexible deployment, understanding the intricacies of how it schedules and manages pods is vital for optimal performance. One such aspect that demands attention is memory management, especially when a system runs into Out of Memory (OOM) scenarios.

During scheduling, the K8s scheduler places a pod based on its resource requests, not its limits. A node is a candidate only if the requests of the pods already on it, plus the new pod's requests, fit within the node's allocatable capacity.
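The request-based placement check can be sketched in a few lines. This is a deliberately simplified illustration, not the scheduler's actual code: the function name `fits_on_node`, the dict layout, and the example node are all assumptions made for the sketch, and the real scheduler applies many more filters and scoring rules.

```python
def fits_on_node(allocatable, scheduled_requests, pod_request):
    """Return True if the pod's requests fit into what is left on the node.

    All values use the same units per resource, for example CPU in
    millicores and memory in bytes. Limits are intentionally ignored.
    """
    for resource, wanted in pod_request.items():
        used = sum(r.get(resource, 0) for r in scheduled_requests)
        if used + wanted > allocatable.get(resource, 0):
            return False
    return True


GIB = 1024 ** 3
node = {"cpu": 2000, "memory": 4 * GIB}        # hypothetical node: 2 CPUs, 4 GiB
running = [{"cpu": 1500, "memory": 1 * GIB}]   # requests of pods already placed

print(fits_on_node(node, running, {"cpu": 250, "memory": 1 * GIB}))  # True
print(fits_on_node(node, running, {"cpu": 750, "memory": 1 * GIB}))  # False
```

The second pod is rejected because 1500m + 750m exceeds the node's 2000m of CPU, even if the running pods are idle at that moment.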

However, memory management doesn't end at placement. The node also has to cope when it actually runs out of memory, and this is where the OOM score comes into play. The Linux kernel keeps an OOM score for every process, and when memory runs short the OOM killer terminates the process with the highest score first.

How does the kernel decide which processes go first? It looks at each process's OOM score, which you can read with cat /proc/$PID/oom_score. K8s influences that score through the per-process adjustment in /proc/$PID/oom_score_adj, which ranges from -1000 to 1000. The OOMScore values quoted below are these adjustment values, and a lower value makes a process less likely to be killed.
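To look at these values from a script rather than with cat, a small helper is enough. This is a minimal sketch for a Linux host; the function name `read_oom_values` and its `proc_root` parameter are choices made here for illustration.

```python
from pathlib import Path


def read_oom_values(pid, proc_root="/proc"):
    """Return (oom_score, oom_score_adj) for a process as integers."""
    base = Path(proc_root) / str(pid)
    score = int((base / "oom_score").read_text().strip())
    adj = int((base / "oom_score_adj").read_text().strip())
    return score, adj


# "self" resolves to the current process on Linux; skipped elsewhere.
if __name__ == "__main__" and Path("/proc/self/oom_score").exists():
    print(read_oom_values("self"))
```

Passing the PID of a container's main process instead of "self" shows the adjustment that was applied to that container.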

Default OOMScores

Guaranteed Pods: oom_score_adj defaults to -998 (recent Kubernetes documentation lists -997). As the name suggests, these pods get the strongest resource guarantee and are the least likely to be terminated during OOM scenarios.
Burstable Pods: oom_score_adj lies between 2 and 999 and shrinks as the container's memory request grows relative to the node's memory capacity. These pods have a moderate resource guarantee: they are more likely to be terminated than Guaranteed pods, but less likely than BestEffort pods.
BestEffort Pods: oom_score_adj is 1000, the maximum. These pods have the lowest priority and are the first candidates for termination when the node runs short of memory.
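The three cases above can be expressed as one small function. This is a sketch under assumptions: the Burstable formula follows the one given in the Kubernetes node-pressure eviction documentation, the Guaranteed constant is the value quoted in this article and varies by version, and the 8 GiB node is hypothetical.

```python
GUARANTEED_OOM_SCORE_ADJ = -998   # value quoted in this article; version-dependent
BESTEFFORT_OOM_SCORE_ADJ = 1000


def oom_score_adj(qos_class, memory_request_bytes=0, node_memory_bytes=1):
    """Approximate the oom_score_adj given to a container of a QoS class."""
    if qos_class == "Guaranteed":
        return GUARANTEED_OOM_SCORE_ADJ
    if qos_class == "BestEffort":
        return BESTEFFORT_OOM_SCORE_ADJ
    # Burstable: the larger the request relative to the node, the lower the value.
    raw = 1000 - (1000 * memory_request_bytes) // node_memory_bytes
    return min(max(2, raw), 999)


GIB = 1024 ** 3
print(oom_score_adj("Guaranteed"))                    # -998
print(oom_score_adj("Burstable", 1 * GIB, 8 * GIB))   # 875
print(oom_score_adj("Burstable", 0, 8 * GIB))         # 999
print(oom_score_adj("BestEffort"))                    # 1000
```

A Burstable container that requests most of the node's memory ends up close to 2, while one with no memory request sits at 999, just below BestEffort.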

QoS and CGroups

K8s uses cgroups to enforce the Quality of Service (QoS) class of its pods. The kubelet option --cgroups-per-qos, which is enabled by default, makes the kubelet create a dedicated cgroup level for each QoS class.

Inside these QoS cgroups, the kubelet creates a further level for each pod and, below that, for each container, so resources are organized and restricted step by step from the QoS class, to the pod, and then to the container.

With the “containerd” runtime, as opposed to the traditional docker, the cgroup paths look slightly different:

For the Guaranteed category, the cgroup is established at: RootCgroup/system.slice/containerd.service/kubepods-pod<uid>.slice:cri-containerd:<container-id>
The Burstable category gets its cgroup at: RootCgroup/system.slice/containerd.service/kubepods-burstable-pod<uid>.slice:cri-containerd:<container-id>
The BestEffort type is assigned to: RootCgroup/system.slice/containerd.service/kubepods-besteffort-pod<uid>.slice:cri-containerd:<container-id>
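The three path templates differ only in their prefix, which a short helper makes explicit. This sketch just fills in the templates listed above; the function name `containerd_cgroup_path` is made up for illustration, and the default root assumes cgroup v1 subsystems mounted under /sys/fs/cgroup. The exact layout on a real node depends on the cgroup driver and version in use.

```python
QOS_PREFIX = {
    "Guaranteed": "kubepods-pod",
    "Burstable": "kubepods-burstable-pod",
    "BestEffort": "kubepods-besteffort-pod",
}


def containerd_cgroup_path(qos_class, pod_uid, container_id,
                           subsystem="memory", root="/sys/fs/cgroup"):
    """Fill in the article's path template for one container."""
    leaf = f"{QOS_PREFIX[qos_class]}{pod_uid}.slice:cri-containerd:{container_id}"
    return f"{root}/{subsystem}/system.slice/containerd.service/{leaf}"


# Placeholders are kept as-is; substitute a real pod UID and container ID.
print(containerd_cgroup_path("Burstable", "<uid>", "<container-id>"))
```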

You can check the RootCgroup using the mount command:

$ mount | grep cgroup
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
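If you need the mount point of a particular subsystem in a script, the mount output above is easy to parse. The following is a minimal sketch for the cgroup v1 layout shown here; the function name `cgroup_mounts` is an illustrative choice, and it relies on the controller names appearing in the last path component, as they do in this listing.

```python
def cgroup_mounts(mount_output):
    """Map each cgroup v1 controller name to its mount point."""
    mounts = {}
    for line in mount_output.splitlines():
        parts = line.split()
        # Expected shape: "cgroup on <path> type cgroup (<options>)"
        if len(parts) < 6 or parts[0] != "cgroup" or parts[4] != "cgroup":
            continue
        path = parts[2]
        # In the listing above the directory name carries the controllers,
        # for example "cpu,cpuacct".
        for controller in path.rsplit("/", 1)[-1].split(","):
            mounts[controller] = path
    return mounts


sample = """\
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
"""
print(cgroup_mounts(sample)["memory"])   # /sys/fs/cgroup/memory
print(cgroup_mounts(sample)["cpu"])      # /sys/fs/cgroup/cpu,cpuacct
```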

Under each cgroup subsystem, QoS-level cgroups are created, and inside each QoS-level cgroup a pod-level cgroup is created for every pod. For instance, consider the following Pod:

# qos-demo.yaml
apiVersion: v1
kind: Pod
metadata:
  name: qos-demo
spec:
  containers:
  - name: nginx
    image: nginx:latest
    resources:
      requests:
        cpu: 250m
        memory: 1Gi
      limits:
        cpu: 500m
        memory: 2Gi

Create the Pod, then look up its status, UID, and QoS class:

$ kubectl apply -f qos-demo.yaml
$ kubectl get pods qos-demo -o wide
NAME       READY   STATUS    RESTARTS   AGE     IP            NODE    NOMINATED NODE   READINESS GATES
qos-demo   1/1     Running   0          5m10s   10.244.1.22   node1   <none>           <none>
$ kubectl get pods qos-demo -o yaml |grep uid
uid: 489a19f2-8d75-474c-976f-5854b61b926c
$ kubectl get pods qos-demo -o yaml |grep qosClass
qosClass: Burstable
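The qosClass reported above can be reproduced with a small classifier. This is a simplified sketch of the classification rules as described in the Kubernetes documentation, not the kubelet's code: the function name `qos_class` and the dict layout are assumptions, only cpu and memory are considered, and quantities are compared as written (so "1" and "1000m" would count as different).

```python
def qos_class(containers):
    """Classify a pod from its containers' resources (cpu and memory only)."""
    any_set = False
    guaranteed = True
    for c in containers:
        requests = c.get("requests", {})
        limits = c.get("limits", {})
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # An omitted request defaults to the limit when a limit is set.
            request = requests.get(resource, limit)
            if request is not None or limit is not None:
                any_set = True
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


demo = [{"requests": {"cpu": "250m", "memory": "1Gi"},
         "limits": {"cpu": "500m", "memory": "2Gi"}}]
print(qos_class(demo))                                          # Burstable
print(qos_class([{"limits": {"cpu": "1", "memory": "1Gi"}}]))   # Guaranteed
print(qos_class([{}]))                                          # BestEffort
```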

Because this pod's requests differ from its limits, it is classified as Burstable. The kubelet creates its cgroup at RootCgroup/system.slice/containerd.service/kubepods-burstable-pod<uid>.slice:cri-containerd:<container-id> under the corresponding QoS level. For example, inspecting the memory subsystem's cgroup:

$ ls /sys/fs/cgroup/memory/system.slice/containerd.service/kubepods-burstable-podxxxx.slice:cri-containerd:xxxx
cgroup.clone_children           memory.kmem.tcp.max_usage_in_bytes  memory.oom_control
cgroup.event_control            memory.kmem.tcp.usage_in_bytes      memory.pressure_level
cgroup.procs                    memory.kmem.usage_in_bytes          memory.soft_limit_in_bytes
memory.failcnt                  memory.limit_in_bytes               memory.stat
memory.force_empty              memory.max_usage_in_bytes           memory.swappiness
memory.kmem.failcnt             memory.memsw.failcnt                memory.usage_in_bytes
memory.kmem.limit_in_bytes      memory.memsw.limit_in_bytes         memory.use_hierarchy
memory.kmem.max_usage_in_bytes  memory.memsw.max_usage_in_bytes     notify_on_release
memory.kmem.slabinfo            memory.memsw.usage_in_bytes         tasks
memory.kmem.tcp.failcnt         memory.move_charge_at_immigrate
memory.kmem.tcp.limit_in_bytes  memory.numa_stat

Reading memory.limit_in_bytes in that directory shows the pod's 2Gi memory limit:

$ cat memory.limit_in_bytes
2147483648 # 2147483648 / 1024 / 1024 / 1024 = 2 (GiB)
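The same conversion can be done in the other direction, from the quantity written in the manifest to the byte value stored in the cgroup. This is a minimal sketch that handles only the binary suffixes (Ki, Mi, Gi, Ti); the function name `memory_to_bytes` is an illustrative choice, and other Kubernetes quantity forms are not covered.

```python
BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3, "Ti": 1024 ** 4}


def memory_to_bytes(quantity):
    """Convert a quantity such as "2Gi" or "512Mi" to bytes."""
    for suffix, factor in BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: already a plain byte count


print(memory_to_bytes("2Gi"))   # 2147483648, the limit seen in the cgroup
print(memory_to_bytes("1Gi"))   # 1073741824, the pod's memory request
```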

Similarly, for the cpu subsystem:

$ ls /sys/fs/cgroup/cpu/system.slice/containerd.service/kubepods-burstable-podxxxx.slice:cri-containerd:xxxx
cgroup.clone_children  cpuacct.stat          cpu.cfs_period_us  cpu.rt_runtime_us  notify_on_release
cgroup.event_control   cpuacct.usage         cpu.cfs_quota_us   cpu.shares         tasks
cgroup.procs           cpuacct.usage_percpu  cpu.rt_period_us   cpu.stat

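Reading cpu.cfs_quota_us shows the 500m CPU limit, expressed as microseconds of CPU time per scheduling period:

$ cat cpu.cfs_quota_us
50000  # 500m, with the default cpu.cfs_period_us of 100000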
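The arithmetic behind these CPU values is short enough to write down. This sketch reflects my understanding of how millicores map to the cgroup v1 files: the quota is the limit's share of one period, and cpu.shares scales the request so that one full CPU equals 1024. The function names are made up for illustration, and edge cases beyond the minimum of 2 shares are ignored.

```python
def cpu_limit_to_quota_us(milli_cpu, period_us=100_000):
    """CPU limit in millicores -> cpu.cfs_quota_us for the given period."""
    return milli_cpu * period_us // 1000


def cpu_request_to_shares(milli_cpu):
    """CPU request in millicores -> cpu.shares (1000m corresponds to 1024)."""
    return max(2, milli_cpu * 1024 // 1000)


print(cpu_limit_to_quota_us(500))   # 50000, matching cpu.cfs_quota_us above
print(cpu_request_to_shares(250))   # 256
```

With a 100000 microsecond period, a quota of 50000 lets the container run for at most half of each period, which is exactly the 500m limit from the manifest.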