DevOps in K8s — QoS Deep Dive
DevOps in K8s bootcamp series

How Does QoS Work?
K8s has revolutionized the way we think about and manage containerized applications. While it provides numerous benefits such as scalability, resiliency, and flexible deployment, understanding the intricacies of how it schedules and manages pods is vital for optimal performance. One such aspect that demands attention is memory management, especially when a system runs into Out of Memory (OOM) scenarios.
First and foremost, it’s crucial to understand that during pod scheduling, K8s’ scheduler primarily considers the requests value. This ensures that resources are allocated efficiently based on the requirements stated by each pod.
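The idea that scheduling is driven by requests can be illustrated with a small sketch. It assumes a deliberately simplified model with a single resource (memory, in bytes); the function name `fits_on_node` and its parameters are hypothetical, and the real scheduler evaluates every resource plus many other constraints.

```shell
# fits_on_node ALLOCATABLE ALREADY_REQUESTED POD_REQUEST  (all in bytes)
# Succeeds when the sum of requests, including the new pod, stays within
# the node's allocatable memory. Actual usage and limits play no role here.
fits_on_node() {
  [ $(( $2 + $3 )) -le "$1" ]
}

GIB=$((1024 * 1024 * 1024))

# A node with 8 GiB allocatable, 6 GiB already requested, new pod asks for 1 GiB.
if fits_on_node $((8 * GIB)) $((6 * GIB)) $((1 * GIB)); then
  echo "schedulable"
else
  echo "not schedulable"
fi
```

The pod's limits never enter the check, which is why a node can end up overcommitted on memory even though every pod passed this test.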
However, memory management doesn’t end at efficient allocation. Handling scenarios where the system runs out of memory is equally important. This is where OOMScore comes into play. OOMScore is a per-process value that the Linux kernel's OOM killer uses to decide which processes to terminate first when memory runs short: the higher the score, the sooner the process is killed.
You might wonder how the system decides which processes to terminate first during OOM scenarios. The answer lies in the OOM score of each process. The score the kernel has computed for a process can be checked with the command cat /proc/$PID/oom_score. Kubernetes influences that score through the adjustment value in /proc/$PID/oom_score_adj, which ranges from -1000 to 1000; the defaults listed below are these adjustment values.
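To see both values side by side, a minimal sketch like the following can be used. The helper name `show_oom` and its optional second argument are assumptions made for illustration; the only real interfaces involved are the `oom_score` and `oom_score_adj` files under `/proc/<pid>/`.

```shell
# show_oom PID [PROC_ROOT]
# Prints the kernel-computed score and the adjustment value for one process.
# PROC_ROOT defaults to /proc; it is a hypothetical parameter that only
# exists so the sketch can be tried against a fake directory.
show_oom() {
  proc_root=${2:-/proc}
  printf 'oom_score=%s oom_score_adj=%s\n' \
    "$(cat "$proc_root/$1/oom_score")" \
    "$(cat "$proc_root/$1/oom_score_adj")"
}

# Example: inspect the current shell when running on Linux.
if [ -r "/proc/$$/oom_score" ]; then
  show_oom "$$"
fi
```

On a node, passing the PID of a container's main process instead of `$$` shows the adjustment the kubelet applied to that container.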
Default OOMScores
- Guaranteed Pods: These have an OOMScore adjustment (oom_score_adj) defaulting to -997. As the name suggests, these pods are given a higher guarantee in terms of resource allocation and are least likely to be terminated during OOM scenarios.
- Burstable Pods: Their OOMScore lies between 2 and 999. They have a moderate level of resource guarantee, making them more prone to termination during OOM situations compared to Guaranteed pods but less so than BestEffort pods.
- BestEffort Pods: With an OOMScore of 1000, these pods receive the lowest priority in resource allocation. Thus, they are the first candidates for termination when the system faces memory shortage.
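The Burstable range of 2 to 999 comes from a formula that scales with how much of the node's memory the container requests. The sketch below follows the formula given in the Kubernetes documentation, min(max(2, 1000 - 1000 * request / capacity), 999); the function name is hypothetical and the exact clamping inside the kubelet may differ slightly between versions.

```shell
# burstable_oom_score_adj MEMORY_REQUEST_BYTES NODE_CAPACITY_BYTES
# The larger the share of node memory a container requests, the lower
# (safer) its adjustment becomes. Uses integer arithmetic only.
burstable_oom_score_adj() {
  adj=$(( 1000 - (1000 * $1) / $2 ))
  if [ "$adj" -lt 2 ]; then adj=2; fi       # stay above Guaranteed pods
  if [ "$adj" -gt 999 ]; then adj=999; fi   # stay below BestEffort pods
  echo "$adj"
}

GIB=$((1024 * 1024 * 1024))
# A container requesting 1 GiB on a node with 8 GiB of memory:
burstable_oom_score_adj $((1 * GIB)) $((8 * GIB))   # prints 875
```

A container that requests almost nothing lands near 999 and is treated almost like BestEffort, while one requesting most of the node approaches the lower bound.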
QoS and CGroups
K8s uses cgroups to enforce the Quality of Service (QoS) class of its pods. The kubelet has an option, --cgroups-per-qos, which is enabled by default. When it is on, the kubelet creates a dedicated cgroup level for each QoS class.
Within these QoS-level cgroups, further levels are created for each pod and for every container in it, so resources are organized and restricted step by step from the QoS class, to the pod, and then to the container.
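The QoS class that selects the cgroup level is derived from the requests and limits in the pod spec. The following is a minimal sketch for a pod with a single container; the function name `qos_class` is hypothetical, values are compared as plain strings (so `500m` and `0.5` would wrongly count as different), and multi-container pods are not handled.

```shell
# qos_class CPU_REQUEST CPU_LIMIT MEM_REQUEST MEM_LIMIT
# Pass an empty string "" for any value that is not set in the manifest.
qos_class() {
  if [ -z "$1$2$3$4" ]; then
    # Nothing set at all.
    echo BestEffort
  elif [ -n "$2" ] && [ -n "$4" ] \
       && [ "${1:-$2}" = "$2" ] && [ "${3:-$4}" = "$4" ]; then
    # Both limits set, and each request equals its limit
    # (an unset request defaults to the limit).
    echo Guaranteed
  else
    echo Burstable
  fi
}

qos_class 250m 500m 1Gi 2Gi   # prints Burstable
qos_class 500m 500m 1Gi 1Gi   # prints Guaranteed
qos_class "" "" "" ""         # prints BestEffort
```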
When utilizing the “containerd” runtime, as opposed to the traditional docker, the path for cgroups slightly changes:
- For the Guaranteed category, the cgroup is established at:
RootCgroup/system.slice/containerd.service/kubepods-pod<uid>.slice:cri-containerd:<container-id>
- The Burstable category gets its cgroup at:
RootCgroup/system.slice/containerd.service/kubepods-burstable-pod<uid>.slice:cri-containerd:<container-id>
- The BestEffort type is assigned to:
RootCgroup/system.slice/containerd.service/kubepods-besteffort-pod<uid>.slice:cri-containerd:<container-id>
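The three patterns above differ only in the prefix before the pod uid, which a small helper can make explicit. This is a sketch that simply reproduces the layout described in this article; the function name is hypothetical, the path is relative to RootCgroup, and the uid and container ID should be passed exactly as they appear on your node, since other runtime or cgroup-driver setups may lay paths out differently.

```shell
# pod_cgroup_path QOS_CLASS POD_UID CONTAINER_ID
# Prints the cgroup path, relative to RootCgroup, for one container.
pod_cgroup_path() {
  case "$1" in
    Guaranteed) prefix="kubepods-pod" ;;
    Burstable)  prefix="kubepods-burstable-pod" ;;
    BestEffort) prefix="kubepods-besteffort-pod" ;;
    *) echo "unknown QoS class: $1" >&2; return 1 ;;
  esac
  echo "system.slice/containerd.service/${prefix}${2}.slice:cri-containerd:${3}"
}

# Placeholder uid and container ID, for illustration only.
pod_cgroup_path Burstable "<uid>" "<container-id>"
```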
You can check the RootCgroup using the mount command:
$ mount | grep cgroup
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
Under each cgroup subsystem, QoS-level cgroups are created. Within each QoS-level cgroup, a Pod-level cgroup is in turn created for every pod. For instance, when we create a Pod as illustrated below:
# qos-demo.yaml
apiVersion: v1
kind: Pod
metadata:
  name: qos-demo
spec:
  containers:
  - name: nginx
    image: nginx:latest
    resources:
      requests:
        cpu: 250m
        memory: 1Gi
      limits:
        cpu: 500m
        memory: 2Gi
And create the above Pod:
$ kubectl apply -f qos-demo.yaml
$ kubectl get pods qos-demo -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
qos-demo 1/1 Running 0 5m10s 10.244.1.22 node1 <none> <none>
$ kubectl get pods qos-demo -o yaml |grep uid
uid: 489a19f2-8d75-474c-976f-5854b61b926c
$ kubectl get pods qos-demo -o yaml |grep qosClass
qosClass: Burstable
Since the resource settings for this pod have requests not equal to limits, it is classified as a Burstable pod. The kubelet will create the cgroup level at RootCgroup/system.slice/containerd.service/kubepods-burstable-pod<uid>.slice:cri-containerd:<container-id> under its corresponding QoS. For example, when we inspect the memory subsystem's cgroup:
$ ls /sys/fs/cgroup/memory/system.slice/containerd.service/kubepods-burstable-podxxxx.slice:cri-containerd:xxxx
cgroup.clone_children memory.kmem.tcp.max_usage_in_bytes memory.oom_control
cgroup.event_control memory.kmem.tcp.usage_in_bytes memory.pressure_level
cgroup.procs memory.kmem.usage_in_bytes memory.soft_limit_in_bytes
memory.failcnt memory.limit_in_bytes memory.stat
memory.force_empty memory.max_usage_in_bytes memory.swappiness
memory.kmem.failcnt memory.memsw.failcnt memory.usage_in_bytes
memory.kmem.limit_in_bytes memory.memsw.limit_in_bytes memory.use_hierarchy
memory.kmem.max_usage_in_bytes memory.memsw.max_usage_in_bytes notify_on_release
memory.kmem.slabinfo memory.memsw.usage_in_bytes tasks
memory.kmem.tcp.failcnt memory.move_charge_at_immigrate
memory.kmem.tcp.limit_in_bytes memory.numa_stat
If we check memory.limit_in_bytes:
$ cat memory.limit_in_bytes
2147483648 # 2147483648 / 1024 / 1024 / 1024 = 2 (GiB)
Similarly, for cpu, we can find the following:
$ ls /sys/fs/cgroup/cpu/system.slice/containerd.service/kubepods-burstable-podxxxx.slice:cri-containerd:xxxx
cgroup.clone_children cpuacct.stat cpu.cfs_period_us cpu.rt_runtime_us notify_on_release
cgroup.event_control cpuacct.usage cpu.cfs_quota_us cpu.shares tasks
cgroup.procs cpuacct.usage_percpu cpu.rt_period_us cpu.stat
And if you check cpu.cfs_quota_us:
$ cat cpu.cfs_quota_us
50000 # 500m, assuming the default cpu.cfs_period_us of 100000 (50000 / 100000 = 0.5 CPU)
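Both cgroup values seen above can be reproduced from the limits in qos-demo.yaml with a little arithmetic. The sketch below assumes the default CFS period of 100000 microseconds; the function names are hypothetical and only whole millicores and whole GiB are handled.

```shell
# cpu_limit_to_quota MILLICORES [PERIOD_US]
# Converts a CPU limit in millicores to a cpu.cfs_quota_us value.
# PERIOD_US defaults to 100000, the usual cpu.cfs_period_us.
cpu_limit_to_quota() {
  echo $(( $1 * ${2:-100000} / 1000 ))
}

# gib_to_bytes GIB
# Converts a memory limit in GiB to a memory.limit_in_bytes value.
gib_to_bytes() {
  echo $(( $1 * 1024 * 1024 * 1024 ))
}

cpu_limit_to_quota 500   # prints 50000      (limits.cpu: 500m)
gib_to_bytes 2           # prints 2147483648 (limits.memory: 2Gi)
```

Comparing these computed values with what the cgroup files contain is a quick way to confirm that you are looking at the cgroup of the right container.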




