The most critical Kubernetes metrics to monitor
Kubernetes. Nowadays, it seems companies in the industry are divided into two pools: those that already use it heavily for their production workloads and those that are migrating their workloads into it.
The issue with Kubernetes is that it is not a single system the way Redis, RabbitMQ, or PostgreSQL are. It is a combination of several control plane components (for example, etcd and the API server) that run our workloads on the user (data) plane over a fleet of VMs.
The number of metrics coming out of control plane components, VMs, and your workloads might be overwhelming at first glance. Forming a comprehensive observability stack out of those metrics requires decent knowledge and experience with managing Kubernetes clusters.
So how can you handle the flood of metrics? Reading this post might be a good place to start.
We’ll cover the most critical metrics based on k8s’s metadata that form a good baseline for monitoring your workloads and ensuring they’re in a healthy state. In order to have these metrics available, you’ll need to install kube-state-metrics and Prometheus to scrape the metrics it exposes and store them for querying later on.
We’re not going to cover the installation process here, but a good starting point is the Prometheus Helm Chart, which installs both with default settings.
For each of the listed metrics, we’ll cover what the metric signifies, why you should care about it, and how you should set your alerts within it.
What: Every container can (and should!) define requests for CPU and memory. The Kubernetes scheduler uses these requests to make sure it selects a node that has the capacity to host the pod. It does that by calculating the unreserved resources on the node: the node’s capacity minus the requests of the pods currently scheduled on it.
Let’s look at an example to make this clearer: Say you have a node with eight CPU cores running three pods, and each pod has a single container that requests one CPU core. The node has five unreserved CPU cores for the scheduler to work with when it is assigning pods.
Keep in mind that by “available” we’re not referring to actual usage but rather to CPU cores that haven’t been requested (reserved) by pods currently scheduled into the node. A pod that requires six CPU cores won’t be scheduled into this node since there are not enough available CPU cores to host it.
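The arithmetic from the example above can be sketched in a few lines of Python. This is a deliberately simplified model, not the scheduler's actual algorithm: it looks at CPU requests only and ignores everything else the real scheduler takes into account. The function names are made up for illustration.

```python
def unreserved_cpu(node_capacity_cores, pod_requests_cores):
    """Cores on the node that no scheduled pod has requested (reserved)."""
    return node_capacity_cores - sum(pod_requests_cores)


def fits(node_capacity_cores, pod_requests_cores, new_pod_request_cores):
    """True if the new pod's request fits into the unreserved capacity."""
    free = unreserved_cpu(node_capacity_cores, pod_requests_cores)
    return new_pod_request_cores <= free


scheduled = [1, 1, 1]                # three pods, each requesting one core
print(unreserved_cpu(8, scheduled))  # 5
print(fits(8, scheduled, 6))         # False: only five cores are unreserved
print(fits(8, scheduled, 5))         # True
```

Note that actual usage never appears in this calculation. Only requests do.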
The “actual usage” metric tracks how much of a resource the pod uses during runtime. When we measure actual usage, it is usually across a fleet of pods (deployment, statefulset, etc.), so we should refer to a percentile rather than a single pod’s usage. The 90th percentile should be a good starting point for this matter.
For example, a deployment requiring 1 CPU core per pod might use 0.7 cores in the 90th percentile across its replicas.
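To make "90th percentile across replicas" concrete, here is a minimal sketch using the nearest-rank definition of a percentile. The usage samples are invented for illustration, and a monitoring system may interpolate between samples instead, so treat the exact method as an assumption.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct% of all samples are less than or equal to it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical CPU usage (in cores) of ten replicas that each request 1 core.
usage = [0.40, 0.45, 0.50, 0.50, 0.55, 0.60, 0.60, 0.65, 0.70, 0.95]
print(percentile(usage, 90))  # 0.7
```

The single replica at 0.95 cores does not move the 90th percentile, which is exactly why a percentile is a better summary than the maximum for a fleet of pods.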
Why: Keeping the requests and the actual usage aligned is important. Requests higher than the actual usage lead to inefficient resource usage (underutilization). Think of what happens when a pod that requests four cores uses one core in the 90th percentile.
K8s might schedule this pod onto a node with four free cores, which means no other pod will be able to use the three reserved cores that are not in use. Picture two such pods on one node: each reserves four cores but actually uses a single core, meaning we’re “wasting” six cores on the node. They’ll remain unused.
The same goes for memory. If we set the request higher than the usage, we’ll end up not using available memory.
The other option for misalignment is that the pod’s requests are lower than its actual usage (overutilization). In case of CPU overutilization, your applications will work slower due to insufficient resources on the node.
Imagine three pods. Each of them requests one core but actually uses three. These three pods might be scheduled onto an 8-core machine (3 pods × 1 requested core = 3 cores, which is less than 8), but once they are, they’ll compete for CPU time since their actual usage (9 cores) exceeds the number of cores on the node.
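The gap between what the scheduler sees and what happens at runtime can be sketched as follows. This is an illustrative model with a hypothetical function name. It simply compares summed requests and summed usage against node capacity, and it works the same way for cores or for GB of memory.

```python
def overcommit_check(node_capacity, requests, actual_usage):
    """Compare the scheduler's view (requests) with the runtime view (usage)."""
    return {
        # The scheduler only checks that the requests fit on the node.
        "schedulable": sum(requests) <= node_capacity,
        # At runtime, the pods contend once total usage exceeds capacity.
        "contended": sum(actual_usage) > node_capacity,
    }


# Three pods requesting 1 core each but using 3 cores each, on an 8-core node.
print(overcommit_check(8, [1, 1, 1], [3, 3, 3]))
# {'schedulable': True, 'contended': True}
```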
With CPU, you would experience slow application execution. When memory requests are lower than required, you might get into other kinds of issues.
If we have three pods, and each requests 1 GB of memory but uses 3 GB, they might all get scheduled onto a node with 8 GB of memory, even though together they need 9 GB. At runtime, when a process tries to allocate more memory than the node has, it will get OOMKilled (Out Of Memory Killed) by the kernel, and K8s will restart it.
When our process gets OOMKilled, it will probably lose any inflight requests and be unavailable until it boots back up, which leaves us under capacity. And once it has booted, it might suffer from a cold start due to cold caches or empty connection pools to its dependencies (databases, other services, etc.).
How: Let’s define the pod requests as 100%. A sane range for actual usage (CPU or memory, it doesn’t really matter) would be 60%–80% in the 90th percentile.
For example, if you have a pod that requests 10GB of memory, its 90th percentile of actual usage should be 6GB-8GB. If it is lower than 6GB, you would be underusing your memory and wasting money. If it is higher than 8GB, you will get to a point where you’re risking getting OOMKilled due to insufficient memory. The same rule we applied for memory requests can be applied for CPU requests.
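As a rough sketch of this rule, the check below classifies a workload by the ratio of its 90th-percentile usage to its request. The 60% and 80% bounds come from the text above. The function name and labels are made up for illustration.

```python
def classify_usage(request, p90_usage, low=0.60, high=0.80):
    """Classify p90 usage relative to the request (same unit for both)."""
    if request <= 0:
        raise ValueError("request must be positive")
    ratio = p90_usage / request
    if ratio < low:
        return "underutilized"  # paying for resources that sit idle
    if ratio > high:
        return "at risk"        # close to or above what was reserved
    return "ok"


# A pod requesting 10 GB of memory:
print(classify_usage(10, 5))  # underutilized
print(classify_usage(10, 7))  # ok
print(classify_usage(10, 9))  # at risk
```

The same function applies unchanged to CPU, for example `classify_usage(1, 0.7)` for a pod requesting one core and using 0.7 cores in the 90th percentile.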
What: While the scheduler uses resource requests to schedule workloads onto nodes, resource limits allow you to define boundaries for the resource usage of your workloads during runtime.
Why: It is very important to understand the way CPU and memory limits are being enforced so you are aware of the implications of your workloads crossing them:
When a container reaches its CPU limit, it gets throttled, meaning it gets fewer CPU cycles from the OS than it could have, which eventually results in slower execution. It doesn’t matter whether the node hosting the pod has free CPU cycles to spare or not: the limit is enforced by the Linux kernel through cgroups, whichever container runtime is in use, so the container is throttled either way.
It is very dangerous to be CPU throttled without being aware of it. Latencies of random flows in the system spike up, and it might be very hard to pinpoint the root cause if one of the components in the system is being throttled and you haven’t set the required observability beforehand. This situation could lead to partial service disruption or full unavailability in case the throttled service takes part in core flows on our system.
Memory limits are enforced differently from CPU limits: when your container reaches the memory limit, it will get OOMKilled. This has the same effect as being OOMKilled due to insufficient memory on the node: the process drops in-flight requests, the service is under capacity until the container restarts, and then it has a cold start.
If the process accumulates memory fast enough, it might get into the CrashLoop state. This state signals that the process is crashing over and over again, either on boot or a short time after starting. Crashlooping pods usually translate to the unavailability of the service.
How: The way to monitor resource limits is similar to the way we monitor CPU/memory requests. You should aim for actual usage of up to 80% of the limit at the 90th percentile. For example, if we have a pod that has a CPU limit of two cores and a memory limit of 2GB, the alert should be set for 1.6 cores of CPU usage or 1.6GB of memory usage.
Anything above that introduces the risk of your workload being throttled or restarted, depending on which limit it crosses.
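Here is a minimal sketch of that alert. It checks each resource against 80% of its limit and reports the consequence described above. The dictionary layout and names are assumptions made for the example, and the usage numbers are invented.

```python
# What happens when each kind of limit is crossed, per the text above.
CONSEQUENCE = {"cpu": "throttling", "memory": "OOMKill and restart"}


def limit_alerts(p90_usage, limits, fraction=0.80):
    """Return {resource: consequence} for every resource whose p90 usage
    exceeds the given fraction of its limit."""
    alerts = {}
    for resource, limit in limits.items():
        if p90_usage[resource] > limit * fraction:
            alerts[resource] = CONSEQUENCE[resource]
    return alerts


limits = {"cpu": 2.0, "memory": 2.0}  # cores and GB, as in the example above
print(limit_alerts({"cpu": 1.7, "memory": 1.2}, limits))
# {'cpu': 'throttling'}
```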
What: When you deploy an app, you set the number of desired replicas (pods) it should be running. Sometimes some of the pods might not be available for several reasons, such as:
- Some pods might not fit any of the running nodes in the cluster due to their resource requests. These pods will transition into Pending state until either a node frees up resources to host them or a new node that meets the requirements joins the cluster.
- Some pods might not pass liveness/readiness probes, meaning they are either restarting (liveness) or being taken out of the service endpoints (readiness).
- Some pods might reach their resource limits as mentioned above and get into CrashLoop state.
- Some pods might be hosted on a node that is malfunctioning for various reasons, and if the node is not healthy, chances are the pods hosted on it won’t function well.
Why: Having pods unavailable is not a healthy state for your system. It may result in anything from a minor service disruption to complete service unavailability, depending on the percentage of unavailable pods out of the desired number of replicas and the importance of the missing pods in core flows on your system.
How: The function we want to monitor here is the percentage of unavailable pods out of the desired number of pods. The exact percentage you should aim for in your KPIs depends on the criticality of the service and each of its pods in your system.
For some workloads, we might be OK with 5% of the pods being unavailable for a certain period as long as the system returns to a healthy state by itself and there’s no impact on customers. For some workloads, even one unavailable pod might become an issue. A good example of that would be statefulsets in which each pod has its unique identity, and having it unavailable might not be acceptable.
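The function described above can be sketched like this. The replica counts would typically come from kube-state-metrics (for deployments, the `kube_deployment_spec_replicas` and `kube_deployment_status_replicas_available` series are the likely inputs, though that depends on your setup). Here they are plain numbers, and the function names are hypothetical.

```python
def unavailable_pct(desired, available):
    """Percentage of desired pods that are currently unavailable."""
    if desired <= 0:
        raise ValueError("desired must be positive")
    return 100 * max(desired - available, 0) / desired


def breaches(desired, available, max_unavailable_pct):
    """True if more pods are unavailable than the workload tolerates."""
    return unavailable_pct(desired, available) > max_unavailable_pct


print(unavailable_pct(20, 19))  # 5.0
print(breaches(20, 19, 5))      # False: 5% is within a 5% tolerance
print(breaches(3, 2, 0))        # True: a statefulset tolerating no missing pod
```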
What: Horizontal Pod Autoscaler (HPA) is a k8s resource that allows you to adjust the number of replicas a workload is running according to a target function you define. The common use case is to auto-scale by the average CPU usage of pods across a deployment compared to the CPU requests.
Why: When a deployment’s number of replicas reaches the maximum defined in the HPA, you might get into a situation where you need more pods, but the HPA can’t scale up. The consequences might differ according to the scale-up function you’ve set. Here are two examples to make this clearer:
- If the scale-up function uses CPU usage, then the existing pods’ CPU usage will increase to a point where they’ll reach their limit and get throttled (see the discussion of resource limits above). This eventually results in lower throughput for your system.
- If the scale-up function uses custom metrics like the number of pending messages in a queue, the queue might start filling with pending messages, which will create a delay in your processing pipeline.
How: Monitoring this metric is pretty simple. You need to set an X% threshold for the division of the current number of replicas by the HPA max replicas. A sane X could be 85% to allow you to make the required changes before you hit the maximum.
Keep in mind that increasing the number of replicas might affect other parts of the system, so you might end up changing a lot more than an HPA configuration to enable this scale-up operation.
A classic example of that would be a database that hits its maximum connection limit when you increase the number of replicas and more pods try to connect to it. This is why taking a large enough buffer as preparation time makes a lot of sense in this case.
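The threshold check itself is a one-liner. The sketch below assumes you already have the current and maximum replica counts of the HPA as numbers. The function names are made up for illustration.

```python
def hpa_saturation(current_replicas, max_replicas):
    """Fraction of the HPA's maximum replica count currently in use."""
    if max_replicas <= 0:
        raise ValueError("max_replicas must be positive")
    return current_replicas / max_replicas


def near_max(current_replicas, max_replicas, threshold=0.85):
    """True once the workload has used up the threshold share of its headroom."""
    return hpa_saturation(current_replicas, max_replicas) >= threshold


print(near_max(10, 20))  # False: 50% of the maximum
print(near_max(18, 20))  # True: 90% of the maximum, time to plan ahead
```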
What: kubelet is a k8s agent that runs on each of the nodes on the cluster. Among its duties, the kubelet publishes a few metrics (called Node Conditions) to reflect the health status of the node it runs on:
- Ready: True if the node is healthy and ready to accept pods
- DiskPressure: True if the node’s disk is short of free storage
- MemoryPressure: True if the node is low on memory
- PIDPressure: True if there are too many processes running on the node
- NetworkUnavailable: True if the network for the node is not correctly configured
A healthy node should report True on the Ready condition and False on the other four conditions.
Why: If the Ready condition turns negative, or any of the other conditions turns positive, it probably means some or all of the workloads running on that node are not functioning well, and this is something you should be aware of.
For DiskPressure, MemoryPressure, and PIDPressure, the root causes are pretty clear: a process writes to disk, allocates memory, or spawns processes at a rate the node cannot sustain.
The Ready and NetworkUnavailable conditions are a bit trickier and require further investigation to get to the bottom of the issue.
How: I’d start by expecting exactly 0 nodes to be unhealthy so that every node that becomes unhealthy would trigger an alert.
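A sketch of that check: compare each node's reported conditions against the healthy values listed above, and alert as soon as the set of unhealthy nodes is non-empty. Condition statuses are modeled as the strings "True", "False", and "Unknown". The node names and the data layout are invented for the example.

```python
# Expected status of each condition on a healthy node.
HEALTHY = {
    "Ready": "True",
    "DiskPressure": "False",
    "MemoryPressure": "False",
    "PIDPressure": "False",
    "NetworkUnavailable": "False",
}


def failing_conditions(conditions):
    """Conditions whose status differs from the healthy value.
    A missing or "Unknown" status counts as failing."""
    return [name for name, want in HEALTHY.items() if conditions.get(name) != want]


def unhealthy_nodes(nodes):
    """Map node name -> failing conditions, only for nodes that have any."""
    result = {}
    for name, conditions in nodes.items():
        failing = failing_conditions(conditions)
        if failing:
            result[name] = failing
    return result


nodes = {  # hypothetical nodes
    "node-a": dict(HEALTHY),
    "node-b": {**HEALTHY, "DiskPressure": "True"},
    "node-c": {**HEALTHY, "Ready": "Unknown"},
}
bad = unhealthy_nodes(nodes)
print(bad)           # {'node-b': ['DiskPressure'], 'node-c': ['Ready']}
print(len(bad) > 0)  # True: fire the alert, since we expect exactly 0
```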
What: Persistent Volume (PV) is a k8s resource representing a block of storage that can be attached to and detached from pods in the system. PV’s implementation is platform-specific.
For example, if your Kubernetes deployment is based on AWS, a PV would be represented by an EBS volume. Like any block storage, it has a capacity and might fill up over time.
Why: When a process uses a disk that has no free space, all hell breaks loose: the failure might manifest in a million different ways, and the stack traces do not always lead to the root cause. Apart from saving you from a future failure, watching this metric can also help you plan for workloads that accumulate data over time.
Prometheus is a great example of such a workload: as it writes data points to its time-series database, the amount of free space on the disk decreases. Since the rate at which Prometheus writes data is pretty consistent, it is easy to use the PV utilization metric to forecast when you will need to either delete old data or purchase more capacity for the disk.
How: The kubelet exposes both PV usage and capacity, so dividing usage by capacity gives you the PV utilization. It’s a bit hard to suggest a sane alert threshold as it really depends on the trajectory of the utilization graph, but as a rule of thumb, give yourself at least two to three weeks’ heads-up before you deplete your PV storage.
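Both the utilization and the "weeks of heads-up" rule can be sketched with simple arithmetic. The example assumes a constant growth rate, which the text suggests is reasonable for a workload like Prometheus but is still an assumption. In practice, the inputs would likely be the kubelet's `kubelet_volume_stats_used_bytes` and `kubelet_volume_stats_capacity_bytes` series. The numbers below are invented.

```python
def pv_utilization(used_bytes, capacity_bytes):
    """Fraction of the volume's capacity currently in use."""
    return used_bytes / capacity_bytes


def days_until_full(used_bytes, capacity_bytes, growth_bytes_per_day):
    """Linear forecast of the days left before the volume fills up."""
    if growth_bytes_per_day <= 0:
        return float("inf")  # not growing, so it never fills up
    return (capacity_bytes - used_bytes) / growth_bytes_per_day


GiB = 1024 ** 3
used, capacity = 70 * GiB, 100 * GiB  # hypothetical volume
growth = 2 * GiB                      # hypothetical daily growth

print(pv_utilization(used, capacity))                # 0.7
print(days_until_full(used, capacity, growth))       # 15.0
print(days_until_full(used, capacity, growth) < 21)  # True: under three weeks left
```

Alerting on the forecast rather than on a fixed utilization percentage keeps the lead time constant, regardless of how large the volume is or how fast it grows.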
As you’ve probably figured out by now, handling a Kubernetes cluster is not an easy task. There are tons of metrics available, and it requires a lot of expertise to pick the important ones.
A dashboard that monitors the key metrics for your cluster can serve both as a preventive measure to avoid issues in the first place and as a tool to troubleshoot issues in your system once they sneak in.
Note: This blog has also been published in Komodor’s Tech Blog.