Kubernetes Node Internals — Part 4: Keeping It Alive
Part 4 of a 5-part series: heartbeats, probes, cAdvisor, node pressure, evictions, OOM kills, and the survival logic that keeps a node functioning.
Starting Pods is only half the story.
Once workloads are running, the node has to keep answering a harder question:
Can I keep these Pods healthy without destroying myself?
That is the day-to-day operational loop of a Kubernetes node.
It sends heartbeats. It runs health probes. It monitors resource usage. It detects pressure. It evicts Pods when necessary. And sometimes, when memory runs out, the Linux kernel makes the brutal decision first and only tells Kubernetes afterward.
This is the part of the system that feels most like operations rather than deployment.
Series roadmap
- Part 1 — The anatomy of a node
- Part 2 — Bootstrap and the secret handshake
- Part 3 — A pod is born
- Part 4 — Keeping the node alive
- Part 5 — CSI, volumes, and mounts on the node
The heartbeat — how a node stays alive
A node does not stay healthy merely because its machine is powered on.
The control plane needs ongoing proof that the node is still responsive.
In modern Kubernetes, that proof usually comes in two related forms:
- Node Lease renewals
- NodeStatus updates
Node Lease object
The lightweight heartbeat path uses a Lease object in the coordination.k8s.io API group.
The kubelet renews this Lease frequently, commonly around every 10 seconds by default.
Why use a Lease?
Because a full Node object update is relatively heavy. A Lease is small, cheap, and perfect for the narrow question:
"Is this node still here?"
NodeStatus updates
The kubelet also publishes NodeStatus updates, which contain more detailed information such as:
- conditions
- capacity
- allocatable resources
- addresses
- runtime details
These updates are heavier than Lease renewals, so they are not the preferred mechanism for frequent heartbeat-only signaling.
What happens when the heartbeat stops
If the control plane stops hearing from a node, it does not instantly assume disaster. It waits for a configured grace window.
If the heartbeat still does not return:
- the node may be marked `NotReady` or `Unknown`
- controllers stop trusting it as healthy execution capacity
- Pods on that node may later be evicted or rescheduled depending on controller behavior and timeout settings
This is why a node failure can feel delayed from the user's perspective. Kubernetes tries not to overreact to very short network blips.
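That grace window is simple to model. In this sketch the default mirrors the node controller's `node-monitor-grace-period` (commonly around 40 seconds, though the default has shifted between versions):

```python
# Hedged default: node-monitor-grace-period has commonly been ~40s,
# but check your control plane version and flags.
DEFAULT_GRACE_SECONDS = 40

def node_ready_condition(seconds_since_heartbeat: float,
                         grace: float = DEFAULT_GRACE_SECONDS) -> str:
    """Sketch of the node controller's view: silence is tolerated up to the
    grace window; after that the Ready condition flips to Unknown."""
    if seconds_since_heartbeat <= grace:
        return "True"      # node looks healthy
    return "Unknown"       # heartbeat lost; stop trusting the node
```

This is why a brief network blip usually goes unnoticed, while sustained silence eventually flips the node's Ready condition.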
Heartbeat timeline — Lease vs NodeStatus cadence
The exact defaults can vary by version and configuration, but the mental model is stable: Lease is the lightweight pulse; NodeStatus is the richer health report.
Pod health checks
Once Pods are running, kubelet needs to know whether they are healthy enough to keep running or healthy enough to receive traffic.
That is where probes come in.
| Probe | What question it answers | What happens on failure |
|---|---|---|
| `livenessProbe` | "Should this container be restarted?" | kubelet restarts the container |
| `readinessProbe` | "Should this container receive traffic?" | Pod is removed from Service endpoints |
| `startupProbe` | "Is this slow-starting app still allowed time to boot?" | Other probes are delayed until startup succeeds |
livenessProbe
Liveness is about survival.
If the probe fails repeatedly, kubelet assumes the container is unhealthy in a way that restart might fix: deadlock, stuck event loop, hung process, or some other bad steady state.
readinessProbe
Readiness is about traffic eligibility.
A container may be alive but not yet ready:
- cache warmup not finished
- migrations still running
- downstream dependency missing
- application intentionally draining before shutdown
When readiness fails, Kubernetes does not necessarily kill the container. It simply stops routing Service traffic to it.
startupProbe
Startup probes exist for slow-starting applications.
Without them, a liveness probe can accidentally kill a container that is merely taking longer than usual to initialize.
How kubelet runs each probe
Kubelet can run probes in different ways:
- `exec`
- `httpGet`
- `tcpSocket`
- `grpc`
That means kubelet is not relying only on container exit codes. It actively tests the workload using the mechanism you configure.
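Regardless of mechanism, kubelet does not flip a container's health verdict on a single result; it counts consecutive outcomes against `failureThreshold` and `successThreshold`. A minimal sketch of that bookkeeping (simplified; field names match the probe spec, the loop structure is illustrative):

```python
def probe_verdict(results, failure_threshold: int = 3,
                  success_threshold: int = 1) -> str:
    """Replay a sequence of probe results (True = success) and return the
    final health verdict, changing state only after enough consecutive
    outcomes, the way kubelet's probe workers do."""
    consecutive_fail = consecutive_ok = 0
    verdict = "healthy"
    for ok in results:
        if ok:
            consecutive_ok += 1
            consecutive_fail = 0
            if consecutive_ok >= success_threshold:
                verdict = "healthy"
        else:
            consecutive_fail += 1
            consecutive_ok = 0
            if consecutive_fail >= failure_threshold:
                verdict = "unhealthy"
    return verdict
```

Two failures in a row with the default threshold of three changes nothing; the third one does. That hysteresis is what keeps flaky probes from causing restart storms.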
Resource monitoring
To make sane decisions, the node needs visibility into resource usage.
cAdvisor — embedded in kubelet
Kubelet includes cAdvisor, which reads cgroup and container stats from the node.
That includes metrics such as:
- CPU usage
- memory working set and usage
- filesystem usage
- network stats
At a conceptual level, cAdvisor is the node's resource observer.
What data flows to the Metrics API
Some of this data feeds higher-level Kubernetes features such as the Metrics API used by tools like kubectl top and autoscaling components.
The important mental model is not the exact pipeline details. It is that:
resource numbers shown by Kubernetes ultimately come from measurements on the node, largely grounded in cgroup statistics.
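To make that concrete: on cgroup v2, memory data lives in flat key/value files like `memory.stat`. A sketch of how such a payload becomes a working-set number (the "usage minus inactive file pages" approximation is how cAdvisor computes working set; the parsing code is illustrative):

```python
def parse_memory_stat(text: str) -> dict:
    """Parse a cgroup v2 memory.stat payload (one 'key value' pair per
    line) into integers, the raw material behind node memory metrics."""
    stats = {}
    for line in text.strip().splitlines():
        key, value = line.split()
        stats[key] = int(value)
    return stats

def working_set_bytes(usage_bytes: int, stats: dict) -> int:
    """Approximation used by cAdvisor: total usage minus inactive
    file-backed pages, which the kernel could reclaim without pressure."""
    return max(0, usage_bytes - stats.get("inactive_file", 0))
```

This is why "working set" in `kubectl top` is usually lower than raw memory usage: reclaimable page cache is subtracted out.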
Node pressure — when the node is in trouble
This is where the node stops acting like a polite host and starts acting like a survival system.
Kubelet watches for pressure signals such as:
- `MemoryPressure`
- `DiskPressure`
- `PIDPressure`
These are signs that the node may no longer safely run everything currently placed on it.
Soft vs hard eviction thresholds
Kubelet can be configured with soft and hard eviction thresholds.
- Soft threshold: "If pressure persists for some duration, begin evicting Pods."
- Hard threshold: "This is urgent. Evict immediately."
This gives Kubernetes a way to respond before the node becomes completely unusable.
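The soft/hard distinction reduces to one extra condition: soft thresholds require the pressure to persist for a grace period. A sketch (parameter names are illustrative, not kubelet config keys):

```python
def eviction_decision(available_bytes: int, hard_min: int, soft_min: int,
                      seconds_below_soft: float,
                      soft_grace_seconds: float) -> str:
    """Simplified shape of kubelet's threshold check: hard thresholds act
    immediately; soft thresholds only after sustained pressure."""
    if available_bytes < hard_min:
        return "evict-now"
    if available_bytes < soft_min and seconds_below_soft >= soft_grace_seconds:
        return "evict-soft"
    return "ok"
```

A transient dip below the soft threshold returns "ok"; only sustained pressure, or crossing the hard line, triggers eviction.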
Eviction ordering by QoS class
When kubelet has to evict Pods, it does not choose randomly.
Kubernetes uses QoS classes and other signals to decide who should lose first.
The broad ordering is:
- BestEffort
- Burstable
- Guaranteed
This matches the intuition that workloads with no explicit resource guarantees are the easiest to sacrifice when the node is under stress.
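As a first approximation, that ordering is just a sort key. (The real kubelet ranking also weighs pod priority and usage relative to requests; this sketch keeps only the QoS dimension.)

```python
# Lower rank = sacrificed sooner under node pressure.
QOS_EVICTION_RANK = {"BestEffort": 0, "Burstable": 1, "Guaranteed": 2}

def eviction_order(pods):
    """First-pass eviction ordering by QoS class only. The actual kubelet
    ranking additionally considers priority and usage above requests."""
    return sorted(pods, key=lambda p: QOS_EVICTION_RANK[p["qos"]])
```
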
Eviction decision tree
Node conditions vs taints
These two concepts are related, but not the same.
| Concept | What it describes | Example |
|---|---|---|
| Node condition | The node's observed health state | MemoryPressure=True |
| Taint | A scheduling signal applied to the node | NoSchedule, NoExecute |
A condition is an observation.
A taint is a policy signal that affects whether Pods may schedule onto or remain on the node.
That distinction matters a lot during incident debugging.
Eviction vs OOM kill vs restart loops
These three are often conflated during incidents, but they are different failure paths.
| Event type | Primary trigger | Primary actor | Typical signal you see |
|---|---|---|---|
| Eviction | Node-level pressure threshold crossed | kubelet | Pod evicted with pressure-related reason |
| OOM kill | cgroup memory boundary exceeded | Linux kernel | Container terminated as OOMKilled |
| Restart loop | app/probe keeps failing | kubelet + restart policy | repeated container restarts / CrashLoopBackOff |
The fastest way to debug is to first classify which of these three happened.
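That classification step can be written down as a tiny triage helper. `Evicted` and `OOMKilled` are real status reasons you will see in pod status; the restart-count cutoff here is an arbitrary illustrative heuristic:

```python
def classify_incident(pod_event_reason: str,
                      container_state_reason: str,
                      restart_count: int) -> str:
    """Rough triage mirroring the table above. The restart_count
    threshold is a made-up heuristic, not a Kubernetes constant."""
    if pod_event_reason == "Evicted":
        return "eviction"       # kubelet shed the pod under node pressure
    if container_state_reason == "OOMKilled":
        return "oom-kill"       # kernel enforced the cgroup memory limit
    if restart_count > 3:
        return "restart-loop"   # probe/app failures driving restarts
    return "unknown"
```
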
The OOM killer — when cgroups lose patience
One of the most misunderstood moments in Kubernetes happens when a container is killed for using too much memory.
Many developers think kubelet notices high memory and then kills the process.
Usually, that is not what happens.
cgroup limit hit → Linux OOM killer acts
When a container exceeds its effective memory limit, the Linux kernel is often the component that kills a process.
That is because memory enforcement is happening at the cgroup level.
So the sequence is more like:
- process consumes too much memory
- cgroup memory limit is breached
- kernel OOM logic selects a victim
- process is killed
- kubelet later observes the exit and updates status
oom_score_adj and how Kubernetes uses it
Linux uses a per-process value called `oom_score_adj` to bias which process the OOM killer selects first under memory pressure.
Kubernetes sets this value differently depending on Pod QoS and priority characteristics, so some workloads are made more killable than others.
That is another example of Kubernetes influencing behavior, while the final act still belongs to the kernel.
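The mapping can be sketched from kubelet's QoS policy. The constants below match the values kubelet has used (Guaranteed pods get -997, BestEffort gets the maximum 1000, Burstable is scaled by requested memory), but treat them as approximate and version-dependent:

```python
def oom_score_adj(qos: str, memory_request: int, capacity: int) -> int:
    """Approximation of kubelet's oom_score_adj policy. Lower scores
    make a process less likely to be chosen by the kernel OOM killer."""
    if qos == "Guaranteed":
        return -997   # nearly unkillable
    if qos == "BestEffort":
        return 1000   # first in line
    # Burstable: the more memory a pod requests (relative to the node),
    # the lower its score, so bigger guarantees mean less killable.
    score = 1000 - (1000 * memory_request) // capacity
    return min(max(score, 2), 999)
```
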
kubelet detects exit → marks container as OOMKilled
After the kernel kills the process, kubelet notices the container exited and reports the reason back up as something like OOMKilled.
So when you see OOMKilled in Kubernetes, treat it as an important architectural clue:
the kernel enforced the memory boundary; kubelet recorded the aftermath.
OOM kill sequence diagram
First five checks during a node incident
When things look messy, do not start with theory. Start with a compact, repeatable inspection order.
1. `kubectl describe pod` and `kubectl describe node` for events and conditions
2. kubelet logs on the affected node for probe, eviction, and runtime errors
3. `crictl ps` and `crictl inspect` for runtime-level container state
4. CNI and dataplane sanity checks (interfaces, routes, plugin health)
5. host memory/disk pressure signals and recent OOM/eviction evidence
This ordering usually tells you whether the issue is control-plane visibility, runtime execution, networking, or raw node resource pressure.
Dataplane note — kube-proxy and eBPF paths
This series used kube-proxy as the default teaching model because it remains broadly useful.
In modern clusters, some environments replace parts of kube-proxy behavior with eBPF-based datapaths (for example, Cilium-driven service routing).
The conceptual model still holds:
- Service traffic must be steered to real Pod endpoints
- endpoint updates must propagate into node-local forwarding behavior
- debugging still requires understanding both control-plane intent and node dataplane state
Only the implementation layer changes.
Garbage collection
Nodes also have to clean up after workload churn.
If they never did, disks would slowly fill with:
- dead containers
- unused writable layers
- old images
- orphaned artifacts
Dead containers and image layers
Kubelet works with the runtime to garbage-collect resources that are no longer needed.
This is not just hygiene. It is part of node survival.
If stale images and dead containers accumulate without cleanup, the node may eventually hit DiskPressure, which can trigger evictions or block new Pod scheduling.
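Image garbage collection is watermark-driven: when disk usage crosses a high threshold, kubelet deletes unused images until usage falls to a low threshold. The percentages below mirror the documented defaults (`imageGCHighThresholdPercent` 85, `imageGCLowThresholdPercent` 80), but the function itself is an illustrative sketch:

```python
def bytes_to_free(disk_used: int, disk_capacity: int,
                  high_threshold_pct: int = 85,
                  low_threshold_pct: int = 80) -> int:
    """Watermark sketch of kubelet image GC: do nothing below the high
    watermark; otherwise free enough to reach the low watermark."""
    usage_pct = 100 * disk_used / disk_capacity
    if usage_pct < high_threshold_pct:
        return 0
    target_used = disk_capacity * low_threshold_pct // 100
    return disk_used - target_used
```
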
Concepts introduced
Let's make the new vocabulary crisp.
QoS classes
Kubernetes groups Pods into three broad QoS classes:
| QoS class | Intuition |
|---|---|
| Guaranteed | Requests and limits are tightly defined and aligned |
| Burstable | Some guarantees exist, but usage may burst |
| BestEffort | No explicit resource requests or limits |
These classes influence eviction and OOM behavior.
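The classification rules the table summarizes can be expressed directly. This is a simplified sketch (the container dict shape is illustrative, and real Kubernetes also defaults missing requests from limits):

```python
def qos_class(containers) -> str:
    """Derive a Pod's QoS class from per-container resources.
    Guaranteed: every container sets cpu+memory requests equal to limits.
    BestEffort: no container sets any requests or limits.
    Everything else: Burstable."""
    if all(not c.get("requests") and not c.get("limits") for c in containers):
        return "BestEffort"
    guaranteed = all(
        c.get("requests") and c.get("limits")
        and c["requests"] == c["limits"]
        and set(c["limits"]) == {"cpu", "memory"}
        for c in containers
    )
    return "Guaranteed" if guaranteed else "Burstable"
```
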
cAdvisor and the /metrics endpoint
cAdvisor is the node-local stats collector embedded in kubelet.
It is one of the reasons kubelet can expose metrics and resource observations that the rest of the Kubernetes ecosystem consumes.
oom_score_adj
This is Linux's hint for OOM victim selection priority.
Lower values generally make a process less likely to be killed; higher values make it more likely.
Kubernetes uses this hint to express workload importance under memory pressure.
Node conditions and taints
Remember the split:
- condition = observed node state
- taint = scheduling effect
This simple distinction removes a surprising amount of confusion during production incidents.
Final mental model
A node does not just run Pods.
It constantly balances four responsibilities:
- prove to the control plane that it is alive
- check whether workloads are healthy
- observe and protect finite machine resources
- shed load or sacrifice Pods when the machine is in danger
That is why Kubernetes operations feel different once you understand node internals. You stop seeing restarts, evictions, and OOMs as random events and start seeing them as the visible outcomes of a survival loop.
"A node doesn't just run pods — it fights to keep them running, and fights to protect itself when it can't. Understanding this loop is what separates someone who uses Kubernetes from someone who operates it."
By this point in the series, you have the main execution and survival loops of a Kubernetes node:
- what components live there
- how it earns trust
- how a Pod starts
- how the node stays alive under pressure
That mental model is the difference between using Kubernetes as a black box and being able to reason through what the box is doing.
But there is still one important node-local path left: how storage gets from a Pod spec to an actual mount on the machine.
That is the topic of Part 5.
If you want to extend this mental model further after that, the next natural deep dives are:
- the full Pod termination and graceful shutdown path
- runtime hardening controls at the kernel boundary