

Kubernetes Node Internals — Part 4: Keeping It Alive

Part 4 of a 5-part series: heartbeats, probes, cAdvisor, node pressure, evictions, OOM kills, and the survival logic that keeps a node functioning.

March 30, 2026 · #tech #kubernetes #linux #containers


Starting Pods is only half the story.

Once workloads are running, the node has to keep answering a harder question:

Can I keep these Pods healthy without destroying myself?

That is the day-to-day operational loop of a Kubernetes node.

It sends heartbeats. It runs health probes. It monitors resource usage. It detects pressure. It evicts Pods when necessary. And sometimes, when memory runs out, the Linux kernel makes the brutal decision first and only tells Kubernetes afterward.

This is the part of the system that feels most like operations rather than deployment.

Series roadmap

  1. Part 1 — The anatomy of a node
  2. Part 2 — Bootstrap and the secret handshake
  3. Part 3 — A pod is born
  4. Part 4 — Keeping the node alive
  5. Part 5 — CSI, volumes, and mounts on the node

The heartbeat — how a node stays alive

A node does not stay healthy merely because its machine is powered on.

The control plane needs ongoing proof that the node is still responsive.

In modern Kubernetes, that proof usually comes in two related forms:

  • Node Lease renewals
  • NodeStatus updates

Node Lease object

The lightweight heartbeat path uses a Lease object in the coordination.k8s.io API group.

The kubelet renews this Lease frequently, commonly around every 10 seconds by default.

Why use a Lease?

Because a full Node object update is relatively heavy. A Lease is small, cheap, and perfect for the narrow question:

"Is this node still here?"
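To make this concrete, here is a sketch of what a node's Lease object looks like. The node name and timestamp are illustrative; the fields are the real ones from the coordination.k8s.io/v1 API:

```yaml
# Illustrative shape of a node's heartbeat Lease (names/timestamps are examples)
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: worker-1              # matches the Node name
  namespace: kube-node-lease  # dedicated namespace for node Leases
spec:
  holderIdentity: worker-1
  leaseDurationSeconds: 40    # how long the heartbeat stays valid
  renewTime: "2026-03-30T10:00:00.000000Z"  # bumped on each renewal
```

The kubelet's only frequent write here is bumping `renewTime`, which is why this path is so much cheaper than a full NodeStatus update.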

NodeStatus updates

The kubelet also publishes NodeStatus updates, which contain more detailed information such as:

  • conditions
  • capacity
  • allocatable resources
  • addresses
  • runtime details

These updates are heavier than Lease renewals, so they are not the preferred mechanism for frequent heartbeat-only signaling.

What happens when the heartbeat stops

If the control plane stops hearing from a node, it does not instantly assume disaster. It waits for a configured grace window.

If the heartbeat still does not return:

  1. the node may be marked NotReady or Unknown
  2. controllers stop trusting it as healthy execution capacity
  3. Pods on that node may later be evicted or rescheduled depending on controller behavior and timeout settings

This is why a node failure can feel delayed from the user's perspective. Kubernetes tries not to overreact to very short network blips.

Heartbeat timeline — Lease vs NodeStatus cadence


The exact defaults can vary by version and configuration, but the mental model is stable: Lease is the lightweight pulse; NodeStatus is the richer health report.

Pod health checks

Once Pods are running, kubelet needs to know whether they are healthy enough to keep running or healthy enough to receive traffic.

That is where probes come in.

Probe          | What question it answers                               | What happens on failure
livenessProbe  | "Should this container be restarted?"                  | kubelet restarts the container
readinessProbe | "Should this container receive traffic?"               | Pod is removed from Service endpoints
startupProbe   | "Is this slow-starting app still allowed time to boot?" | other probes are delayed until startup succeeds

livenessProbe

Liveness is about survival.

If the probe fails repeatedly, kubelet assumes the container is unhealthy in a way that restart might fix: deadlock, stuck event loop, hung process, or some other bad steady state.

readinessProbe

Readiness is about traffic eligibility.

A container may be alive but not yet ready:

  • cache warmup not finished
  • migrations still running
  • downstream dependency missing
  • application intentionally draining before shutdown

When readiness fails, Kubernetes does not necessarily kill the container. It simply stops routing Service traffic to it.

startupProbe

Startup probes exist for slow-starting applications.

Without them, a liveness probe can accidentally kill a container that is merely taking longer than usual to initialize.
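The three probes above can coexist in one container spec. A minimal sketch (Pod name, image, and paths are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo                  # hypothetical name
spec:
  containers:
  - name: app
    image: example.com/app:latest   # placeholder image
    ports:
    - containerPort: 8080
    startupProbe:                   # gates the other probes until it succeeds
      httpGet:
        path: /healthz
        port: 8080
      failureThreshold: 30          # up to 30 x 10s = 5 min to finish booting
      periodSeconds: 10
    livenessProbe:                  # repeated failure => container restart
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
    readinessProbe:                 # failure => removed from Service endpoints
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 5
```

Note that liveness and readiness can probe different endpoints: `/healthz` asks "am I stuck?", while `/ready` can also report on caches and dependencies.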

How kubelet runs each probe

Kubelet can run probes in different ways:

  • exec
  • httpGet
  • tcpSocket
  • grpc

That means kubelet is not relying only on container exit codes. It actively tests the workload using the mechanism you configure.

Resource monitoring

To make sane decisions, the node needs visibility into resource usage.

cAdvisor — embedded in kubelet

Kubelet includes cAdvisor, which reads cgroup and container stats from the node.

That includes metrics such as:

  • CPU usage
  • memory working set and usage
  • filesystem usage
  • network stats

At a conceptual level, cAdvisor is the node's resource observer.

What data flows to the Metrics API

Some of this data feeds higher-level Kubernetes features such as the Metrics API used by tools like kubectl top and autoscaling components.

The important mental model is not the exact pipeline details. It is that:

resource numbers shown by Kubernetes ultimately come from measurements on the node, largely grounded in cgroup statistics.

Node pressure — when the node is in trouble

This is where the node stops acting like a polite host and starts acting like a survival system.

Kubelet watches for pressure signals such as:

  • MemoryPressure
  • DiskPressure
  • PIDPressure

These are signs that the node may no longer safely run everything currently placed on it.
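These pressure signals surface as conditions on the Node object. An illustrative excerpt of `kubectl get node <name> -o yaml` status (values and reasons are examples of the typical shapes):

```yaml
status:
  conditions:
  - type: MemoryPressure
    status: "True"
    reason: KubeletHasInsufficientMemory
  - type: DiskPressure
    status: "False"
    reason: KubeletHasNoDiskPressure
  - type: PIDPressure
    status: "False"
    reason: KubeletHasSufficientPID
  - type: Ready
    status: "True"
    reason: KubeletReady
```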

Soft vs hard eviction thresholds

Kubelet can be configured with soft and hard eviction thresholds.

  • Soft threshold: "If pressure persists for some duration, begin evicting Pods."
  • Hard threshold: "This is urgent. Evict immediately."

This gives Kubernetes a way to respond before the node becomes completely unusable.
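Both threshold kinds live in the kubelet's configuration file. A sketch using KubeletConfiguration (the values are illustrative, not recommendations):

```yaml
# Sketch of kubelet eviction settings; tune values for your node sizes
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:                 # breach => evict immediately
  memory.available: "200Mi"
  nodefs.available: "10%"
evictionSoft:                 # breach must persist for the grace period below
  memory.available: "500Mi"
evictionSoftGracePeriod:
  memory.available: "1m30s"
```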

Eviction ordering by QoS class

When kubelet has to evict Pods, it does not choose randomly.

Kubernetes uses QoS classes and other signals to decide who should lose first.

The broad ordering is:

  1. BestEffort
  2. Burstable
  3. Guaranteed

This matches the intuition that workloads with no explicit resource guarantees are the easiest to sacrifice when the node is under stress.

Eviction decision tree


Node conditions vs taints

These two concepts are related, but not the same.

Concept        | What it describes                        | Example
Node condition | the node's observed health state         | MemoryPressure=True
Taint          | a scheduling signal applied to the node  | NoSchedule, NoExecute

A condition is an observation.

A taint is a policy signal that affects whether Pods may schedule onto or remain on the node.

That distinction matters a lot during incident debugging.
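The two are linked in practice: when a condition like MemoryPressure goes True, the node controller adds a corresponding built-in taint to the Node object. A sketch of how that appears on the node:

```yaml
# Illustrative excerpt of a Node spec after MemoryPressure=True
spec:
  taints:
  - key: node.kubernetes.io/memory-pressure
    effect: NoSchedule        # new Pods without a matching toleration stay away
```

During an incident, check both: the condition tells you what the kubelet observed; the taint tells you what the scheduler is being told to do about it.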

Eviction vs OOM kill vs restart loops

These three are often conflated during incidents, but they are different failure paths.

Event type   | Primary trigger                        | Primary actor             | Typical signal you see
Eviction     | node-level pressure threshold crossed  | kubelet                   | Pod evicted with pressure-related reason
OOM kill     | cgroup memory boundary exceeded        | Linux kernel              | container terminated as OOMKilled
Restart loop | app/probe keeps failing                | kubelet + restart policy  | repeated container restarts / CrashLoopBackOff

The fastest way to debug is to first classify which of these three happened.

The OOM killer — when cgroups lose patience

One of the most misunderstood moments in Kubernetes happens when a container is killed for using too much memory.

Many developers think kubelet notices high memory and then kills the process.

Usually, that is not what happens.

cgroup limit hit → Linux OOM killer acts

When a container exceeds its effective memory limit, the Linux kernel is often the component that kills a process.

That is because memory enforcement is happening at the cgroup level.

So the sequence is more like:

  1. process consumes too much memory
  2. cgroup memory limit is breached
  3. kernel OOM logic selects a victim
  4. process is killed
  5. kubelet later observes the exit and updates status
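The boundary in step 2 comes from the Pod spec. A minimal sketch (Pod name and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: oom-demo                    # hypothetical name
spec:
  containers:
  - name: app
    image: example.com/app:latest   # placeholder image
    resources:
      requests:
        memory: "256Mi"
      limits:
        memory: "256Mi"             # becomes the cgroup memory limit;
                                    # exceeding it invokes the kernel OOM path
```

The `limits.memory` value is translated into a cgroup memory limit on the node, which is exactly the boundary the kernel enforces in the sequence above.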

oom_score_adj and how Kubernetes uses it

Linux uses a per-process value called oom_score_adj (visible at /proc/&lt;pid&gt;/oom_score_adj) to bias which process should be killed first under OOM conditions.

Kubernetes sets this value differently depending on Pod QoS and priority characteristics, so some workloads are made more killable than others.

That is another example of Kubernetes influencing behavior, while the final act still belongs to the kernel.

kubelet detects exit → marks container as OOMKilled

After the kernel kills the process, kubelet notices the container exited and reports the reason back up as something like OOMKilled.

So when you see OOMKilled in Kubernetes, treat it as an important architectural clue:

the kernel enforced the memory boundary; kubelet recorded the aftermath.

OOM kill sequence diagram


First five checks during a node incident

When things look messy, do not start with theory. Start with a compact, repeatable inspection order.

  1. kubectl describe pod and kubectl describe node for events and conditions
  2. kubelet logs on the affected node for probe, eviction, and runtime errors
  3. crictl ps and crictl inspect for runtime-level container state
  4. CNI and dataplane sanity checks (interfaces, routes, plugin health)
  5. host memory/disk pressure signals and recent OOM/eviction evidence

This ordering usually tells you whether the issue is control-plane visibility, runtime execution, networking, or raw node resource pressure.

Dataplane note — kube-proxy and eBPF paths

This series used kube-proxy as the default teaching model because it remains broadly useful.

In modern clusters, some environments replace parts of kube-proxy behavior with eBPF-based datapaths (for example, Cilium-driven service routing).

The conceptual model still holds:

  • Service traffic must be steered to real Pod endpoints
  • endpoint updates must propagate into node-local forwarding behavior
  • debugging still requires understanding both control-plane intent and node dataplane state

Only the implementation layer changes.

Garbage collection

Nodes also have to clean up after workload churn.

If they never did, disks would slowly fill with:

  • dead containers
  • unused writable layers
  • old images
  • orphaned artifacts

Dead containers and image layers

Kubelet works with the runtime to garbage-collect resources that are no longer needed.

This is not just hygiene. It is part of node survival.

If stale images and dead containers accumulate without cleanup, the node may eventually hit DiskPressure, which can trigger evictions or block new Pod scheduling.
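Image garbage collection is also driven by kubelet configuration. An illustrative sketch of the relevant knobs (values are the common defaults, not recommendations):

```yaml
# Illustrative kubelet image GC settings
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
imageGCHighThresholdPercent: 85   # start GC when disk usage exceeds this
imageGCLowThresholdPercent: 80    # free images until usage drops below this
imageMinimumGCAge: "2m"           # never collect images younger than this
```

The point of the two thresholds is hysteresis: GC kicks in above the high mark and works down to the low mark, rather than thrashing at a single boundary.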

Concepts introduced

Let's make the new vocabulary crisp.

QoS classes

Kubernetes groups Pods into three broad QoS classes:

QoS class  | Intuition
Guaranteed | requests and limits are set and equal for every container
Burstable  | at least one request or limit is set, but not all are equal
BestEffort | no explicit resource requests or limits at all

These classes influence eviction and OOM behavior.
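A minimal sketch of what each class looks like in a container's resources block (values are illustrative):

```yaml
# Guaranteed: requests == limits for every resource in every container
resources:
  requests: {cpu: "500m", memory: "256Mi"}
  limits:   {cpu: "500m", memory: "256Mi"}
---
# Burstable: some requests set, limits higher or absent
resources:
  requests: {memory: "128Mi"}
---
# BestEffort: the resources block is simply omitted
```

You can check the class Kubernetes assigned with `kubectl get pod <name> -o jsonpath='{.status.qosClass}'`.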

cAdvisor and the /metrics endpoint

cAdvisor is the node-local stats collector embedded in kubelet.

It is one of the reasons kubelet can expose metrics and resource observations that the rest of the Kubernetes ecosystem consumes.

oom_score_adj

This is Linux's hint for OOM victim selection priority.

Lower values generally make a process less likely to be killed; higher values make it more likely.

Kubernetes uses this hint to express workload importance under memory pressure.

Node conditions and taints

Remember the split:

  • condition = observed node state
  • taint = scheduling effect

This simple distinction removes a surprising amount of confusion during production incidents.

Final mental model

A node does not just run Pods.

It constantly balances four responsibilities:

  1. prove to the control plane that it is alive
  2. check whether workloads are healthy
  3. observe and protect finite machine resources
  4. shed load or sacrifice Pods when the machine is in danger

That is why operating Kubernetes feels different once you understand node internals. You stop seeing restarts, evictions, and OOMs as random events and start seeing them as the visible outcomes of a survival loop.

"A node doesn't just run pods — it fights to keep them running, and fights to protect itself when it can't. Understanding this loop is what separates someone who uses Kubernetes from someone who operates it."

By this point in the series, you have the main execution and survival loops of a Kubernetes node:

  • what components live there
  • how it earns trust
  • how a Pod starts
  • how the node stays alive under pressure

That mental model is the difference between using Kubernetes as a black box and being able to reason through what the box is doing.

But there is still one important node-local path left: how storage gets from a Pod spec to an actual mount on the machine.

That is the topic of Part 5.

If you want to extend this mental model further after that, the next natural deep dives are:

  • the full Pod termination and graceful shutdown path
  • runtime hardening controls at the kernel boundary

Next: Part 5 — CSI, volumes, and mounts on the node