

Kubernetes Node Internals — Part 4: Keeping It Alive

Part 4 of a 5-part series: heartbeats, probes, cAdvisor, node pressure, evictions, OOM kills, and the survival logic that keeps a node functioning.

March 30, 2026 · #tech #kubernetes #linux #containers


Starting Pods is only half the story.

Once workloads are running, the node has to keep answering a harder question:

Can I keep these Pods healthy without destroying myself?

That is the day-to-day operational loop of a Kubernetes node.

It sends heartbeats. It runs health probes. It monitors resource usage. It detects pressure. It evicts Pods when necessary. And sometimes, when memory runs out, the Linux kernel makes the brutal decision first and only tells Kubernetes afterward.

This is the part of the system that feels most like operations rather than deployment.

Series roadmap

  1. Part 1 — The anatomy of a node
  2. Part 2 — Bootstrap and the secret handshake
  3. Part 3 — A pod is born
  4. Part 4 — Keeping the node alive
  5. Part 5 — CSI, volumes, and mounts on the node

The heartbeat — how a node stays alive

A node does not stay healthy merely because its machine is powered on.

The control plane needs ongoing proof that the node is still responsive.

In modern Kubernetes, that proof usually comes in two related forms:

  • Node Lease renewals
  • NodeStatus updates

Node Lease object

The lightweight heartbeat path uses a Lease object in the coordination.k8s.io API group.

The kubelet renews this Lease frequently, commonly around every 10 seconds by default.

Why use a Lease?

Because a full Node object update is relatively heavy. A Lease is small, cheap, and perfect for the narrow question:

"Is this node still here?"
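To make this concrete, here is a sketch of what a node's Lease object looks like. The node name and timestamp are illustrative; the fields are the real ones from the coordination.k8s.io/v1 API:

```yaml
# Illustrative shape of a node's heartbeat Lease (names/timestamps are examples)
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: worker-1              # matches the Node name
  namespace: kube-node-lease  # dedicated namespace for node Leases
spec:
  holderIdentity: worker-1
  leaseDurationSeconds: 40    # how long the heartbeat stays valid
  renewTime: "2026-03-30T10:00:00.000000Z"  # bumped on each renewal
```

The kubelet's only frequent write here is bumping `renewTime`, which is why this path is so much cheaper than a full NodeStatus update.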

NodeStatus updates

The kubelet also publishes NodeStatus updates, which contain more detailed information such as:

  • conditions
  • capacity
  • allocatable resources
  • addresses
  • runtime details

These updates are heavier than Lease renewals, so they are not the preferred mechanism for frequent heartbeat-only signaling.

What happens when the heartbeat stops

If the control plane stops hearing from a node, it does not instantly assume disaster. It waits for a configured grace window.

If the heartbeat still does not return:

  1. the node may be marked NotReady or Unknown
  2. controllers stop trusting it as healthy execution capacity
  3. Pods on that node may later be evicted or rescheduled depending on controller behavior and timeout settings

This is why a node failure can feel delayed from the user's perspective. Kubernetes tries not to overreact to very short network blips.

Heartbeat timeline — Lease vs NodeStatus cadence


The exact defaults can vary by version and configuration, but the mental model is stable: Lease is the lightweight pulse; NodeStatus is the richer health report.

Pod health checks

Once Pods are running, kubelet needs to know whether they are healthy enough to keep running or healthy enough to receive traffic.

That is where probes come in.

Probe          | What question it answers                               | What happens on failure
livenessProbe  | "Should this container be restarted?"                  | kubelet restarts the container
readinessProbe | "Should this container receive traffic?"               | Pod is removed from Service endpoints
startupProbe   | "Is this slow-starting app still allowed time to boot?" | other probes are delayed until startup succeeds

livenessProbe

Liveness is about survival.

If the probe fails repeatedly, kubelet assumes the container is unhealthy in a way that restart might fix: deadlock, stuck event loop, hung process, or some other bad steady state.

readinessProbe

Readiness is about traffic eligibility.

A container may be alive but not yet ready:

  • cache warmup not finished
  • migrations still running
  • downstream dependency missing
  • application intentionally draining before shutdown

When readiness fails, Kubernetes does not necessarily kill the container. It simply stops routing Service traffic to it.

startupProbe

Startup probes exist for slow-starting applications.

Without them, a liveness probe can accidentally kill a container that is merely taking longer than usual to initialize.
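The three probes above can coexist in one container spec. A minimal sketch (Pod name, image, and paths are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo                  # hypothetical name
spec:
  containers:
  - name: app
    image: example.com/app:latest   # placeholder image
    ports:
    - containerPort: 8080
    startupProbe:                   # gates the other probes until it succeeds
      httpGet:
        path: /healthz
        port: 8080
      failureThreshold: 30          # up to 30 x 10s = 5 min to finish booting
      periodSeconds: 10
    livenessProbe:                  # repeated failure => container restart
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
    readinessProbe:                 # failure => removed from Service endpoints
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 5
```

Note that liveness and readiness can probe different endpoints: `/healthz` asks "am I stuck?", while `/ready` can also report on caches and dependencies.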

How kubelet runs each probe

Kubelet can run probes in different ways:

  • exec
  • httpGet
  • tcpSocket
  • grpc

That means kubelet is not relying only on container exit codes. It actively tests the workload using the mechanism you configure.

Resource monitoring

To make sane decisions, the node needs visibility into resource usage.

cAdvisor — embedded in kubelet

Kubelet includes cAdvisor, which reads cgroup and container stats from the node.

That includes metrics such as:

  • CPU usage
  • memory working set and usage
  • filesystem usage
  • network stats

At a conceptual level, cAdvisor is the node's resource observer.

What data flows to the Metrics API

Some of this data feeds higher-level Kubernetes features such as the Metrics API used by tools like kubectl top and autoscaling components.

The important mental model is not the exact pipeline details. It is that:

resource numbers shown by Kubernetes ultimately come from measurements on the node, largely grounded in cgroup statistics.

Node pressure — when the node is in trouble

This is where the node stops acting like a polite host and starts acting like a survival system.

Kubelet watches for pressure signals such as:

  • MemoryPressure
  • DiskPressure
  • PIDPressure

These are signs that the node may no longer safely run everything currently placed on it.
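These pressure signals surface as conditions on the Node object. An illustrative excerpt of `kubectl get node <name> -o yaml` status (values and reasons are examples of the typical shapes):

```yaml
status:
  conditions:
  - type: MemoryPressure
    status: "True"
    reason: KubeletHasInsufficientMemory
  - type: DiskPressure
    status: "False"
    reason: KubeletHasNoDiskPressure
  - type: PIDPressure
    status: "False"
    reason: KubeletHasSufficientPID
  - type: Ready
    status: "True"
    reason: KubeletReady
```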

Soft vs hard eviction thresholds

Kubelet can be configured with soft and hard eviction thresholds.

  • Soft threshold: "If pressure persists for some duration, begin evicting Pods."
  • Hard threshold: "This is urgent. Evict immediately."

This gives Kubernetes a way to respond before the node becomes completely unusable.
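Both threshold kinds live in the kubelet's configuration file. A sketch using KubeletConfiguration (the values are illustrative, not recommendations):

```yaml
# Sketch of kubelet eviction settings; tune values for your node sizes
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:                 # breach => evict immediately
  memory.available: "200Mi"
  nodefs.available: "10%"
evictionSoft:                 # breach must persist for the grace period below
  memory.available: "500Mi"
evictionSoftGracePeriod:
  memory.available: "1m30s"
```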

Eviction ordering by QoS class

When kubelet has to evict Pods, it does not choose randomly.

Kubernetes uses QoS classes and other signals to decide who should lose first.

The broad ordering is:

  1. BestEffort
  2. Burstable
  3. Guaranteed

This matches the intuition that workloads with no explicit resource guarantees are the easiest to sacrifice when the node is under stress.

Eviction decision tree


Node conditions vs taints

These two concepts are related, but not the same.

Concept        | What it describes                        | Example
Node condition | the node's observed health state         | MemoryPressure=True
Taint          | a scheduling signal applied to the node  | NoSchedule, NoExecute

A condition is an observation.

A taint is a policy signal that affects whether Pods may schedule onto or remain on the node.

That distinction matters a lot during incident debugging.
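The two are linked in practice: when a condition like MemoryPressure goes True, the node controller adds a corresponding built-in taint to the Node object. A sketch of how that appears on the node:

```yaml
# Illustrative excerpt of a Node spec after MemoryPressure=True
spec:
  taints:
  - key: node.kubernetes.io/memory-pressure
    effect: NoSchedule        # new Pods without a matching toleration stay away
```

During an incident, check both: the condition tells you what the kubelet observed; the taint tells you what the scheduler is being told to do about it.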

Eviction vs OOM kill vs restart loops

These three are often conflated during incidents, but they are different failure paths.

Event type   | Primary trigger                        | Primary actor             | Typical signal you see
Eviction     | node-level pressure threshold crossed  | kubelet                   | Pod evicted with pressure-related reason
OOM kill     | cgroup memory boundary exceeded        | Linux kernel              | container terminated as OOMKilled
Restart loop | app/probe keeps failing                | kubelet + restart policy  | repeated container restarts / CrashLoopBackOff

The fastest way to debug is to first classify which of these three happened.

The OOM killer — when cgroups lose patience

One of the most misunderstood moments in Kubernetes happens when a container is killed for using too much memory.

Many developers think kubelet notices high memory and then kills the process.

Usually, that is not what happens.

cgroup limit hit → Linux OOM killer acts

When a container exceeds its effective memory limit, the Linux kernel is often the component that kills a process.

That is because memory enforcement is happening at the cgroup level.

So the sequence is more like:

  1. process consumes too much memory
  2. cgroup memory limit is breached
  3. kernel OOM logic selects a victim
  4. process is killed
  5. kubelet later observes the exit and updates status
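The boundary in step 2 comes from the Pod spec. A minimal sketch (Pod name and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: oom-demo                    # hypothetical name
spec:
  containers:
  - name: app
    image: example.com/app:latest   # placeholder image
    resources:
      requests:
        memory: "256Mi"
      limits:
        memory: "256Mi"             # becomes the cgroup memory limit;
                                    # exceeding it invokes the kernel OOM path
```

The `limits.memory` value is translated into a cgroup memory limit on the node, which is exactly the boundary the kernel enforces in the sequence above.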

oom_score_adj and how Kubernetes uses it

Linux uses a per-process value called oom_score_adj (visible at /proc/&lt;pid&gt;/oom_score_adj) to bias which process should be killed first under OOM conditions.

Kubernetes sets this value differently depending on Pod QoS and priority characteristics, so some workloads are made more killable than others.

That is another example of Kubernetes influencing behavior, while the final act still belongs to the kernel.

kubelet detects exit → marks container as OOMKilled

After the kernel kills the process, kubelet notices the container exited and reports the reason back up as something like OOMKilled.

So when you see OOMKilled in Kubernetes, treat it as an important architectural clue:

the kernel enforced the memory boundary; kubelet recorded the aftermath.

OOM kill sequence diagram


First five checks during a node incident

When things look messy, do not start with theory. Start with a compact, repeatable inspection order.

  1. kubectl describe pod and kubectl describe node for events and conditions
  2. kubelet logs on the affected node for probe, eviction, and runtime errors
  3. crictl ps and crictl inspect for runtime-level container state
  4. CNI and dataplane sanity checks (interfaces, routes, plugin health)
  5. host memory/disk pressure signals and recent OOM/eviction evidence

This ordering usually tells you whether the issue is control-plane visibility, runtime execution, networking, or raw node resource pressure.

Dataplane note — kube-proxy and eBPF paths

This series used kube-proxy as the default teaching model because it remains broadly useful.

In modern clusters, some environments replace parts of kube-proxy behavior with eBPF-based datapaths (for example, Cilium-driven service routing).

The conceptual model still holds:

  • Service traffic must be steered to real Pod endpoints
  • endpoint updates must propagate into node-local forwarding behavior
  • debugging still requires understanding both control-plane intent and node dataplane state

Only the implementation layer changes.

Garbage collection

Nodes also have to clean up after workload churn.

If they never did, disks would slowly fill with:

  • dead containers
  • unused writable layers
  • old images
  • orphaned artifacts

Dead containers and image layers

Kubelet works with the runtime to garbage-collect resources that are no longer needed.

This is not just hygiene. It is part of node survival.

If stale images and dead containers accumulate without cleanup, the node may eventually hit DiskPressure, which can trigger evictions or block new Pod scheduling.
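Image garbage collection is also driven by kubelet configuration. An illustrative sketch of the relevant knobs (values are the common defaults, not recommendations):

```yaml
# Illustrative kubelet image GC settings
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
imageGCHighThresholdPercent: 85   # start GC when disk usage exceeds this
imageGCLowThresholdPercent: 80    # free images until usage drops below this
imageMinimumGCAge: "2m"           # never collect images younger than this
```

The point of the two thresholds is hysteresis: GC kicks in above the high mark and works down to the low mark, rather than thrashing at a single boundary.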

Concepts introduced

Let's make the new vocabulary crisp.

QoS classes

Kubernetes groups Pods into three broad QoS classes:

QoS class  | Intuition
Guaranteed | requests and limits are set and equal for every container
Burstable  | at least one request or limit is set, but not all are equal
BestEffort | no explicit resource requests or limits at all

These classes influence eviction and OOM behavior.
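A minimal sketch of what each class looks like in a container's resources block (values are illustrative):

```yaml
# Guaranteed: requests == limits for every resource in every container
resources:
  requests: {cpu: "500m", memory: "256Mi"}
  limits:   {cpu: "500m", memory: "256Mi"}
---
# Burstable: some requests set, limits higher or absent
resources:
  requests: {memory: "128Mi"}
---
# BestEffort: the resources block is simply omitted
```

You can check the class Kubernetes assigned with `kubectl get pod <name> -o jsonpath='{.status.qosClass}'`.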

cAdvisor and the /metrics endpoint

cAdvisor is the node-local stats collector embedded in kubelet.

It is one of the reasons kubelet can expose metrics and resource observations that the rest of the Kubernetes ecosystem consumes.

oom_score_adj

This is Linux's hint for OOM victim selection priority.

Lower values generally make a process less likely to be killed; higher values make it more likely.

Kubernetes uses this hint to express workload importance under memory pressure.

Node conditions and taints

Remember the split:

  • condition = observed node state
  • taint = scheduling effect

This simple distinction removes a surprising amount of confusion during production incidents.

Final mental model

A node does not just run Pods.

It constantly balances four responsibilities:

  1. prove to the control plane that it is alive
  2. check whether workloads are healthy
  3. observe and protect finite machine resources
  4. shed load or sacrifice Pods when the machine is in danger

That is why operating Kubernetes feels different once you understand node internals. You stop seeing restarts, evictions, and OOMs as random events and start seeing them as the visible outcomes of a survival loop.

"A node doesn't just run pods — it fights to keep them running, and fights to protect itself when it can't. Understanding this loop is what separates someone who uses Kubernetes from someone who operates it."

By this point in the series, you have the main execution and survival loops of a Kubernetes node:

  • what components live there
  • how it earns trust
  • how a Pod starts
  • how the node stays alive under pressure

That mental model is the difference between using Kubernetes as a black box and being able to reason through what the box is doing.

But there is still one important node-local path left: how storage gets from a Pod spec to an actual mount on the machine.

That is the topic of Part 5.

If you want to extend this mental model further after that, the next natural deep dives are:

  • the full Pod termination and graceful shutdown path
  • runtime hardening controls at the kernel boundary

Next: Part 5 — CSI, volumes, and mounts on the node