

Kubernetes Node Internals — Part 3: Pod Lifecycle

Part 3 of a 5-part series: the full birth sequence of a Pod, from scheduler decision and pause container to CNI wiring, overlayfs, and a running process.

March 26, 2026 · #tech #kubernetes #linux #containers


"You ran kubectl apply -f pod.yaml. Somewhere around 300ms later, a Linux process is running on a node. Here is every single thing that happened in between."

This is the part of Kubernetes most people use every day, but few people trace end to end.

You declare a Pod. The scheduler picks a node. Then, very quickly, something on that machine turns YAML into:

  • a network namespace
  • a sandbox
  • one or more cgroups
  • a writable container filesystem
  • one or more Linux processes

That sequence is the birth of a Pod.

And once you see the sequence clearly, container startup stops feeling mysterious.

Series roadmap

  1. Part 1 — The anatomy of a node
  2. Part 2 — Bootstrap and the secret handshake
  3. Part 3 — A pod is born
  4. Part 4 — Keeping the node alive
  5. Part 5 — CSI, volumes, and mounts on the node

How kubelet watches for new Pods

By the time this story starts, the scheduler has already chosen a node.

That means the Pod now has a .spec.nodeName, and the kubelet running on that node is responsible for it.

The kubelet continuously watches the API server for changes relevant to that node.

The watch loop on the API server

Conceptually, kubelet maintains a watch-driven loop that keeps asking:

"Which Pods assigned to me should exist right now?"

When a new Pod appears for the node, kubelet does not instantly do everything in one giant blocking function. Instead, it turns the desired work into internal tasks and reconciliation steps.

Pod spec lands in kubelet's work queue

Once kubelet notices the new Pod, it adds that Pod to its internal work pipeline.

From there, kubelet drives a sequence roughly like this:

  1. prepare volumes and sandbox requirements
  2. ask the CRI runtime to create the Pod sandbox
  3. ensure networking is ready
  4. start init containers if present
  5. start app containers
  6. update status back to the API server

Kubernetes loves reconciliation loops, and Pod startup is no exception.
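
The reconciliation idea can be sketched in a few lines of Python. Everything below is a hypothetical model of the concept — compare desired state against actual state and emit the work needed to close the gap — not kubelet's real data structures:

```python
# Hypothetical sketch of a reconciliation step: given the Pods the API
# server says should run on this node and the Pods actually running,
# produce the list of actions needed to converge.

def reconcile(desired: dict, actual: dict) -> list[str]:
    actions = []
    for name in desired:
        if name not in actual:
            actions.append(f"create sandbox + containers for {name}")
    for name in actual:
        if name not in desired:
            actions.append(f"tear down {name}")
    return sorted(actions)

# One loop iteration: two Pods desired, one stale Pod still running.
print(reconcile(
    desired={"web-abc": {}, "cache-xyz": {}},
    actual={"web-abc": {}, "old-job": {}},
))
```

The real kubelet splits this work across per-Pod workers and caches, but the convergence logic is the same shape.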

The pause container — the unsung hero

One of the least celebrated pieces in Kubernetes is the pause container.

Beginners often assume a Pod is just a group of containers launched together. But before the app containers start, Kubernetes usually creates a tiny Pod sandbox first.

That sandbox is commonly represented by the pause container.

Why it exists

The pause container exists mainly to provide a stable anchor for the Pod's shared environment.

That includes, most importantly:

  • the Pod's network namespace
  • and, in configurations that share it, the Pod's PID namespace

Think of it as the object that says:

"This Pod now has a place where shared namespaces live. Other containers can join this place."

Without that anchor, it would be much harder to give multiple containers in the same Pod one shared network identity.

Namespace anchor, created first, dies last

The pause container is created first because other containers need something to join.

It dies last because tearing it down too early would collapse the shared Pod sandbox underneath the app containers.

That is why the Pod sandbox and the app containers are not the same conceptual thing.

PID 1 and zombie reaping

You will often hear that the pause container is "PID 1." The precise details depend on whether the Pod shares a PID namespace, but the intuition is still useful:

  • there is a first process in the sandbox context
  • that process anchors the shared environment
  • if PID namespace sharing is enabled, it can also play the familiar PID 1 role of reaping zombie processes

The main takeaway is not the exact host PID value. It is that the sandbox must have a stable lifetime independent of any single app container.
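
To make the reaping point concrete, here is a minimal, runnable demonstration of the job a PID-1-style process performs: collect a dead child's exit status so the kernel can release its process-table entry. This is the concept only — the real pause binary is a tiny C program, not Python:

```python
import os
import time

def spawn_and_reap() -> bool:
    pid = os.fork()
    if pid == 0:
        os._exit(0)            # child exits immediately
    time.sleep(0.1)            # child is now a zombie: exited but not yet reaped
    reaped, status = os.waitpid(pid, 0)   # the "reap": collect its exit status
    return reaped == pid and os.WIFEXITED(status)

print(spawn_and_reap())
```

If nothing in the PID namespace ever calls `wait()` on orphaned children, those zombies accumulate — which is exactly what the sandbox's first process guards against.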

Network namespace setup

Once kubelet asks the runtime to create the Pod sandbox, networking becomes the first major setup task.

To be slightly more precise, kubelet initiates sandbox creation through the CRI runtime, and the runtime is the component that actually creates the sandbox and invokes the CNI plugin.

kubelet initiates the netns creation, then CNI is called

From a high-level operator perspective, the flow looks like this:

  1. kubelet says, "Create the Pod sandbox"
  2. the runtime creates the sandbox and its network namespace
  3. the runtime invokes the CNI plugin
  4. the CNI plugin wires host and Pod networking together

CNI plugin wires a veth pair into the bridge

The common mental model is:

  • one end of a veth pair goes inside the Pod namespace as eth0
  • the other end remains on the host side
  • that host-side end is attached to a bridge or equivalent datapath

This gives the Pod a real Linux network interface inside its namespace.

IP address assigned from the node's PodCIDR

The Pod then receives an IP address, typically from the node's allocated Pod CIDR range.

That means each Pod gets its own IP, which is why Kubernetes networking feels different from classic Docker port-mapping mental models.

In Kubernetes, the default assumption is:

Pods are first-class IP endpoints.
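
A sketch of the allocation idea using Python's ipaddress module — this hypothetical allocator stands in for whatever IPAM plugin the cluster actually runs:

```python
import ipaddress

# Hypothetical per-node IPAM sketch: each node owns a PodCIDR slice of the
# cluster range, and each new Pod sandbox takes the next free address.

def allocate(pod_cidr: str, in_use: set[str]) -> str:
    net = ipaddress.ip_network(pod_cidr)
    for host in net.hosts():              # skips network/broadcast addresses
        if str(host) not in in_use:
            return str(host)
    raise RuntimeError("PodCIDR exhausted")

# Node was assigned 10.244.3.0/24; .1 is commonly reserved for the bridge.
print(allocate("10.244.3.0/24", in_use={"10.244.3.1"}))
```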

DNS: how /etc/resolv.conf is written

The sandbox also gets DNS configuration.

That is why, inside a Pod, /etc/resolv.conf usually points at cluster DNS and includes the Pod's search domains, allowing names like:

  • my-service
  • my-service.my-namespace
  • my-service.my-namespace.svc.cluster.local

to resolve the way Kubernetes expects.
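
As a sketch, here is how those pieces could be assembled into a resolv.conf. The cluster domain cluster.local, the DNS Service IP 10.96.0.10, and options ndots:5 are typical defaults, assumed here for illustration:

```python
# Illustrative render of a Pod's /etc/resolv.conf from cluster DNS settings.
# The search path is what lets short names like "my-service" resolve.

def render_resolv_conf(namespace: str,
                       cluster_domain: str = "cluster.local",
                       dns_ip: str = "10.96.0.10") -> str:
    search = (f"{namespace}.svc.{cluster_domain} "
              f"svc.{cluster_domain} {cluster_domain}")
    return (f"nameserver {dns_ip}\n"
            f"search {search}\n"
            "options ndots:5\n")

print(render_resolv_conf("my-namespace"))
```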

CNI call diagram

(Diagram: kubelet → CRI runtime → sandbox network namespace → CNI ADD → veth pair wired into the node bridge.)

Init containers, if any

If the Pod defines init containers, Kubernetes runs them before the main app containers.

They execute sequentially and must complete successfully before the next phase begins.

This is useful for setup work like:

  • waiting for dependencies
  • rendering config files
  • database migrations
  • downloading assets

Conceptually, init containers are part of the Pod birth sequence because the Pod is not considered fully started until that preparation work has finished.
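
The sequencing rule is simple enough to sketch directly (function names hypothetical):

```python
# Sketch of the ordering contract: init containers run one at a time, and
# a non-zero exit stops the Pod from ever reaching its app containers.

def start_pod(init_containers, app_containers) -> str:
    for run_init in init_containers:
        if run_init() != 0:
            return "Init:Error"        # kubelet will retry per restartPolicy
    for start_app in app_containers:
        start_app()
    return "Running"

print(start_pod(
    init_containers=[lambda: 0, lambda: 0],   # e.g. wait-for-db, render-config
    app_containers=[lambda: None],
))
```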

App container startup

Now we get to the part most people think of as "starting the container." In reality, even this is several sub-steps.

1. Image pull

If the image is not already on the node, the runtime pulls it from a registry.

That includes:

  • resolving the image reference
  • authenticating if needed
  • downloading image layers

2. Layer extraction

Container images are not one giant tarball of a root filesystem. They are usually layered.

The runtime unpacks those layers so they can become the read-only lower layers of the container filesystem.

3. overlay filesystem (overlayfs)

Most Linux container runtimes use overlayfs or an equivalent union filesystem idea.

This allows the container to see:

  • shared read-only image layers underneath
  • plus a thin writable layer on top

That is why multiple containers using the same image do not each need their own full copy of every file.

overlayfs layer diagram

(Diagram: read-only image layers stacked as lowerdirs, a writable upperdir on top, presented to the container as one merged root filesystem.)

The container experiences this as one merged root filesystem, even though underneath it is composed of multiple layers.
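
A toy Python model of that merged view — later layers shadow earlier ones, and the writable layer shadows the image:

```python
# Toy overlayfs model: map paths to contents, overlay layer by layer.
# Higher layers win for a given path; the writable upper layer wins overall.

def merged_view(lower_layers: list[dict], upper: dict) -> dict:
    view = {}
    for layer in lower_layers:     # bottom-most image layer first
        view.update(layer)
    view.update(upper)             # the container's own writes shadow the image
    return view

base   = {"/etc/os-release": "debian", "/usr/bin/app": "v1"}
update = {"/usr/bin/app": "v2"}            # a newer image layer
writes = {"/tmp/scratch.txt": "hello"}     # written by the container at runtime

print(merged_view([base, update], writes))
```

The real filesystem adds details like whiteout files for deletions, but the shadowing rule is the core idea.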

4. cgroups applied

Before or during container creation, the runtime ensures the appropriate cgroups exist for the Pod and container.

This is where Kubernetes resource policy turns into actual kernel-enforced controls.
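
As a sketch of that translation, here is how a CPU limit in millicores and a memory limit become cgroup v2 control values. The quota/period conversion is the standard one; the helper itself is illustrative:

```python
# Illustrative mapping from Kubernetes resource limits to cgroup v2 files.
# cpu: 500m means half a CPU: a quota of 50000us per 100000us period.

CPU_PERIOD_US = 100_000  # default CFS period: 100ms

def cgroup_values(cpu_limit_millicores: int, memory_limit_bytes: int) -> dict:
    quota = cpu_limit_millicores * CPU_PERIOD_US // 1000
    return {
        "cpu.max": f"{quota} {CPU_PERIOD_US}",
        "memory.max": str(memory_limit_bytes),
    }

# limits: cpu: 500m, memory: 256Mi
print(cgroup_values(500, 256 * 1024 * 1024))
```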

5. runc forks the process into the namespaces

Finally, the runtime delegates to runc or another OCI runtime.

That runtime sets up:

  • namespaces
  • mounts
  • cgroups
  • environment
  • capabilities
  • working directory
  • entrypoint and arguments

and then launches the real process.

At that moment, the container becomes what it has always been underneath the abstraction:

a Linux process running under carefully prepared constraints.
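
Stripped of the namespace and cgroup setup (which needs privileges), that final step is a classic fork-and-exec. A minimal, runnable sketch:

```python
import os

# Bare-bones version of the launch: fork, prepare the child's environment,
# then execve the entrypoint. A real OCI runtime would first unshare
# namespaces, join cgroups, pivot_root into the container rootfs, and drop
# capabilities in the child (all omitted here).

def launch(argv: list[str], env: dict, cwd: str) -> int:
    pid = os.fork()
    if pid == 0:
        os.chdir(cwd)                      # workingDir from the container spec
        os.execve(argv[0], argv, env)      # replaces the child with the app
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status)

print(launch(["/bin/sh", "-c", "exit 0"], {"PATH": "/usr/bin:/bin"}, "/"))
```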

Pod termination mirror path

Part 3 is mostly about birth, but operational reasoning gets much easier if you also know the reverse sequence.

When a Pod is terminated, the high-level path is usually:

  1. Pod deletion is requested
  2. endpoint updates begin removing it from Service backends
  3. kubelet sends SIGTERM to container processes
  4. preStop hooks run if defined
  5. graceful shutdown window (terminationGracePeriodSeconds) is honored
  6. any remaining processes receive SIGKILL
  7. sandbox and network resources are torn down

The same node layers are involved, just in reverse order.
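
The SIGTERM-then-SIGKILL timing can be sketched directly. Here the "app" deliberately ignores SIGTERM, forcing the force-kill path after the grace window:

```python
import os
import signal
import time

# Sketch of kubelet-side shutdown timing: SIGTERM first, wait up to the
# grace period, then SIGKILL anything still alive.

def terminate(pid: int, grace_seconds: float) -> str:
    os.kill(pid, signal.SIGTERM)
    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline:
        done, _ = os.waitpid(pid, os.WNOHANG)
        if done == pid:
            return "graceful"
        time.sleep(0.05)
    os.kill(pid, signal.SIGKILL)           # grace window exceeded
    os.waitpid(pid, 0)
    return "force-killed"

child = os.fork()
if child == 0:
    signal.signal(signal.SIGTERM, signal.SIG_IGN)  # a badly behaved app
    time.sleep(60)
    os._exit(0)
time.sleep(0.1)  # let the child install its handler first

print(terminate(child, grace_seconds=0.5))
```

Swap the SIG_IGN for a handler that exits cleanly and the same function returns "graceful" — which is exactly the behavioral difference terminationGracePeriodSeconds is tuning.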

Why this symmetry matters

Many production issues are not startup failures but shutdown timing failures:

  • process exits too slowly and gets force-killed
  • readiness drops too late and traffic arrives during shutdown
  • hooks do too much work and exceed grace windows

Understanding startup without termination leaves an important half of Pod lifecycle reasoning missing.

Where storage joins this flow

Networking is the most visible part of sandbox setup, but storage joins the Pod lifecycle too.

During startup, kubelet and the runtime also coordinate storage work such as:

  • asking the node-side volume manager to prepare required volumes
  • invoking CSI-driven mount work for persistent volumes when needed
  • bind-mounting prepared volume paths into the Pod sandbox and containers
  • container writable layer creation on top of image layers
  • ephemeral storage accounting for logs, writable layers, and emptyDir

That means Pod startup is not only about "network first, process second." It is also about making sure the container sees the right filesystem paths before the application begins.

This is why DiskPressure and startup behavior are often connected in real clusters.

We are intentionally not going deep on CSI in this part, because it deserves its own mental model. If networking gives the Pod an IP address, storage gives it a usable directory tree backed by something real. That full path is covered in Part 5 — CSI, volumes, and mounts on the node.

kube-proxy's role after pod start

The Pod is now running, but the cluster still needs to learn how to send traffic to it.

If the Pod matches a Service selector and is ready, it becomes part of that Service's backend set.

That causes endpoint data to update, and kube-proxy on nodes can refresh its local routing rules.

Endpoint added, iptables or IPVS rules updated

Once the endpoint is visible, kube-proxy updates the dataplane so Service traffic can reach the new Pod.

That may mean new iptables or IPVS entries, depending on cluster mode.
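
What those refreshed rules achieve can be modeled as a tiny lookup table: one ClusterIP:port fanning out to the current endpoint set, with one backend chosen per connection. Addresses here are invented for illustration:

```python
import random

# Toy model of the Service dataplane. iptables mode picks a backend with
# random statistic matching; IPVS supports several schedulers. Either way,
# the effect is: ClusterIP:port -> one Pod endpoint per connection.

ENDPOINTS = {
    ("10.96.12.7", 80): [("10.244.3.2", 8080), ("10.244.5.9", 8080)],
}

def route(cluster_ip: str, port: int) -> tuple[str, int]:
    backends = ENDPOINTS[(cluster_ip, port)]   # refreshed as endpoints change
    return random.choice(backends)             # one DNAT target per connection

print(route("10.96.12.7", 80))
```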

Service ClusterIP now routes to the Pod

This is the point where traffic sent to the Service can begin reaching the newly started backend.

So Pod birth is not complete merely when the process exists. It is complete when:

  • the process is running
  • the Pod is network-reachable
  • readiness is satisfied
  • service routing knows about it

Status reported back to the API server

Throughout the process, kubelet keeps reporting status upward.

That includes transitions such as:

  • image pulling
  • container creating
  • running
  • probe failures if any
  • restart states

This is what eventually surfaces to you through kubectl get pods and kubectl describe pod.

In other words, the API server is not watching the process directly. The kubelet is the node-local observer that reports what happened.

Startup latency hotspots

If Pod startup feels slow, the delay is usually concentrated in a few stages:

  • large image pulls or slow registry auth
  • CNI setup and IP allocation latency
  • heavy init container work
  • probe configuration that delays readiness transition

This gives you a practical triage order before diving into low-level logs.
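
One practical way to locate the slow stage is to diff the Pod's condition transition timestamps, which kubelet reports in status.conditions. A sketch with invented timestamps:

```python
from datetime import datetime

# Compute how long each startup stage took from Pod condition transition
# times. Timestamp values are invented for illustration.

def stage_durations(t: dict) -> dict:
    ts = {k: datetime.fromisoformat(v) for k, v in t.items()}
    return {
        "scheduled -> init done":       (ts["Initialized"] - ts["PodScheduled"]).total_seconds(),
        "init done -> containers ready": (ts["ContainersReady"] - ts["Initialized"]).total_seconds(),
        "containers ready -> ready":     (ts["Ready"] - ts["ContainersReady"]).total_seconds(),
    }

print(stage_durations({
    "PodScheduled":    "2026-03-26T10:00:00",
    "Initialized":     "2026-03-26T10:00:02",
    "ContainersReady": "2026-03-26T10:00:31",   # a slow image pull shows up here
    "Ready":           "2026-03-26T10:00:32",
}))
```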

Key diagram — full birth sequence timeline

Here is the full sequence in one numbered flow.

  1. You submit pod.yaml to the API server.
  2. The scheduler chooses a node.
  3. The Pod is now assigned to that node.
  4. The kubelet on that node sees the new Pod in its watch loop.
  5. Kubelet asks the CRI runtime to create a Pod sandbox.
  6. The runtime creates the pause container and sandbox namespaces.
  7. The runtime invokes the CNI plugin to configure networking.
  8. Volumes and DNS configuration are prepared.
  9. Init containers run, one by one, if defined.
  10. App images are pulled and unpacked; overlayfs layers are prepared.
  11. runc launches the app process with the right namespaces and cgroups.
  12. Kubelet reports status, and kube-proxy begins routing Service traffic once endpoints are ready.

If you want the shortest accurate summary of Pod creation, it is this:

kubelet asks the runtime to create a sandbox, wire networking, start containers, and then reports the results back to the control plane.

Concepts introduced

CNI spec — the interface contract

CNI stands for Container Network Interface.

Like CRI, it is a contract. It says, in effect:

"Given a container or sandbox network namespace, here is how a plugin should attach networking and report the result."

This is why Kubernetes can work with many networking implementations while keeping a consistent mental model.
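
The contract in miniature: the runtime executes a plugin binary with spec-defined CNI_* environment variables and a JSON network config on stdin, and the plugin replies with a JSON result. A sketch that builds such an invocation — the variable names come from the CNI spec, while the concrete values are illustrative:

```python
import json

# Build the environment and stdin payload for a CNI ADD call, as the
# runtime would before exec-ing the plugin binary. Values are illustrative.

def cni_add_invocation(container_id: str, netns_path: str, ifname: str) -> dict:
    env = {
        "CNI_COMMAND": "ADD",              # also DEL, CHECK, VERSION
        "CNI_CONTAINERID": container_id,
        "CNI_NETNS": netns_path,           # the sandbox's network namespace
        "CNI_IFNAME": ifname,              # usually eth0 inside the Pod
        "CNI_PATH": "/opt/cni/bin",        # where plugin binaries live
    }
    stdin_config = json.dumps({
        "cniVersion": "1.0.0",
        "name": "podnet",
        "type": "bridge",                  # selects which plugin binary runs
    })
    return {"env": env, "stdin": stdin_config}

print(cni_add_invocation("abc123", "/var/run/netns/cni-abc123", "eth0"))
```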

overlayfs — lowerdir, upperdir, workdir

The image layers live in the read-only lowerdir stack.

The container's own writable changes go to upperdir.

The filesystem driver uses workdir internally to manage the merged view.

That is the filesystem reason containers feel lightweight.
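
Those three directories show up verbatim in the mount options. A sketch that builds the option string a runtime would pass to the overlay mount (the snapshot paths are illustrative, loosely modeled on a containerd-style layout):

```python
# Build an overlayfs mount option string: lowerdir entries are
# colon-separated with the top-most layer listed first.

def overlay_mount_options(lowers: list[str], upper: str, work: str) -> str:
    return f"lowerdir={':'.join(lowers)},upperdir={upper},workdir={work}"

opts = overlay_mount_options(
    lowers=["/var/lib/runtime/snapshots/42/fs",   # newest image layer
            "/var/lib/runtime/snapshots/41/fs"],  # base image layer
    upper="/var/lib/runtime/snapshots/99/fs",     # container's writable layer
    work="/var/lib/runtime/snapshots/99/work",    # overlayfs internal scratch
)
print(f"mount -t overlay overlay -o {opts} /merged")
```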

Pod sandbox vs app container distinction

This is one of the most useful conceptual distinctions in Kubernetes internals.

  • Pod sandbox — holds the shared Pod environment, especially networking
  • App container — runs the application process inside that prepared environment

The sandbox comes first. The app containers join it.

Once that distinction is clear, the pause container stops feeling weird and starts feeling necessary.

Final mental model

A Pod is not born in one step.

It is born through a chain of lower-level operations:

  • watch
  • queue
  • sandbox
  • networking
  • filesystem setup
  • cgroups
  • process launch
  • endpoint registration
  • status reporting

That is what happened in the few hundred milliseconds between your YAML hitting the API server and your application logging its first line.

In Part 4, we will look at the other side of node life: not how a Pod starts, but how the node stays healthy, proves it is alive, handles pressure, and decides what to evict when things go wrong.

Storage deep dive: Part 5 — CSI, volumes, and mounts on the node

Next: Part 4 — Keeping the node alive