Kubernetes Node Internals — Part 1: Anatomy
Part 1 of a 5-part series: the layered anatomy of a Kubernetes node, from hardware and kernel to kubelet, CRI, OCI, and the control plane boundary.
"A node looks like a virtual machine. But what's actually running on it is a precise stack of six or seven processes — each with one job. Pull any one out and the whole thing stops working. Let's meet them."
When people first learn Kubernetes, they usually start from the top: Pods, Deployments, Services, Ingress, and maybe Helm charts.
But eventually you hit a more fundamental question:
What is actually happening on the machine that runs my Pod?
That machine is the node. And while it may look like a regular VM or Linux server, a Kubernetes node is really a carefully layered runtime stack. Each layer does exactly one thing. The kubelet talks to the container runtime. The container runtime delegates to a lower-level runtime. That lower-level runtime asks the Linux kernel to create isolated processes. And the kernel is the only layer that can actually enforce the illusion that a container is "its own little machine."
This post is Part 1 of a 5-part series on what happens inside a Kubernetes node.
Series roadmap
- Part 1 — The anatomy of a node
- Part 2 — Bootstrap and the secret handshake
- Part 3 — A pod is born
- Part 4 — Keeping the node alive
- Part 5 — CSI, volumes, and mounts on the node
Scope note — what this part intentionally skips
Part 1 is about the execution stack anatomy and component boundaries.
To keep that model clean, it intentionally does not go deep into:
- Pod termination and graceful shutdown timing
- node storage internals (CSI, volume mounts on the node, and ephemeral storage pressure)
- runtime hardening details (seccomp, AppArmor, SELinux, capabilities)
Those are important, but they are easier to understand once the core stack is clear. We will come back to the storage path explicitly in Part 5.
What is a Kubernetes node?
A Kubernetes node is a machine that can run Pods.
That machine might be:
- a cloud VM in EKS, GKE, or AKS
- a bare-metal server in your data center
- a local machine in Minikube or kind
Conceptually, though, a node is not just "a server in the cluster." It is the execution environment where Kubernetes turns a desired state into running Linux processes.
If the control plane is the part of Kubernetes that decides, the node is the part that does.
The scheduler may decide that your Pod should run on node-7, but nothing actually happens until software on node-7 performs the work:
- download the image
- prepare isolation
- create cgroups
- wire networking
- start processes
- report status back
That is why understanding the node is so valuable. If something breaks while a Pod starts, stops, gets OOM-killed, or loses networking, the explanation is almost always somewhere inside the node stack.
The layered stack
The easiest way to understand a node is as a stack of layers.
From bottom to top, it looks like this:
- Hardware
- Linux kernel
- OCI runtime, like runc
- Container runtime / CRI implementation, like containerd or CRI-O
- kubelet
- Kubernetes API server (on the control plane)
Let's walk upward.
Control loops at a glance
Across the series, you can think in three loops:
- Trust loop (Part 2): how a machine becomes a trusted node
- Creation loop (Part 3): how desired Pod state becomes running processes
- Survival loop (Part 4): how the node stays healthy under pressure
Part 1 is the static map of components these loops run through.
1. Hardware
At the bottom is the actual machine: CPUs, memory, disks, network interfaces, NUMA topology, and device drivers.
Containers do not run on abstract magic. They run on real cores, allocate real RAM, and write to real block devices.
When you request:

```yaml
resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
```

you are ultimately asking Kubernetes to carve out a safe slice of real machine resources.
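To make that concrete, here is a minimal Python sketch of how a runtime might translate a CPU value in millicores into the cgroup v2 `cpu.max` string the kernel enforces. This is a simplification: the real kubelet/runtime path is more involved, a hard quota like this corresponds to CPU *limits*, and requests instead influence `cpu.weight`. The function name is illustrative.

```python
def cpu_max_from_millicores(millicores: int, period_us: int = 100_000) -> str:
    """Translate a Kubernetes CPU value in millicores into a cgroup v2
    cpu.max string of the form "<quota_us> <period_us>".
    1000m means one full core per scheduling period."""
    quota_us = millicores * period_us // 1000
    return f"{quota_us} {period_us}"

# 500m -> half a core's worth of runtime per 100ms period
print(cpu_max_from_millicores(500))   # -> "50000 100000"
```

The point is only that a human-friendly unit like "500m" bottoms out in a concrete number the kernel's CPU scheduler accounts against.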
2. Linux kernel
The kernel is the most important layer on the node.
It is the kernel that provides the primitives containers rely on:
- namespaces for isolation
- cgroups for resource control
- virtual networking primitives like veth devices and bridges
- filesystems like overlayfs
- syscalls for process creation and management
This is worth stating clearly:
Containers are not a kernel feature called "containers."
Containers are a packaging idea built out of multiple Linux kernel features working together.
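As a quick illustration, the sketch below lists the kernel namespaces the current process belongs to. Every Linux process, containerized or not, is a member of some set of these namespaces; a container is simply a process placed into fresh ones. The interesting part only works on Linux (it reads `/proc`); elsewhere it returns an empty mapping.

```python
import os

# Namespace types as they appear under /proc/<pid>/ns on Linux.
NAMESPACE_TYPES = ["cgroup", "ipc", "mnt", "net", "pid", "user", "uts"]

def namespaces_of(pid: str = "self") -> dict:
    """Return {namespace_type: identity} for a process, e.g.
    {"pid": "pid:[4026531836]", ...}. Empty dict if /proc is absent."""
    ns_dir = f"/proc/{pid}/ns"
    if not os.path.isdir(ns_dir):  # not on Linux
        return {}
    return {
        ns: os.readlink(f"{ns_dir}/{ns}")
        for ns in NAMESPACE_TYPES
        if os.path.lexists(f"{ns_dir}/{ns}")
    }

print(namespaces_of())
```

Two processes in the same container share these identities; a process in a different container shows different ones. That, plus cgroups and a mounted root filesystem, is most of what "container" means.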
3. OCI runtime
An OCI runtime is the tool that performs the final, low-level act of launching a containerized process.
The most common example is runc.
runc is not a high-level orchestrator. It does not schedule Pods. It does not watch the API server. It does not pull images. It takes a prepared container spec and says, in effect:
"Create this process with these namespaces, these cgroup limits, this root filesystem, and this command."
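A heavily trimmed fragment of such a spec, an OCI `config.json`, might look like the following. Field names follow the OCI runtime specification; the concrete values here are made up for illustration.

```json
{
  "ociVersion": "1.0.2",
  "process": {
    "args": ["/usr/bin/myapp"],
    "env": ["PATH=/usr/bin"],
    "cwd": "/"
  },
  "root": { "path": "rootfs" },
  "linux": {
    "namespaces": [
      { "type": "pid" },
      { "type": "network" },
      { "type": "mount" }
    ],
    "resources": {
      "memory": { "limit": 536870912 }
    }
  }
}
```

runc reads a bundle like this and turns it into exactly one thing: a kernel-enforced process.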
It is very close to the kernel.
4. Container runtime / CRI implementation
Above runc sits a higher-level container runtime such as containerd or CRI-O.
This layer knows how to:
- pull images
- unpack layers
- manage container lifecycle
- expose a stable interface to kubelet
- invoke a lower-level OCI runtime such as runc
Think of this layer as the container operations manager.
If runc is the final worker that starts a process, containerd or CRI-O is the supervisor that knows how to prepare the work and keep track of many containers over time.
5. kubelet
The kubelet is the primary Kubernetes agent running on the node.
It watches the API server and asks a simple question over and over:
"What Pods should exist on this machine, and what do I need to do to make reality match that?"
The kubelet does not launch containers directly. Instead, it delegates to the container runtime via the Container Runtime Interface (CRI).
That separation is one of Kubernetes' most important design choices. Kubernetes does not want to be tied to one runtime implementation.
6. API server
Finally, above the node stack is the API server, which lives on the control plane.
The API server is not on the node to run workloads. It is the source of truth for cluster state.
The kubelet constantly communicates with it:
- reading desired state
- writing Pod status
- renewing heartbeats
- updating Node information
The node is where the work happens. The API server is where the intent and recorded state live.
The cast of characters
If you SSH into a node and inspect the important moving parts, these are the names you should know.
| Component | What it is | Main responsibility | Why it exists |
|---|---|---|---|
| kubelet | Kubernetes node agent | Makes actual state match Pod specs | The node needs a local Kubernetes brain |
| containerd or CRI-O | Container runtime / CRI implementation | Pulls images, manages sandboxes and containers | kubelet needs a runtime it can talk to consistently |
| runc | OCI runtime | Creates the containerized Linux process | Somebody must do the final clone, setns, mount, and exec work |
| kube-proxy | Node networking agent | Programs Service routing rules | Services need local packet steering |
| crictl | Troubleshooting CLI | Debugs CRI-compatible runtimes | Humans need a runtime-level inspection tool |
Let's make each one concrete.
kubelet — the node's brain
If the node had to be explained in one sentence, it would be this:
kubelet is the process that turns a Pod spec into reality.
It is responsible for:
- watching for Pods assigned to the node
- creating Pod sandboxes through the runtime
- starting and stopping containers
- mounting volumes
- running health probes
- reporting container and Pod status back to the API server
- publishing node conditions such as Ready, MemoryPressure, and DiskPressure
The kubelet is not the scheduler. It does not decide which node should run the Pod. That decision was already made elsewhere. The kubelet only handles: "This Pod is now your problem."
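A toy sketch of that reconcile idea, in illustrative Python that looks nothing like the real kubelet code, shows the shape of the loop: compare desired state against actual state, then act on the difference.

```python
class ToyRuntime:
    """Stand-in for the container runtime the kubelet delegates to."""
    def __init__(self):
        self.running = set()
    def start(self, name): self.running.add(name)
    def stop(self, name): self.running.discard(name)

def reconcile(desired_pods: set, actual_pods: set, runtime: ToyRuntime):
    """One pass of a kubelet-style sync loop (heavily simplified):
    start what should exist but doesn't, stop what exists but shouldn't."""
    for name in desired_pods - actual_pods:
        runtime.start(name)
    for name in actual_pods - desired_pods:
        runtime.stop(name)

rt = ToyRuntime()
rt.running = {"old-pod"}                      # actual state on the node
reconcile({"web", "db"}, set(rt.running), rt)  # desired state from the API server
print(sorted(rt.running))  # -> ['db', 'web']
```

The real kubelet runs this comparison continuously, which is why deleting a Pod object eventually stops its containers: the next sync pass notices the difference.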
CRI-O and containerd — the CRI layer
Kubelet needs a runtime, but it does not want runtime-specific code for every vendor and implementation. That is where the Container Runtime Interface (CRI) comes in.
CRI defines a standard contract for operations such as:
- create a Pod sandbox
- pull an image
- start a container
- stop a container
- list containers
- fetch logs and status
containerd and CRI-O are two popular runtimes that implement this contract.
Important nuance: a node usually runs one CRI implementation, not both. In practice you will typically see either:
- containerd: general-purpose and the most widely adopted
- CRI-O: purpose-built for Kubernetes, tightly aligned with CRI and OCI
From kubelet's perspective, both solve the same problem:
"I need a thing behind a socket that knows how to create and manage containers."
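The shape of that contract can be sketched as an interface with swappable implementations. This is illustrative Python, not the real CRI (which is a gRPC protobuf API); the method and class names only mirror the spirit of the RPCs, and `FakeRuntime` is a toy stand-in for containerd or CRI-O.

```python
from abc import ABC, abstractmethod

class ContainerRuntime(ABC):
    """Sketch of the CRI contract shape: kubelet codes against this,
    and any conforming runtime can answer behind the socket."""
    @abstractmethod
    def run_pod_sandbox(self, pod_config: dict) -> str: ...
    @abstractmethod
    def create_container(self, sandbox_id: str, container_config: dict) -> str: ...
    @abstractmethod
    def start_container(self, container_id: str) -> None: ...
    @abstractmethod
    def stop_pod_sandbox(self, sandbox_id: str) -> None: ...

class FakeRuntime(ContainerRuntime):
    """Toy in-memory implementation standing in for containerd/CRI-O."""
    def __init__(self):
        self.sandboxes, self.containers = {}, {}
    def run_pod_sandbox(self, pod_config):
        sid = f"sandbox-{len(self.sandboxes)}"
        self.sandboxes[sid] = pod_config
        return sid
    def create_container(self, sandbox_id, container_config):
        cid = f"container-{len(self.containers)}"
        self.containers[cid] = {"sandbox": sandbox_id, "state": "created"}
        return cid
    def start_container(self, container_id):
        self.containers[container_id]["state"] = "running"
    def stop_pod_sandbox(self, sandbox_id):
        for c in self.containers.values():
            if c["sandbox"] == sandbox_id:
                c["state"] = "exited"

# kubelet-style usage: the same calls work whatever runtime answers
rt = FakeRuntime()
sid = rt.run_pod_sandbox({"name": "web"})
cid = rt.create_container(sid, {"image": "nginx"})
rt.start_container(cid)
print(rt.containers[cid]["state"])  # -> running
```

Swapping `FakeRuntime` for a different implementation changes nothing above the interface, which is exactly the decoupling CRI buys Kubernetes.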
runc — the OCI runtime
If containerd or CRI-O is the manager, runc is the mechanic that actually starts the engine.
It consumes an OCI bundle: a filesystem plus a config.json describing mounts, namespaces, environment variables, capabilities, cgroups, and the command to execute.
Then it asks the Linux kernel to make the process real.
That is an important mental model:
- kubelet does not start your process
- containerd / CRI-O usually does not directly start your process either
- runc performs the low-level launch
- the kernel is the only thing that can enforce the isolation and limits
kube-proxy — the network plumber
Pods get IP addresses, but Kubernetes networking is not only about Pod-to-Pod reachability. Services also need a way to route traffic to the correct backend Pods.
That is where kube-proxy typically enters the picture.
On many clusters, kube-proxy runs on every node and programs network rules using iptables or ipvs.
Its job is to translate a stable Service virtual IP into one of the real backend Pod IPs.
So if a packet arrives for 10.96.12.4:80, kube-proxy may rewrite or steer it toward one of the actual Pod endpoints behind that Service.
It is the node's packet plumber.
Small accuracy note: modern clusters sometimes replace kube-proxy behavior with eBPF-based systems such as Cilium. But kube-proxy is still the default mental model most Kubernetes users should start with.
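That steering idea can be sketched in a few lines. This is illustrative Python with made-up VIP and Pod addresses; real kube-proxy does not handle packets itself, it programs kernel rules (iptables DNAT chains or ipvs virtual servers) that do the equivalent per packet or per connection.

```python
import random

# Hypothetical Service table: (virtual IP, port) -> real Pod endpoints.
# This mirrors what kube-proxy programs into iptables/ipvs rules.
service_endpoints = {
    ("10.96.12.4", 80): [("10.244.1.5", 8080), ("10.244.2.9", 8080)],
}

def steer(dst_ip: str, dst_port: int) -> tuple:
    """Pick one backend for a packet addressed to a Service VIP,
    like a DNAT rule with random endpoint selection."""
    backends = service_endpoints.get((dst_ip, dst_port))
    if backends is None:
        return (dst_ip, dst_port)  # not a Service VIP: leave it untouched
    return random.choice(backends)

print(steer("10.96.12.4", 80))  # one of the two Pod endpoints
```

The stable VIP never corresponds to a real interface; it exists only as a rewrite rule on every node.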
crictl — the debugging CLI, not a dependency
crictl is often misunderstood.
It is not required for Kubernetes to run.
Instead, it is a human-facing troubleshooting tool that can talk directly to a CRI-compatible runtime. If kubectl shows something odd, crictl helps you inspect what the runtime believes is happening.
Examples of what crictl is good for:
- listing Pod sandboxes
- listing containers
- checking image state
- reading container status when kubelet output is not enough
Think of crictl as a runtime-level stethoscope, not a piece of the production dependency chain.
The dependency chain
Now we can connect the pieces.
When a Pod starts on a node, the control flow usually looks like this: the kubelet receives the Pod spec, asks the CRI runtime over its local socket to create the sandbox and containers, the CRI runtime pulls and unpacks the image and invokes the OCI runtime, and runc asks the kernel to create the isolated process.
In compact form: kubelet → CRI runtime (containerd or CRI-O) → OCI runtime (runc) → Linux kernel.
This chain explains a lot of real-world debugging:
- If kubelet is unhealthy, Pods may never start.
- If the CRI runtime is down, kubelet has nobody to delegate to.
- If
runccannot set up the container, the container never becomes a process. - If the kernel cannot provide namespaces, cgroups, mounts, or networking, everything above it fails.
If you only remember one startup debugging heuristic, use this order:
kubelet → CRI runtime → OCI runtime → kernel primitives
The gRPC socket between kubelet and CRI
One of the cleanest architectural decisions in Kubernetes is that kubelet and the runtime usually communicate over a local Unix socket using gRPC.
This gives Kubernetes a stable abstraction boundary.
Kubelet can say things like:
- RunPodSandbox
- CreateContainer
- StartContainer
- StopPodSandbox
without caring whether the implementation behind the socket is containerd or CRI-O.
That decoupling is what makes Kubernetes modular.
What the node does NOT own
A very common beginner mistake is to mentally place every Kubernetes component on the node.
That is not how the architecture works.
Here are the major pieces the node does not own.
| Component | Lives where conceptually | What it does |
|---|---|---|
| etcd | Control plane | Stores cluster state |
| kube-apiserver | Control plane | Exposes the Kubernetes API |
| kube-scheduler | Control plane | Chooses which node should run a Pod |
| kube-controller-manager | Control plane | Runs reconciliation loops for cluster-level controllers |
The node does not schedule itself. It does not store cluster truth. It does not make global placement decisions.
It is the execution worker, not the cluster authority.
One caveat: in local learning setups or single-node clusters, these components may happen to run on the same machine. But logically they still belong to the control plane, not the node execution stack.
Key diagram — the vertical stack
Here is the simplest useful diagram for this entire post, as a vertical stack written top to bottom:
- Kubernetes API server (control plane)
- kubelet
- container runtime (containerd or CRI-O)
- OCI runtime (runc)
- Linux kernel
- hardware
Read the diagram bottom to top when asking, "What makes the container possible?"
Read it top to bottom when asking, "How does a Pod spec become a process?"
Concepts introduced
Before moving to Part 2, make sure these three ideas are solid.
1. Container Runtime Interface (CRI)
CRI is the contract between kubelet and the container runtime.
It exists so Kubernetes can delegate container operations without being tightly coupled to one runtime implementation.
2. Open Container Initiative (OCI)
OCI defines standards around container image and runtime formats.
In practice, when people say OCI runtime, they usually mean a runtime like runc that knows how to launch a container from an OCI-compliant spec.
3. gRPC socket between kubelet and the runtime
The kubelet talks to the runtime over a local gRPC API, typically exposed through a Unix domain socket.
That local socket is the handoff point between "Kubernetes orchestration" and "container lifecycle management."
Final mental model
If you remember only one thing from this post, remember this:
A Kubernetes node is not one thing. It is a stack.
- The kubelet manages desired vs actual state.
- The CRI runtime manages container lifecycle.
- The OCI runtime launches the process.
- The kernel provides the isolation and limits.
- The hardware provides the real resources.
Once that layering clicks, a lot of Kubernetes starts to feel less magical.
In Part 2, we will go one layer deeper: the Linux primitives, TLS bootstrapping, and the security handshake that allows a node to join the cluster in the first place.