The Process: How the Kernel Fakes an Entire Machine
A process is not a program. It's the kernel's most fundamental illusion — a private machine that never existed. This post opens the box.
In the last post, we established the rule: user code cannot touch hardware. Every request goes through a system call, every system call is validated, and the kernel never trusts you. That’s the trust invariant — the wall between your code and the machine.
But walls alone aren’t enough. The kernel doesn’t just block access to hardware. It creates a fiction. Every program you’ve ever launched believed it was the only thing running. It had its own memory. Its own CPU. Its own view of the filesystem. None of that was real. The kernel fabricated all of it.
That fiction has a name: the process.
A Program Is Not a Process
A program is a file on disk. Bytes. Instructions encoded in ELF or Mach-O format, sitting inert until something brings them to life.
A process is a program in execution — the living, running instance with state the kernel tracks.
// This is a program (static, on disk):
// /usr/bin/my_app — compiled binary, ~2MB
// When you run it, the kernel creates a process:
// - Loads the binary into a fresh address space
// - Sets the instruction pointer to the entry point
// - Allocates a stack
// - Assigns a PID
// - Starts execution
// Run it twice? Two processes. Same program. Separate everything.
You can run the same program ten times. You get ten processes — each with its own memory, its own PID, its own registers, its own file descriptors. They share nothing unless you explicitly arrange it. This is not a convenience. This is the isolation invariant at work.
graph LR
subgraph Disk["On Disk"]
BIN["my_app (binary)"]
end
BIN -->|"exec #1"| P1["Process 1\nPID: 1001\nMemory: 0x00..."]
BIN -->|"exec #2"| P2["Process 2\nPID: 1002\nMemory: 0x00..."]
BIN -->|"exec #3"| P3["Process 3\nPID: 1003\nMemory: 0x00..."]
style Disk fill:#0f3460,stroke:#16213e,color:#fff
style P1 fill:#1a1a2e,stroke:#e94560,color:#fff
style P2 fill:#1a1a2e,stroke:#e94560,color:#fff
style P3 fill:#1a1a2e,stroke:#e94560,color:#fff
Same binary. Three processes. Three separate universes. The kernel maintains this separation on every clock tick.
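To see this from user space, here is a minimal sketch that launches one program twice and compares the PIDs the kernel hands back. The helper name `spawn_twice` is made up for illustration, and the sketch assumes a Unix-like system where the chosen program (for example `true`) is on the PATH.

```rust
use std::io;
use std::process::{Child, Command};

/// Hypothetical helper: start the same program twice and return both PIDs.
fn spawn_twice(program: &str) -> io::Result<(u32, u32)> {
    // Two spawns of one binary give two independent processes.
    let mut first: Child = Command::new(program).spawn()?;
    let mut second: Child = Command::new(program).spawn()?;

    // Neither child has been waited on yet, so both PIDs are still in use.
    let pids = (first.id(), second.id());

    // Collect both exit statuses so no zombies are left behind.
    first.wait()?;
    second.wait()?;
    Ok(pids)
}
```

Calling `spawn_twice("true")` should yield two different PIDs, and neither should match the parent's own `std::process::id()`.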
What the Kernel Tracks
Every process is expensive. Not because execution is expensive — addition and branching are cheap — but because the kernel must maintain a complete description of the process’s world. In Linux, that description lives in a structure called task_struct.
Here’s a simplified view of what it contains:
// Simplified model of what the kernel tracks per process.
// The real Linux task_struct is over 600 lines. This captures the essentials.
struct ProcessControlBlock {
    // Identity
    pid: u32,                        // unique process ID
    ppid: u32,                       // parent process ID
    state: ProcessState,             // running, ready, blocked

    // CPU state (saved on context switch)
    registers: RegisterSet,          // rax, rbx, rcx, ... rip, rsp, rflags

    // Memory
    page_table: *mut PageTable,      // virtual → physical address mapping

    // Files
    open_files: Vec<FileDescriptor>, // fd 0=stdin, 1=stdout, 2=stderr, ...

    // Scheduling
    priority: i32,
    time_slice_remaining: u64,       // nanoseconds left before preemption

    // Accounting
    user_time: u64,                  // CPU time in user mode
    kernel_time: u64,                // CPU time in kernel mode
}

enum ProcessState {
    Running, // currently on a CPU
    Ready,   // wants to run, waiting for CPU
    Blocked, // waiting for I/O or event
    Zombie,  // exited, parent hasn't collected status
}
This is the kernel’s mental model of your process. When the scheduler decides “process 1001 runs next,” it loads this structure, restores all the registers, switches to the correct page table, and jumps to wherever rip points. Your process resumes exactly where it left off, unaware that it was ever paused.
The process invariant: every process executes as if it were the only thing running on the machine. The kernel saves and restores enough state to maintain this illusion across every context switch, every interrupt, every preemption.
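On Linux you can peek at a small slice of this bookkeeping through `/proc/<pid>/status`, which the kernel renders as `Key:<tab>value` lines. The sketch below is only a parser for that text format. `status_field` and `read_own_status` are hypothetical helper names, and the `/proc` path is Linux-specific.

```rust
use std::fs;
use std::io;

/// Hypothetical helper: pull one field out of text shaped like
/// /proc/<pid>/status, where each line looks like "Key:\tvalue".
fn status_field<'a>(status: &'a str, key: &str) -> Option<&'a str> {
    status.lines().find_map(|line| {
        let (name, value) = line.split_once(':')?;
        if name == key {
            Some(value.trim())
        } else {
            None
        }
    })
}

/// Linux-only assumption: the kernel exposes the calling process here.
fn read_own_status() -> io::Result<String> {
    fs::read_to_string("/proc/self/status")
}
```

On a Linux machine, `status_field(&read_own_status()?, "PPid")` would return the parent PID as text, the same value the `ppid` field models above.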
Process States
For scheduling purposes, a live process is always in one of three states (Zombie, from the enum above, only applies after exit). The transitions between them tell you most of what you need to know about how the OS schedules work.
stateDiagram-v2
[*] --> Ready: created
Ready --> Running: picked by scheduler
Running --> Ready: preempted (time slice expired)
Running --> Blocked: waiting for I/O / event
Blocked --> Ready: I/O complete / event arrived
Running --> [*]: exit()
Running: the process is on a CPU right now, executing instructions. On a 4-core machine, at most 4 processes are in this state at any moment.
Ready: the process could run, but the scheduler hasn't picked it yet. It's in the run queue, waiting its turn. On a busy machine, this is where CPU-hungry processes queue up.
Blocked: the process asked for something that isn't ready yet. A disk read. A network packet. A mutex held by someone else. The kernel takes it off the run queue entirely. No point scheduling a process that can't make progress. This is where most processes spend most of their time.
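The diagram's edges can be written down as a small transition function. This is a toy model, not kernel code: the `Event` names are invented here, and `Zombie` stands in for the diagram's exit edge.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ProcessState {
    Ready,
    Running,
    Blocked,
    Zombie, // exited, waiting for the parent to collect the status
}

// Hypothetical event names, one per edge in the diagram.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Event {
    Scheduled,
    Preempted,
    WaitForIo,
    IoComplete,
    Exit,
}

/// Returns the next state, or None if the diagram has no such edge.
fn transition(state: ProcessState, event: Event) -> Option<ProcessState> {
    match (state, event) {
        (ProcessState::Ready, Event::Scheduled) => Some(ProcessState::Running),
        (ProcessState::Running, Event::Preempted) => Some(ProcessState::Ready),
        (ProcessState::Running, Event::WaitForIo) => Some(ProcessState::Blocked),
        (ProcessState::Blocked, Event::IoComplete) => Some(ProcessState::Ready),
        (ProcessState::Running, Event::Exit) => Some(ProcessState::Zombie),
        _ => None,
    }
}
```

Note what is missing: there is no edge from Blocked straight to Running. A process whose I/O completes goes back to Ready and has to be picked by the scheduler like everyone else.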
// Your code does this:
let data = std::fs::read("big_file.dat")?;
// What actually happens:
// 1. Process is RUNNING — calls read() syscall
// 2. Kernel initiates disk I/O
// 3. Disk is slow — data won't arrive for milliseconds
// 4. Kernel marks process as BLOCKED
// 5. Scheduler picks another READY process to run
// 6. ... time passes, disk completes ...
// 7. Interrupt fires — kernel marks process as READY
// 8. Scheduler eventually picks it — process is RUNNING again
// 9. read() returns with the data
//
// Your code saw none of this. It blocked on read() and got data.
The transition from Running → Blocked → Ready → Running is the heartbeat of every multitasking OS. It’s why your machine feels responsive even with hundreds of processes: most of them are blocked, waiting for something. The ones that are ready compete for CPU time, and the scheduler rotates through them fast enough that each one gets a fair share.
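You can provoke this cycle on purpose. In the sketch below, a child running `cat` has nothing to read at first, so its `read()` on the empty pipe blocks until the parent writes. The function name `echo_through_cat` is hypothetical, and the sketch assumes a Unix-like system with `cat` on the PATH.

```rust
use std::io::{self, Write};
use std::process::{Command, Stdio};

/// Hypothetical helper: send `message` through a `cat` child and return
/// whatever comes back on its stdout.
fn echo_through_cat(message: &str) -> io::Result<String> {
    let mut child = Command::new("cat")
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .spawn()?;

    // Until we write, cat's read() on the empty pipe cannot complete,
    // so the kernel keeps it off the run queue.
    let mut stdin = child.stdin.take().expect("stdin was piped");
    stdin.write_all(message.as_bytes())?;

    // Closing our end signals end-of-file, so cat can finish and exit.
    drop(stdin);

    let output = child.wait_with_output()?;
    Ok(String::from_utf8_lossy(&output.stdout).into_owned())
}
```

The parent goes through the same cycle in reverse: `wait_with_output()` blocks it until the child has written its output and exited. This is fine for short messages; a large payload would need the write and the read to happen concurrently.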
Creating Processes: fork() and exec()
Unix creates processes in two steps. This design seems strange at first, but it’s one of the most elegant abstractions in systems programming.
fork() duplicates the calling process. The kernel creates a new task_struct, copies the page table (using copy-on-write, covered below), duplicates the file descriptor table, and assigns a new PID. The child is an almost-exact clone of the parent.
exec() replaces the current process's memory with a new program. The kernel loads a fresh binary, sets up a new stack, and resets the instruction pointer to the new entry point. Same PID, same open file descriptors (except any marked close-on-exec), completely different code.
use std::process::Command;

fn main() {
    // On Unix, Rust's Command API behaves like fork() + exec() + wait()
    // (std may use posix_spawn() internally on some platforms)
    let output = Command::new("ls")
        .arg("-la")
        .output()
        .expect("failed to execute ls");
    println!("ls returned: {}", output.status);
}
// Under the hood:
// 1. fork() — creates a child process (clone of parent)
// 2. In the child: exec("ls", ["-la"]) — replaces memory with /bin/ls
// 3. Parent calls wait() — blocks until child exits
// 4. Child runs ls, exits
// 5. Parent wakes up, reads exit status
sequenceDiagram
participant P as Parent (PID 100)
participant K as Kernel
participant C as Child (PID 101)
P->>K: fork()
Note over K: Clone task_struct<br/>Copy page table (COW)<br/>Copy file descriptors<br/>Assign PID 101
K->>P: returns 101 (child PID)
K->>C: returns 0 (I am the child)
C->>K: exec("/bin/ls", ["-la"])
Note over K: Replace address space<br/>Load new binary<br/>Reset instruction pointer
C->>C: Runs /bin/ls
C->>K: exit(0)
P->>K: wait()
K->>P: Child exited with status 0
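The last two arrows in the diagram, `exit()` and `wait()`, are how a status code travels from child to parent. A minimal sketch, assuming a POSIX `sh` is available; `child_exit_code` is a name invented for this example.

```rust
use std::io;
use std::process::Command;

/// Hypothetical helper: run a shell that exits with `code` and report
/// what the parent observes after waiting for it.
fn child_exit_code(code: u8) -> io::Result<Option<i32>> {
    let status = Command::new("sh")
        .arg("-c")
        .arg(format!("exit {}", code))
        .status()?; // spawns the child, then waits for it to exit

    // code() is None when the child was killed by a signal instead.
    Ok(status.code())
}
```

Between the child's exit and the parent's wait, the child sits in the Zombie state: its memory is gone, but the kernel keeps the exit status around until someone collects it.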
Why two steps? Because it gives you a window between fork and exec where the child can set itself up — redirect file descriptors, change the working directory, drop privileges — all before the new program starts. This composability is why Unix pipes work, why shell redirection works, why process supervision works.
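Rust's `Command` builder exposes that window as configuration: settings such as `current_dir` and `stdin` are applied to the child before the new program starts, and the parent is left untouched. A small sketch, assuming a Unix-like system with a `pwd` binary that accepts `-P`; the helper name `pwd_in` is hypothetical.

```rust
use std::io;
use std::process::{Command, Stdio};

/// Hypothetical helper: run `pwd` in a child whose working directory was
/// changed before the program started, and return what it printed.
fn pwd_in(dir: &str) -> io::Result<String> {
    let output = Command::new("pwd")
        .arg("-P")            // print the physical directory, ignoring $PWD
        .current_dir(dir)     // applied to the child only
        .stdin(Stdio::null()) // the child's fd 0 is rewired as well
        .output()?;
    Ok(String::from_utf8_lossy(&output.stdout).trim_end().to_string())
}
```

The `pwd` program contains no code for changing directories. It simply starts life in a world that was arranged for it.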
// This is why you can do: ls -la | grep ".rs"
//
// Shell forks child 1 (for ls):
// - Redirects stdout to pipe write-end
// - exec("ls", ["-la"])
//
// Shell forks child 2 (for grep):
// - Redirects stdin to pipe read-end
// - exec("grep", [".rs"])
//
// The shell creates the pipe before forking, so each child inherits both
// ends and rewires its own stdin or stdout in the window between fork()
// and exec(). The programs themselves know nothing about it.
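The same plumbing can be sketched with `std::process`, with the parent playing the role of the shell. To keep the output predictable this uses `printf` as the producer instead of `ls`; both `printf` and `grep` are assumed to be on the PATH, and `run_pipeline` is a made-up name.

```rust
use std::io;
use std::process::{Command, Stdio};

/// Hypothetical helper, roughly: printf '...' | grep '\.rs$'
fn run_pipeline() -> io::Result<String> {
    // Child 1: stdout is connected to the write end of a fresh pipe.
    let mut producer = Command::new("printf")
        .arg("main.rs\nnotes.txt\nlib.rs\n")
        .stdout(Stdio::piped())
        .spawn()?;

    // Hand the read end of that pipe to child 2 as its stdin.
    let pipe_read_end = producer.stdout.take().expect("stdout was piped");
    let consumer = Command::new("grep")
        .arg("\\.rs$")
        .stdin(Stdio::from(pipe_read_end))
        .stdout(Stdio::piped())
        .spawn()?;

    // Collect grep's output, then reap the producer as well.
    let output = consumer.wait_with_output()?;
    producer.wait()?;
    Ok(String::from_utf8_lossy(&output.stdout).into_owned())
}
```

Neither program was told it is part of a pipeline. Each one reads fd 0 and writes fd 1, and the parent decided what those descriptors point at.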
Context Switching: The Expensive Illusion
The CPU runs one process at a time (per core). To give sixty processes the illusion of simultaneous execution, the kernel rapidly switches between them. Each switch is a context switch, and it's one of the most performance-sensitive paths in the kernel.
Here’s what happens when the scheduler decides to switch from Process A to Process B:
sequenceDiagram
participant A as Process A (Running)
participant K as Kernel
participant B as Process B (Ready)
Note over A: Timer interrupt fires
A->>K: Save A's registers to A's task_struct
Note over K: Save: rax, rbx, rcx, rdx, rsi, rdi,<br/>rsp, rbp, r8-r15, rip, rflags, FPU/SSE state
K->>K: Switch page table to B's address space
Note over K: Load CR3 with B's page table base<br/>A's cached TLB translations no longer apply (flushed, or tagged out via PCID)
K->>B: Restore B's registers from B's task_struct
Note over B: B resumes exactly where it left off
Note over A: A is now Ready — waiting in the queue
The cost breakdown:
- Save registers: fast, a few dozen stores (~50 ns)
- Switch page table: a write to the CR3 register (~100 ns)
- TLB flush: the real cost. The Translation Lookaside Buffer caches virtual-to-physical address translations. When you switch address spaces, the cached translations belong to the wrong process and must be discarded (or, on CPUs with tagged TLBs such as x86 PCID, ignored). Until the TLB warms up again, each access to a page it has not seen costs a page-table walk. This can add microseconds to subsequent execution.
- Cache pollution: Process B's data isn't in the CPU cache. It was evicted by Process A's working set. Cache misses pile up until B's data warms the cache.
// A context switch itself takes ~1-5 microseconds.
// But the *indirect* cost (TLB misses, cache misses) can
// slow the next process for hundreds of microseconds.
//
// This is why:
// - Servers use thread pools instead of process-per-request
// - Game engines pin threads to cores
// - High-frequency trading systems isolate CPUs with isolcpus
// - The scheduler tries to minimize unnecessary switches
The restoration invariant: after a context switch, the resumed process cannot distinguish its state from what it was before the switch. Registers, flags, instruction pointer, stack pointer — all exactly as they were. The only observable effect is elapsed wall-clock time.
If the kernel fails to save even one register — say, it forgets to save the floating-point state — a process that was computing 3.14 * r * r might resume and find r has been overwritten by whatever the previous process left in that register. Silent data corruption. No crash. No error message. Just wrong answers.
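The invariant is easy to state as code. Below is a toy model with four registers instead of the real set; `Cpu`, `Task`, and `context_switch` are illustrative names, and a real kernel does this in assembly, not with struct copies.

```rust
// A deliberately tiny register file: three integer registers and one
// floating-point register standing in for the FPU/SSE state.
#[derive(Debug, Clone, Copy, PartialEq, Default)]
struct RegisterSet {
    rip: u64,
    rsp: u64,
    rax: u64,
    xmm0: f64,
}

struct Task {
    pid: u32,
    saved: RegisterSet, // what the kernel stored when this task last ran
}

struct Cpu {
    regs: RegisterSet, // the one physical register file everyone shares
}

impl Cpu {
    /// Save the outgoing task's registers, then load the incoming task's.
    fn context_switch(&mut self, from: &mut Task, to: &Task) {
        from.saved = self.regs;
        self.regs = to.saved;
    }
}
```

If `context_switch` copied only `rip`, `rsp`, and `rax`, then whatever the other task left in `xmm0` would still be there when the first task resumed. That is the silent corruption described above.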
Copy-on-Write: Faking the Copy
fork() says “copy the entire address space.” A process might have gigabytes of memory. Copying all of it on every fork would be absurdly expensive — and most of the time, the child immediately calls exec() and throws it all away.
The kernel cheats. It doesn’t copy memory at all. It marks both the parent’s and child’s page table entries as read-only and points them at the same physical pages. Both processes share the same memory. Neither knows.
When either process tries to write to a shared page, the CPU traps (page fault). The kernel catches it, copies just that one page, maps the copy into the writing process’s address space as read-write, and lets the write proceed. Only pages that are actually modified get copied.
graph TB
subgraph Before["After fork() — shared pages"]
direction LR
PA1["Parent page table"] -->|"read-only"| PHY1["Physical Page (shared)"]
CA1["Child page table"] -->|"read-only"| PHY1
end
subgraph After["After child writes — COW triggered"]
direction LR
PA2["Parent page table"] -->|"read-only"| PHY2["Original Page"]
CA2["Child page table"] -->|"read-write"| PHY3["Copied Page (modified)"]
end
Before -->|"child writes to page"| After
style Before fill:#1a1a2e,stroke:#e94560,color:#fff
style After fill:#0f3460,stroke:#16213e,color:#fff
style PHY1 fill:#e94560,stroke:#e94560,color:#fff
style PHY3 fill:#16c79a,stroke:#16c79a,color:#fff
This is copy-on-write (COW). It turns fork from an operation proportional to the process's memory into one proportional to its page tables, which are roughly 500 times smaller (one 8-byte entry per 4 KB page on x86-64). A fork of a large process finishes far sooner than a full copy would, because no data page is copied until someone writes to it.
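Rust's standard library has a user-space analogue of this trick: `Rc::make_mut` clones the pointed-to value only if someone else still shares it. The sketch below uses it to model an address space as a table of shared pages. It is an analogy under that assumption, not how the kernel implements COW (there are no page faults here), and `AddressSpace` is an invented type.

```rust
use std::rc::Rc;

const PAGE_SIZE: usize = 4096;
type Page = [u8; PAGE_SIZE];

/// Toy address space: a "page table" of reference-counted pages.
#[derive(Clone)]
struct AddressSpace {
    pages: Vec<Rc<Page>>,
}

impl AddressSpace {
    fn new(page_count: usize) -> Self {
        let pages = (0..page_count).map(|_| Rc::new([0u8; PAGE_SIZE])).collect();
        Self { pages }
    }

    /// "fork": copy only the table of references, not the page contents.
    fn fork(&self) -> Self {
        self.clone()
    }

    fn read(&self, addr: usize) -> u8 {
        self.pages[addr / PAGE_SIZE][addr % PAGE_SIZE]
    }

    /// Write one byte. If the page is still shared, make_mut copies just
    /// that page first, so the other address space never sees the change.
    fn write(&mut self, addr: usize, value: u8) {
        let page = Rc::make_mut(&mut self.pages[addr / PAGE_SIZE]);
        page[addr % PAGE_SIZE] = value;
    }

    fn shares_page_with(&self, other: &Self, index: usize) -> bool {
        Rc::ptr_eq(&self.pages[index], &other.pages[index])
    }
}
```

After `let mut child = parent.fork();`, every page is shared. A single `child.write(...)` unshares exactly one page, and the parent's copy of that page keeps its old contents.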
// Why COW matters:
//
// Without COW:
// fork() a 1GB process → copy 1GB → child calls exec() → throw away 1GB
// Total waste: 1GB copied, 1GB freed, 0 bytes useful
//
// With COW:
// fork() a 1GB process → copy page tables (up to ~2MB with 4KB pages) → child calls exec()
// Total waste: a couple of megabytes of page-table entries
//
// This is why fork() is fast enough to use in a shell that spawns
// hundreds of processes per script.
Putting It Together
The process is the kernel’s foundational abstraction. Everything else builds on it:
| Concept | How it builds on the process |
|---|---|
| Threads | Processes that share an address space |
| Containers | Processes with namespace isolation |
| Virtual machines | Processes that emulate hardware |
| Shell pipelines | Chains of processes connected by pipes |
Every process believes it owns the machine. The kernel maintains this belief through:
- A separate address space (page tables) — so your memory is private
- Saved register state (task_struct) — so context switches are invisible
- Validated system calls (the trust boundary) — so you can’t break out
- Copy-on-write — so creating new processes is cheap
The process isolation invariant: no process can observe or modify another process’s state without explicit, kernel-mediated permission. Shared memory, pipes, signals — these are all opt-in. The default is total isolation. Every process is alone until the kernel says otherwise.
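A small experiment makes the default visible. In this sketch a child shell changes its working directory, and the parent checks whether its own changed. It assumes a POSIX `sh`, and `child_cd_is_private` is a hypothetical name.

```rust
use std::env;
use std::io;
use std::path::PathBuf;
use std::process::Command;

/// Hypothetical helper: returns (parent cwd before, child cwd, parent cwd after).
fn child_cd_is_private() -> io::Result<(PathBuf, String, PathBuf)> {
    let before = env::current_dir()?;

    // The child changes *its own* working directory and reports it.
    let output = Command::new("sh").arg("-c").arg("cd / && pwd").output()?;
    let child_dir = String::from_utf8_lossy(&output.stdout).trim_end().to_string();

    // The parent's working directory is per-process state the child cannot reach.
    let after = env::current_dir()?;
    Ok((before, child_dir, after))
}
```

The child ends up in `/` while the parent stays where it was. This is also why `cd` has to be a shell builtin: an external `cd` program could only ever change its own directory.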
When this invariant breaks, the consequences scale with the breach. A memory corruption bug that leaks one process's data to another is a security vulnerability. A CPU bug that leaks data across address spaces, like Meltdown, is an industry-wide incident: it forced emergency patches into every major operating system.
Next up: the scheduler — the kernel subsystem that decides who runs, for how long, and what happens when everyone wants the CPU at once. Fairness, starvation, and the invariant that every runnable process eventually gets its turn.
This is Post 3 of the series Invariants the Kernel Keeps — operating systems through the guarantees the kernel makes, and what happens when they break.