The Process: How the Kernel Fakes an Entire Machine

A process is not a program. It's the kernel's most fundamental illusion — a private machine that never existed. This post opens the box.

Posted May 18, 2026

By Shreyas K S

13 min read

In the last post, we established the rule: user code cannot touch hardware. Every request goes through a system call, every system call is validated, and the kernel never trusts you. That’s the trust invariant — the wall between your code and the machine.

But walls alone aren’t enough. The kernel doesn’t just block access to hardware. It creates a fiction. Every program you’ve ever launched believed it was the only thing running. It had its own memory. Its own CPU. Its own view of the filesystem. None of that was real. The kernel fabricated all of it.

That fiction has a name: the process.

A Program Is Not a Process

A program is a file on disk. Bytes. Instructions encoded in ELF or Mach-O format, sitting inert until something brings them to life.

A process is a program in execution — the living, running instance with state the kernel tracks.

  
// This is a program (static, on disk):
// /usr/bin/my_app — compiled binary, ~2MB

// When you run it, the kernel creates a process:
// - Loads the binary into a fresh address space
// - Sets the instruction pointer to the entry point
// - Allocates a stack
// - Assigns a PID
// - Starts execution

// Run it twice? Two processes. Same program. Separate everything.

You can run the same program ten times. You get ten processes — each with its own memory, its own PID, its own registers, its own file descriptors. They share nothing unless you explicitly arrange it. This is not a convenience. This is the isolation invariant at work.

graph LR
    subgraph Disk["On Disk"]
        BIN["my_app (binary)"]
    end

    BIN -->|"exec #1"| P1["Process 1\nPID: 1001\nMemory: 0x00..."]
    BIN -->|"exec #2"| P2["Process 2\nPID: 1002\nMemory: 0x00..."]
    BIN -->|"exec #3"| P3["Process 3\nPID: 1003\nMemory: 0x00..."]

    style Disk fill:#0f3460,stroke:#16213e,color:#fff
    style P1 fill:#1a1a2e,stroke:#e94560,color:#fff
    style P2 fill:#1a1a2e,stroke:#e94560,color:#fff
    style P3 fill:#1a1a2e,stroke:#e94560,color:#fff

Same binary. Three processes. Three separate universes. The kernel sets up this separation, and the memory-management hardware enforces it on every memory access.
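To see the "same program, separate processes" claim from user space, here is a minimal sketch that launches one binary several times and collects the PIDs the kernel assigned. The helper name `spawn_many` and the choice of `true` as the program are assumptions made for illustration, and the sketch assumes a Unix-like system.

```rust
use std::io;
use std::process::{Command, Stdio};

/// Launches `count` instances of the same program and returns their PIDs.
/// `spawn_many` is a hypothetical helper, not part of any library.
fn spawn_many(program: &str, count: usize) -> io::Result<Vec<u32>> {
    let mut children = Vec::with_capacity(count);
    for _ in 0..count {
        children.push(Command::new(program).stdout(Stdio::null()).spawn()?);
    }
    // Read the PIDs before reaping: a PID is not recycled until its
    // process has been waited on, so these values are distinct.
    let pids: Vec<u32> = children.iter().map(|c| c.id()).collect();
    for mut child in children {
        child.wait()?;
    }
    Ok(pids)
}

fn main() -> io::Result<()> {
    let pids = spawn_many("true", 3)?;
    println!("one binary, three processes: {:?}", pids);
    Ok(())
}
```

The exact numbers differ on every run. The point is only that each launch gets its own identity.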

What the Kernel Tracks

Every process is expensive. Not because execution is expensive — addition and branching are cheap — but because the kernel must maintain a complete description of the process’s world. In Linux, that description lives in a structure called task_struct.

Here’s a simplified view of what it contains:

  
// Simplified model of what the kernel tracks per process.
// The real Linux task_struct is over 600 lines. This captures the essentials.

struct ProcessControlBlock {
    // Identity
    pid: u32,                    // unique process ID
    ppid: u32,                   // parent process ID
    state: ProcessState,         // running, ready, blocked

    // CPU state (saved on context switch)
    registers: RegisterSet,      // rax, rbx, rcx, ... rip, rsp, rflags

    // Memory
    page_table: *mut PageTable,  // virtual → physical address mapping

    // Files
    open_files: Vec<FileDescriptor>,  // fd 0=stdin, 1=stdout, 2=stderr, ...

    // Scheduling
    priority: i32,
    time_slice_remaining: u64,   // nanoseconds left before preemption

    // Accounting
    user_time: u64,              // CPU time in user mode
    kernel_time: u64,            // CPU time in kernel mode
}

enum ProcessState {
    Running,    // currently on a CPU
    Ready,      // wants to run, waiting for CPU
    Blocked,    // waiting for I/O or event
    Zombie,     // exited, parent hasn't collected status
}

This is the kernel’s mental model of your process. When the scheduler decides “process 1001 runs next,” it loads this structure, restores all the registers, switches to the correct page table, and jumps to wherever rip points. Your process resumes exactly where it left off, unaware that it was ever paused.

The process invariant: every process executes as if it were the only thing running on the machine. The kernel saves and restores enough state to maintain this illusion across every context switch, every interrupt, every preemption.
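The identity fields in the model above (pid and ppid) are among the few that are easy to observe from user space. As a small sketch, assuming a POSIX `sh` is on the PATH, a child shell can report its parent's PID and we can compare that with our own. The function name `child_reported_ppid` is made up for this example.

```rust
use std::io;
use std::process::{self, Command};

/// Asks a child shell for its parent PID (`$PPID` is a POSIX shell variable).
fn child_reported_ppid() -> io::Result<u32> {
    let out = Command::new("sh").arg("-c").arg("echo $PPID").output()?;
    let text = String::from_utf8_lossy(&out.stdout);
    text.trim()
        .parse::<u32>()
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}

fn main() -> io::Result<()> {
    let me = process::id();
    let seen_by_child = child_reported_ppid()?;
    // The kernel recorded us as the child's parent when it was created.
    println!("my pid = {}, child's ppid = {}", me, seen_by_child);
    Ok(())
}
```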

Process States

While it is alive, a process is in one of three main states (the Zombie state from the enum above only applies after exit). The transitions between them tell you everything about how the OS schedules work.

stateDiagram-v2
    [*] --> Ready: created
    Ready --> Running: dispatched by scheduler
    Running --> Ready: preempted (time slice expired)
    Running --> Blocked: waiting for I/O / event
    Blocked --> Ready: I/O complete / event arrived
    Running --> [*]: exit()

Running: the process is on a CPU right now, executing instructions. On a 4-core machine, at most 4 processes are in this state at any moment.

Ready: the process could run, but the scheduler hasn’t picked it yet. It’s in the run queue, waiting its turn. When more processes want the CPU than there are cores, this is where the extras wait.

Blocked: the process asked for something that isn’t ready yet. A disk read. A network packet. A mutex held by someone else. The kernel takes it off the run queue entirely. No point scheduling a process that can’t make progress. On a typical machine, this is where most processes spend most of their time.

  
// Your code does this:
let data = std::fs::read("big_file.dat")?;

// What actually happens:
// 1. Process is RUNNING — calls read() syscall
// 2. Kernel initiates disk I/O
// 3. Disk is slow — data won't arrive for milliseconds
// 4. Kernel marks process as BLOCKED
// 5. Scheduler picks another READY process to run
// 6. ... time passes, disk completes ...
// 7. Interrupt fires — kernel marks process as READY
// 8. Scheduler eventually picks it — process is RUNNING again
// 9. read() returns with the data
//
// Your code saw none of this. It blocked on read() and got data.

The transition from Running → Blocked → Ready → Running is the heartbeat of every multitasking OS. It’s why your machine feels responsive even with hundreds of processes: most of them are blocked, waiting for something. The ones that are ready compete for CPU time, and the scheduler rotates through them fast enough that each one gets a fair share.
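The state diagram can be written down as a small transition function. This is a toy model, not kernel code: the `Event` names and the extra `Exited` state are assumptions introduced to label the edges of the diagram.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum State {
    Ready,
    Running,
    Blocked,
    Exited,
}

#[derive(Debug, Clone, Copy)]
enum Event {
    Dispatch,   // scheduler picks the process
    Preempt,    // time slice expired
    WaitForIo,  // process asked for something not ready yet
    IoComplete, // the awaited event arrived
    Exit,       // process called exit()
}

/// Returns the next state, or None if the diagram has no such edge.
fn step(state: State, event: Event) -> Option<State> {
    match (state, event) {
        (State::Ready, Event::Dispatch) => Some(State::Running),
        (State::Running, Event::Preempt) => Some(State::Ready),
        (State::Running, Event::WaitForIo) => Some(State::Blocked),
        (State::Blocked, Event::IoComplete) => Some(State::Ready),
        (State::Running, Event::Exit) => Some(State::Exited),
        _ => None,
    }
}

fn main() {
    // The read() walkthrough: Running -> Blocked -> Ready -> Running.
    let mut state = State::Running;
    for event in [Event::WaitForIo, Event::IoComplete, Event::Dispatch] {
        state = step(state, event).expect("edge exists in the diagram");
        println!("{:?} -> {:?}", event, state);
    }
}
```

Note what the function refuses: a blocked process can never be dispatched directly. It has to become Ready first, which is exactly why the kernel can ignore blocked processes when choosing what to run.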

Creating Processes: fork() and exec()

Unix creates processes in two steps. This design seems strange at first, but it’s one of the most elegant abstractions in systems programming.

fork(): duplicates the calling process. The kernel creates a new task_struct, copies the page table (using copy-on-write, covered below), duplicates the file descriptor table, and assigns a new PID. The child is an almost-exact clone of the parent.

exec(): replaces the current process’s memory with a new program. The kernel loads a fresh binary, sets up a new stack, resets the instruction pointer to the new entry point. Same PID, same open file descriptors (except any marked close-on-exec), completely different code.

  
use std::process::Command;

fn main() {
    // On Unix, Command handles process creation for you: conceptually
    // fork() + exec(), though std may use posix_spawn() where available.
    let output = Command::new("ls")
        .arg("-la")
        .output()
        .expect("failed to execute ls");

    println!("ls returned: {}", output.status);
}

// Under the hood (conceptually; the exact calls vary by platform):
// 1. fork()  — creates a child process (clone of parent)
// 2. In the child: exec("ls", ["-la"])  — replaces memory with /bin/ls
// 3. Parent calls wait() — blocks until child exits
// 4. Child runs ls, exits
// 5. Parent wakes up, reads exit status

sequenceDiagram
    participant P as Parent (PID 100)
    participant K as Kernel
    participant C as Child (PID 101)

    P->>K: fork()
    Note over K: Clone task_struct<br/>Copy page table (COW)<br/>Copy file descriptors<br/>Assign PID 101
    K->>P: returns 101 (child PID)
    K->>C: returns 0 (I am the child)

    C->>K: exec("/bin/ls", ["-la"])
    Note over K: Replace address space<br/>Load new binary<br/>Reset instruction pointer

    C->>C: Runs /bin/ls
    C->>K: exit(0)

    P->>K: wait()
    K->>P: Child exited with status 0
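The last two arrows of the diagram, exit() and wait(), can be made explicit with `spawn` and `wait` instead of the all-in-one `output`. This is a sketch under the assumption that `sh` is available. The helper name `spawn_and_reap` is invented for the example.

```rust
use std::io;
use std::process::Command;

/// Spawns `sh -c "exit <code>"`, waits for it, and returns
/// (child PID, exit code as seen by the parent).
fn spawn_and_reap(code: i32) -> io::Result<(u32, Option<i32>)> {
    let mut child = Command::new("sh")
        .arg("-c")
        .arg(format!("exit {}", code))
        .spawn()?; // the child now exists with its own PID
    let pid = child.id();
    // The parent blocks here until the child exits. Until this call
    // collects the status, an exited child lingers as a zombie.
    let status = child.wait()?;
    Ok((pid, status.code()))
}

fn main() -> io::Result<()> {
    let (pid, code) = spawn_and_reap(7)?;
    println!("child {} exited with {:?}", pid, code);
    Ok(())
}
```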

Why two steps? Because it gives you a window between fork and exec where the child can set itself up — redirect file descriptors, change the working directory, drop privileges — all before the new program starts. This composability is why Unix pipes work, why shell redirection works, why process supervision works.
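One piece of that child-side setup is easy to try: changing the working directory for the child only. The sketch below uses `Command::current_dir`, which applies the change in the child before the new program starts. Whichever spawn mechanism the standard library picks underneath, the observable effect is the one described above. The helper name `pwd_in` is an assumption for illustration.

```rust
use std::env;
use std::io;
use std::process::Command;

/// Runs `pwd` in a child whose working directory is set to `dir`
/// before the program starts, and returns what the child printed.
fn pwd_in(dir: &str) -> io::Result<String> {
    let out = Command::new("pwd").current_dir(dir).output()?;
    Ok(String::from_utf8_lossy(&out.stdout).trim_end().to_string())
}

fn main() -> io::Result<()> {
    let before = env::current_dir()?;
    println!("child saw: {}", pwd_in("/")?);
    // The parent's own working directory was never touched.
    assert_eq!(env::current_dir()?, before);
    Ok(())
}
```

The `pwd` program has no idea anything was arranged for it. It simply starts life in a different directory than its parent.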

  
// This is why you can do:  ls -la | grep ".rs"
//
// Shell forks child 1 (for ls):
//   - Redirects stdout to pipe write-end
//   - exec("ls", ["-la"])
//
// Shell forks child 2 (for grep):
//   - Redirects stdin to pipe read-end
//   - exec("grep", [".rs"])
//
// The shell creates the pipe before forking, so both children inherit
// its file descriptors. The redirection happens between fork() and
// exec(). The programs themselves know nothing about it.
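The same wiring can be sketched with Rust's standard library, which plays the role of the shell here. The function name `pipeline` and its slice-based argument format are assumptions for this example, and both slices are assumed to be non-empty.

```rust
use std::io;
use std::process::{Command, Stdio};

/// Runs `producer | consumer` and returns the consumer's stdout.
/// Each slice is a program name followed by its arguments.
fn pipeline(producer: &[&str], consumer: &[&str]) -> io::Result<String> {
    let mut first = Command::new(producer[0])
        .args(&producer[1..])
        .stdout(Stdio::piped()) // child's stdout becomes the pipe's write end
        .spawn()?;
    let read_end = first.stdout.take().expect("stdout was piped");
    let output = Command::new(consumer[0])
        .args(&consumer[1..])
        .stdin(Stdio::from(read_end)) // child's stdin becomes the read end
        .output()?;
    first.wait()?;
    Ok(String::from_utf8_lossy(&output.stdout).into_owned())
}

fn main() -> io::Result<()> {
    // Equivalent of: ls -la | grep ".rs"
    print!("{}", pipeline(&["ls", "-la"], &["grep", ".rs"])?);
    Ok(())
}
```

Neither `ls` nor `grep` is told about the pipe. Each just reads fd 0 and writes fd 1, and the setup done before the program starts decides where those lead.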

Context Switching: The Expensive Illusion

The CPU runs one process at a time (per core). To give sixty processes the illusion of simultaneous execution, the kernel rapidly switches between them. Each switch is a context switch, and it’s one of the most performance-critical operations in the kernel.

Here’s what happens when the scheduler decides to switch from Process A to Process B:

sequenceDiagram
    participant A as Process A (Running)
    participant K as Kernel
    participant B as Process B (Ready)

    Note over A: Timer interrupt fires

    A->>K: Save A's registers to A's task_struct
    Note over K: Save: rax, rbx, rcx, rdx, rsi, rdi,<br/>rsp, rbp, r8-r15, rip, rflags, FPU/SSE state

    K->>K: Switch page table to B's address space
    Note over K: Load CR3 with B's page table base<br/>Stale TLB entries are invalidated (PCID/ASID tagging can avoid a full flush)

    K->>B: Restore B's registers from B's task_struct
    Note over B: B resumes exactly where it left off

    Note over A: A is now Ready — waiting in the queue

The cost breakdown:

Save registers: fast, a few dozen stores (on the order of 50 ns)
Switch page table: a single write to the CR3 register (on the order of 100 ns)
TLB flush: often the real cost. The Translation Lookaside Buffer caches virtual-to-physical address translations. When you switch address spaces, those cached translations no longer apply, so unless the CPU tags entries per address space (PCID on x86-64, ASID on ARM) they are discarded. Every first touch of a page then misses the TLB and pays for a page-table walk until the TLB warms up again. This can cost microseconds of subsequent execution.
Cache pollution: Process B’s data isn’t in the CPU cache. It was evicted by Process A’s working set. Cache misses pile up until B’s data warms the cache.

  
// A context switch itself takes ~1-5 microseconds.
// But the *indirect* cost (TLB misses, cache misses) can
// slow the next process for hundreds of microseconds.
//
// This is why:
// - Servers use thread pools instead of process-per-request
// - Game engines pin threads to cores
// - High-frequency trading systems isolate CPUs with isolcpus
// - The scheduler tries to minimize unnecessary switches
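A quick back-of-the-envelope calculation shows why switch rate matters. The 5 µs figure is the upper end of the direct cost quoted above. The rate of 10,000 switches per second is a hypothetical workload chosen only to make the arithmetic concrete, and the function name is made up for the example.

```rust
/// Fraction of one core's time consumed by the direct cost of switching.
/// Indirect costs (TLB and cache misses) come on top of this.
fn switch_overhead(switches_per_sec: f64, cost_us: f64) -> f64 {
    switches_per_sec * cost_us / 1_000_000.0
}

fn main() {
    // Hypothetical workload: 10,000 switches/s at 5 µs each.
    let fraction = switch_overhead(10_000.0, 5.0);
    println!("direct switching overhead: {:.1}% of one core", fraction * 100.0);
}
```

Under those assumed numbers, the direct cost alone eats about 5% of a core before any useful work happens.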

The restoration invariant: after a context switch, the resumed process cannot distinguish its state from what it was before the switch. Registers, flags, instruction pointer, stack pointer — all exactly as they were. The only observable effect is elapsed wall-clock time.

If the kernel fails to save even one register — say, it forgets to save the floating-point state — a process that was computing 3.14 * r * r might resume and find r has been overwritten by whatever the previous process left in that register. Silent data corruption. No crash. No error message. Just wrong answers.
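The failure mode is easy to reproduce in a toy model. The sketch below is not kernel code: `Registers`, `Task`, and `Cpu` are simplified stand-ins invented for this example, with a single `xmm0` field representing the floating-point state.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Default)]
struct Registers {
    rip: u64,
    rsp: u64,
    xmm0: f64, // stand-in for the floating-point state
}

#[derive(Debug, Default)]
struct Task {
    saved: Registers,
}

/// The single set of hardware registers on one core.
#[derive(Debug, Default)]
struct Cpu {
    regs: Registers,
}

/// Correct switch: save everything for `from`, load everything for `to`.
fn context_switch(cpu: &mut Cpu, from: &mut Task, to: &Task) {
    from.saved = cpu.regs;
    cpu.regs = to.saved;
}

/// Buggy switch: xmm0 is neither saved nor restored, so whatever is in
/// the hardware register simply stays there across the switch.
fn buggy_switch(cpu: &mut Cpu, from: &mut Task, to: &Task) {
    let live_xmm0 = cpu.regs.xmm0;
    from.saved = Registers { xmm0: from.saved.xmm0, ..cpu.regs };
    cpu.regs = Registers { xmm0: live_xmm0, ..to.saved };
}

/// Runs A, switches to B (which does its own float math), switches back,
/// and returns the value A finds in "its" register.
fn resume_value(switch: fn(&mut Cpu, &mut Task, &Task)) -> f64 {
    let mut cpu = Cpu::default();
    let mut a = Task::default();
    let mut b = Task::default();
    cpu.regs.xmm0 = 2.0; // A is running with r = 2.0
    switch(&mut cpu, &mut a, &b); // A -> B
    cpu.regs.xmm0 = 99.0; // B overwrites the register
    switch(&mut cpu, &mut b, &a); // B -> A
    cpu.regs.xmm0
}

fn main() {
    println!("correct switch: r = {}", resume_value(context_switch));
    println!("buggy switch:   r = {}", resume_value(buggy_switch));
}
```

With the correct switch, A resumes with the value it left. With the buggy one, it silently inherits B's value, and nothing crashes to tell you so.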

Copy-on-Write: Faking the Copy

fork() says “copy the entire address space.” A process might have gigabytes of memory. Copying all of it on every fork would be absurdly expensive — and most of the time, the child immediately calls exec() and throws it all away.

The kernel cheats. It doesn’t copy the data at all. It marks the writable pages read-only in both the parent’s and the child’s page tables and points both at the same physical pages. Both processes share the same memory. Neither knows.

When either process tries to write to a shared page, the CPU traps (page fault). The kernel catches it, copies just that one page, maps the copy into the writing process’s address space as read-write, and lets the write proceed. Only pages that are actually modified get copied.

graph TB
    subgraph Before["After fork() — shared pages"]
        direction LR
        PA1["Parent page table"] -->|"read-only"| PHY1["Physical Page (shared)"]
        CA1["Child page table"] -->|"read-only"| PHY1
    end

    subgraph After["After child writes — COW triggered"]
        direction LR
        PA2["Parent page table"] -->|"read-only"| PHY2["Original Page"]
        CA2["Child page table"] -->|"read-write"| PHY3["Copied Page (modified)"]
    end

    Before -->|"child writes to page"| After

    style Before fill:#1a1a2e,stroke:#e94560,color:#fff
    style After fill:#0f3460,stroke:#16213e,color:#fff
    style PHY1 fill:#e94560,stroke:#e94560,color:#fff
    style PHY3 fill:#16c79a,stroke:#16c79a,color:#fff

This is copy-on-write (COW). It turns fork from an O(memory) copy into an O(page-table-size) copy. In practice, the cost of fork scales with the size of the page tables rather than the size of the data, so even a large process forks in a small fraction of the time a full copy would take, because no data page is copied until someone writes.
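Rust's standard library happens to contain the same trick in miniature: `Rc::make_mut` clones the underlying value only if it is shared. The sketch below uses it to model an address space in user space. `AddressSpace` and its methods are hypothetical names, and reference counts stand in for what the kernel tracks per physical page.

```rust
use std::rc::Rc;

const PAGE_SIZE: usize = 4096;
type Page = [u8; PAGE_SIZE];

/// A toy address space: a "page table" of reference-counted pages.
#[derive(Clone)]
struct AddressSpace {
    pages: Vec<Rc<Page>>,
}

impl AddressSpace {
    fn new(num_pages: usize) -> Self {
        let pages = (0..num_pages).map(|_| Rc::new([0u8; PAGE_SIZE])).collect();
        AddressSpace { pages }
    }

    /// "fork": copy only the table of pointers. Every page is now shared.
    fn fork(&self) -> Self {
        self.clone()
    }

    /// Write one byte. `Rc::make_mut` copies the page first if it is
    /// shared, playing the role of the kernel's page-fault handler.
    fn write(&mut self, addr: usize, value: u8) {
        let page = Rc::make_mut(&mut self.pages[addr / PAGE_SIZE]);
        page[addr % PAGE_SIZE] = value;
    }

    fn read(&self, addr: usize) -> u8 {
        self.pages[addr / PAGE_SIZE][addr % PAGE_SIZE]
    }

    /// True if both address spaces point at the same underlying page.
    fn shares_page_with(&self, other: &Self, index: usize) -> bool {
        Rc::ptr_eq(&self.pages[index], &other.pages[index])
    }
}

fn main() {
    let mut parent = AddressSpace::new(4);
    parent.write(10, 7);
    let mut child = parent.fork();
    child.write(PAGE_SIZE + 5, 42); // touches page 1 only
    println!("page 0 still shared: {}", parent.shares_page_with(&child, 0));
    println!("page 1 still shared: {}", parent.shares_page_with(&child, 1));
    println!("parent sees {}, child sees {}",
             parent.read(PAGE_SIZE + 5), child.read(PAGE_SIZE + 5));
}
```

Only the page the child wrote to was duplicated. The other three are still shared, and the parent never sees the child's write.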

  
// Why COW matters:
//
// Without COW:
//   fork() a 1GB process → copy 1GB → child calls exec() → throw away 1GB
//   Total waste: 1GB copied, 1GB freed, 0 bytes useful
//
// With COW:
//   fork() a 1GB process → copy page tables (up to ~2MB: 262,144 entries
//   of 8 bytes each with 4KB pages) → child calls exec()
//   Total waste: a couple of megabytes of page-table entries, no data pages
//
// This is why fork() is fast enough to use in a shell that spawns
// hundreds of processes per script.

Putting It Together

The process is the kernel’s foundational abstraction. Everything else builds on it:

Concept	How It Builds on Processes
Threads	Processes that share an address space
Containers	Processes with namespace isolation
Virtual machines	Processes that emulate hardware
Shell pipelines	Chains of processes connected by pipes

Every process believes it owns the machine. The kernel maintains this belief through:

A separate address space (page tables) — so your memory is private
Saved register state (task_struct) — so context switches are invisible
Validated system calls (the trust boundary) — so you can’t break out
Copy-on-write — so creating new processes is cheap

The process isolation invariant: no process can observe or modify another process’s state without explicit, kernel-mediated permission. Shared memory, pipes, signals — these are all opt-in. The default is total isolation. Every process is alone until the kernel says otherwise.
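Here is what "opt-in" looks like in practice. In this sketch the parent explicitly asks for two pipes to a child `cat` process, and those pipes are the only way data moves between them. The helper name `round_trip` is an assumption, and the example is meant for short messages only, since a large write could fill the pipe before anything reads it.

```rust
use std::io::{self, Write};
use std::process::{Command, Stdio};

/// Sends `message` to a child `cat` over a pipe and returns what comes back.
fn round_trip(message: &str) -> io::Result<String> {
    let mut child = Command::new("cat")
        .stdin(Stdio::piped())  // explicitly requested channel: parent -> child
        .stdout(Stdio::piped()) // explicitly requested channel: child -> parent
        .spawn()?;
    // Take stdin so it can be dropped (closed) after the write.
    // Closing it is how `cat` learns there is no more input.
    let mut stdin = child.stdin.take().expect("stdin was piped");
    stdin.write_all(message.as_bytes())?;
    drop(stdin);
    let output = child.wait_with_output()?;
    Ok(String::from_utf8_lossy(&output.stdout).into_owned())
}

fn main() -> io::Result<()> {
    println!("child echoed: {}", round_trip("hello from the parent")?);
    Ok(())
}
```

Remove the two `Stdio::piped()` lines and the parent has no way to hand the child that string through memory. Every byte that crossed did so through a channel the kernel set up on request.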

When this invariant breaks, the consequences scale with the breach. A memory corruption bug that leaks one process’s data to another is a security vulnerability. A CPU bug that lets a process read memory it should never see, like Meltdown, is an industry-wide incident that forced mitigations into every major operating system.

Next up: the scheduler — the kernel subsystem that decides who runs, for how long, and what happens when everyone wants the CPU at once. Fairness, starvation, and the invariant that every runnable process eventually gets its turn.

This is Post 3 of the series Invariants the Kernel Keeps — operating systems through the guarantees the kernel makes, and what happens when they break.


This post is licensed under CC BY 4.0 by the author.