Crossing the Boundary: System Calls, Privilege, and the Trust Invariant

The kernel never trusts you. Every system call crosses a hardware-enforced privilege boundary — and that boundary is the most important invariant in the entire OS.

In the last post, we said an OS is a collection of invariants. Virtualization, concurrency, persistence — promises the kernel makes and must never break. But we skipped the obvious question: how does it enforce them?

The answer is a boundary. A hardware-enforced, CPU-checked, non-negotiable wall between your code and the kernel. Your program runs on one side. The OS runs on the other. And the only way to cross is through a controlled gate called a system call.

This boundary is the single most important mechanism in operating systems. Every other invariant depends on it.

Two Modes, One CPU

Every modern general-purpose CPU supports at least two privilege levels. x86 calls them rings. ARM calls them exception levels. The names differ. The principle doesn’t.

Ring 3 (user mode) — where your program lives. It can compute, branch, allocate from its own address space, call its own functions. It cannot touch hardware. It cannot modify page tables. It cannot disable interrupts. It cannot read kernel memory. The CPU checks every instruction against the current privilege level and faults on violations.

Ring 0 (kernel mode) — where the OS lives. It can do everything. All instructions. All memory. All devices. All I/O ports. There are no restrictions.

This is not a software convention. It’s a hardware gate. The privilege bit lives in a CPU register, and the only way to flip it from user to kernel is through a trap — a controlled entry point that the kernel itself configured at boot time.
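On x86-64 specifically, the current privilege level is held in the low two bits of the CS segment register, and user code is allowed to read it (just not to raise it). Here is a minimal sketch, assuming an x86-64 target. `current_privilege_level` is a name made up for this example, and on other architectures it simply returns `None`.

```rust
/// Returns the current privilege level on x86-64, or None elsewhere.
/// (Hypothetical helper, written for this illustration.)
#[cfg(target_arch = "x86_64")]
fn current_privilege_level() -> Option<u16> {
    let cs: u16;
    // Reading CS is an unprivileged operation; its low two bits are the CPL.
    unsafe {
        std::arch::asm!(
            "mov {0:x}, cs",
            out(reg) cs,
            options(nomem, nostack, preserves_flags),
        );
    }
    Some(cs & 0b11)
}

#[cfg(not(target_arch = "x86_64"))]
fn current_privilege_level() -> Option<u16> {
    None // This sketch only knows how to ask an x86-64 CPU.
}

fn main() {
    match current_privilege_level() {
        Some(cpl) => println!("running at ring {cpl}"),
        None => println!("privilege level not readable in this sketch"),
    }
}
```

Run as an ordinary process, this should report ring 3. Nothing the program does on its own, short of trapping into the kernel, changes that answer.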

graph TB
    subgraph Ring3["Ring 3 - User Mode"]
        direction LR
        P["Your Program"]
        CAN1["Can: Compute, branch, loop"]
        CANT1["Cannot: Touch hardware"]
        CANT2["Cannot: Modify page tables"]
        CANT3["Cannot: Read kernel memory"]
    end

    P -.->|"SYSCALL instruction"| TRAP["TRAP - Hardware Gate"]

    subgraph Ring0["Ring 0 - Kernel Mode"]
        direction LR
        K["Kernel"]
        CAN2["All instructions"]
        CAN3["All memory"]
        CAN4["All devices"]
    end

    TRAP -->|"controlled entry"| K
    K -->|"SYSRET"| P

    style Ring3 fill:#1a1a2e,stroke:#e94560,color:#fff
    style Ring0 fill:#0f3460,stroke:#16213e,color:#fff
    style TRAP fill:#e94560,stroke:#e94560,color:#fff

The privilege invariant: untrusted code cannot execute privileged instructions without kernel mediation. The CPU enforces this on every instruction, every cycle. No exceptions.

System Calls vs. Procedural Calls

Your program makes function calls constantly. calculate(x, y) pushes arguments, jumps to an address, runs code, returns a value. The entire thing happens in user mode. Same privilege level. Same address space. No boundary crossed. This is a procedural call — a standard interaction between functions within a program.

But when your program calls read(fd, buf, 4096) to read from a file, something fundamentally different happens. Your program can’t read from a disk — it doesn’t have permission to talk to hardware. It needs the kernel. So it has to ask.

// Procedural call — stays in user mode
fn add(a: u64, b: u64) -> u64 {
    a + b  // ~2 nanoseconds. No mode switch. No validation.
}

// System call — crosses into kernel mode
use std::fs::File;
use std::io::Read;

fn read_file() -> std::io::Result<Vec<u8>> {
    let mut f = File::open("data.txt")?;  // syscall: open() (openat() on current Linux)
    let mut buf = Vec::new();
    f.read_to_end(&mut buf)?;             // syscall: read(), one or more calls
    Ok(buf)
    // Each of these triggered a trap into the kernel.
    // The CPU switched to ring 0, validated everything,
    // did the work, switched back to ring 3.
}

The cost difference is staggering. A procedural call takes a few nanoseconds. A system call takes hundreds of nanoseconds, sometimes microseconds, depending on the CPU, the kernel’s security mitigations, and the work requested. You’re paying for a mode switch, register save/restore, argument validation, the actual work, and a mode switch back. On typical hardware that is roughly 100–150x the cost of a regular function call.

This overhead is not waste. It’s the price of safety.
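The gap is easy to measure for yourself. The sketch below times a non-inlined `add` against a one-byte `read()` from `/dev/zero`. It assumes a Unix-like system where that device exists, and `time_per_call` is a helper invented for this example. Treat the printed numbers as a rough indication only: they vary with hardware, kernel version, and compiler flags.

```rust
use std::fs::File;
use std::hint::black_box;
use std::io::Read;
use std::time::{Duration, Instant};

#[inline(never)] // keep this a real call instead of letting it be inlined away
fn add(a: u64, b: u64) -> u64 {
    a.wrapping_add(b)
}

/// Average wall-clock time per invocation of `f` (`iters` must be non-zero).
fn time_per_call(iters: u32, mut f: impl FnMut()) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        f();
    }
    start.elapsed() / iters
}

fn main() -> std::io::Result<()> {
    const ITERS: u32 = 100_000;

    // Stays in user mode the whole time.
    let call = time_per_call(ITERS, || {
        black_box(add(black_box(1), black_box(2)));
    });

    // File is unbuffered, so every read_exact here is a real read() syscall.
    let mut zero = File::open("/dev/zero")?;
    let mut byte = [0u8; 1];
    let syscall = time_per_call(ITERS, || {
        zero.read_exact(&mut byte).expect("read failed");
    });

    println!("function call: {call:?} per call");
    println!("read() syscall: {syscall:?} per call");
    Ok(())
}
```

Build with optimizations (`--release` or `-O`) for a fair comparison. An unoptimized build inflates the function-call side.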

graph LR
    subgraph Procedural["Procedural Call"]
        direction TB
        F1["caller()"] -->|"push args, jump"| F2["function()"]
        F2 -->|"return"| F1
    end

    subgraph Syscall["System Call"]
        direction TB
        U["User code"] -->|"SYSCALL"| T["Hardware Trap"]
        T -->|"ring 0"| K["Kernel: validate, execute"]
        K -->|"SYSRET"| U
    end

    style Procedural fill:#16213e,stroke:#16213e,color:#fff
    style Syscall fill:#1a1a2e,stroke:#e94560,color:#fff
    style T fill:#e94560,stroke:#e94560,color:#fff

Procedural call: same mode, ~2 ns. System call: mode switch + validation, ~300 ns. The 150x overhead buys safety.

What Happens During a System Call

Let’s trace write(1, "hello\n", 6) — writing six bytes to stdout on x86-64 Linux.

Your C library (or Rust’s std) puts the syscall number in %rax (1 for write), the file descriptor in %rdi (1 for stdout), a pointer to the data in %rsi, and the byte count in %rdx. Then it executes the SYSCALL instruction.
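That register setup can be written out by hand. The following is a sketch rather than something to ship: it issues `write` directly with inline assembly and is only meaningful on x86-64 Linux. `raw_write` is a name chosen for this example, and on any other target it falls back to the standard library so the program still builds.

```rust
/// Issue write(2) directly with the SYSCALL instruction (x86-64 Linux only).
/// Returns the byte count, or a negated errno on failure.
#[cfg(all(target_os = "linux", target_arch = "x86_64"))]
fn raw_write(fd: i32, buf: &[u8]) -> isize {
    let fd = fd as isize; // sign-extend into a full 64-bit register
    let ret: isize;
    unsafe {
        std::arch::asm!(
            "syscall",
            inlateout("rax") 1isize => ret, // in: syscall number 1 (write), out: result
            in("rdi") fd,                   // first argument: file descriptor
            in("rsi") buf.as_ptr(),         // second argument: pointer to the data
            in("rdx") buf.len(),            // third argument: byte count
            lateout("rcx") _,               // clobbered: SYSCALL stores the return RIP here
            lateout("r11") _,               // clobbered: SYSCALL stores RFLAGS here
            options(nostack),
        );
    }
    ret
}

/// Portable stand-in for other targets: let std make the call for stdout.
#[cfg(not(all(target_os = "linux", target_arch = "x86_64")))]
fn raw_write(fd: i32, buf: &[u8]) -> isize {
    use std::io::Write;
    if fd != 1 {
        return -1; // the fallback only models stdout
    }
    std::io::stdout().write(buf).map(|n| n as isize).unwrap_or(-1)
}

fn main() {
    let n = raw_write(1, b"hello\n");
    eprintln!("write returned {n}");
}
```

The wrappers in libc and in Rust’s std do essentially this, plus translating a negative return value into `errno` or an `io::Error`.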

The CPU takes over. It saves the user’s instruction pointer in %rcx and the flags in %r11. It loads the kernel’s entry point from a model-specific register (configured at boot; user code can’t change it). It sets the privilege level to ring 0.

Now the kernel runs. Its entry code switches to the kernel stack (the SYSCALL instruction itself leaves %rsp untouched) and saves all user registers. It looks up syscall number 1 in the syscall table. It calls sys_write. And then comes the critical part: it validates everything.
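The table lookup is worth seeing in miniature. This is a toy model of dispatch, not how Linux is written: `Handler`, `SYSCALL_TABLE`, and `dispatch` are names invented here, and the two handlers are stubs. The point is that even the syscall number is user-supplied data and gets bounds-checked before use.

```rust
/// Toy syscall handler: three raw arguments in, result or negated errno out.
type Handler = fn([u64; 3]) -> i64;

const ENOSYS: i64 = 38; // Linux errno for "function not implemented"

fn sys_read(_args: [u64; 3]) -> i64 {
    0 // stub: pretend we hit end-of-file
}

fn sys_write(args: [u64; 3]) -> i64 {
    args[2] as i64 // stub: pretend every requested byte was written
}

// Index = syscall number. On x86-64 Linux, 0 is read and 1 is write.
static SYSCALL_TABLE: [Handler; 2] = [sys_read, sys_write];

fn dispatch(nr: u64, args: [u64; 3]) -> i64 {
    // The number came from user space, so check it before indexing.
    if nr < SYSCALL_TABLE.len() as u64 {
        SYSCALL_TABLE[nr as usize](args)
    } else {
        -ENOSYS
    }
}

fn main() {
    // write(1, buf, 6): the buffer address is a placeholder in this model.
    println!("write -> {}", dispatch(1, [1, 0x1000, 6]));
    println!("bogus -> {}", dispatch(999, [0; 3]));
}
```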

Is fd=1 a valid, open file descriptor for this process? Is the pointer %rsi actually in user-space memory, not kernel memory? Is the byte count reasonable? Does the calling process have write permission on this descriptor?

The kernel checks all of this before it does any work. If any check fails, the syscall returns an error. The kernel never follows a user-provided pointer blindly. It never assumes a file descriptor is valid. It never trusts the byte count.
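Those checks can be sketched as ordinary code. Everything below is a simplified model under stated assumptions: `Process`, `OpenFile`, and `validate_write` are hypothetical, and `USER_SPACE_END` assumes the lower canonical half of an x86-64 address space with 4-level paging. A real kernel does more, such as handling faults while copying, but the shape of the logic is the same.

```rust
const EBADF: i64 = 9; // bad file descriptor
const EFAULT: i64 = 14; // bad address

/// Assumed first address above user space (illustrative, not from a real kernel).
const USER_SPACE_END: u64 = 0x0000_8000_0000_0000;

struct OpenFile {
    writable: bool,
}

/// Toy per-process descriptor table: index = fd, None = not open.
struct Process {
    fds: Vec<Option<OpenFile>>,
}

/// True if [addr, addr + len) lies entirely inside user space.
fn user_range_ok(addr: u64, len: u64) -> bool {
    // checked_add rejects ranges that wrap around the address space.
    match addr.checked_add(len) {
        Some(end) => end <= USER_SPACE_END,
        None => false,
    }
}

/// Every argument is checked before any work is done.
fn validate_write(process: &Process, fd: i32, buf_addr: u64, count: u64) -> Result<(), i64> {
    if fd < 0 {
        return Err(EBADF);
    }
    match process.fds.get(fd as usize) {
        Some(Some(file)) if file.writable => {}
        _ => return Err(EBADF), // not open, or not open for writing
    }
    if !user_range_ok(buf_addr, count) {
        return Err(EFAULT); // pointer reaches outside user space
    }
    Ok(())
}

fn main() {
    let process = Process {
        fds: vec![
            Some(OpenFile { writable: false }), // fd 0: read-only
            Some(OpenFile { writable: true }),  // fd 1: writable
        ],
    };
    println!("{:?}", validate_write(&process, 1, 0x7fff_0000_1000, 6));
    println!("{:?}", validate_write(&process, 1, 0xffff_8000_0000_0000, 6));
}
```

The second call passes a kernel-half address and is rejected with `EFAULT` before anything is dereferenced.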

sequenceDiagram
    participant U as User Program
    participant CPU as CPU (SYSCALL)
    participant K as Kernel

    U->>U: Set registers rax=1, rdi=1, rsi=buf, rdx=6
    U->>CPU: SYSCALL instruction

    Note over CPU: Save user RIP + RFLAGS, switch to Ring 0, load kernel entry point

    CPU->>K: Enter kernel at syscall handler

    Note over K: Save all user registers, look up syscall table[1], call sys_write()

    rect rgb(233, 69, 96)
        Note over K: VALIDATE: Is fd valid? Is buf in userspace? Is count reasonable? Has write permission?
    end

    K->>K: Write bytes to device driver
    K->>CPU: Return value in %rax

    Note over CPU: Restore user registers, switch to Ring 3

    CPU->>U: Continue execution

The trust invariant: the kernel never trusts data from userspace. Every pointer is bounds-checked. Every file descriptor is validated. Every argument is verified before use.

If the kernel skipped this — if it dereferenced a user-provided pointer without checking — a malicious program could pass a kernel address and trick the kernel into overwriting its own data structures. This single class of bugs (missing validation of userspace pointers) has been responsible for dozens of Linux privilege escalation exploits.
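From user space, the visible side of this validation is an error return, not a crash. Safe Rust won’t let us hand the kernel a wild pointer, so this small sketch fails a different check instead: it writes through a descriptor that was opened read-only. The scratch file name and the function name are made up for the example.

```rust
use std::fs::{self, File};
use std::io::{self, Write};

/// Try to write through a read-only handle and report what the kernel says.
fn write_to_read_only_handle() -> io::Result<usize> {
    // Hypothetical scratch file, created only for this demonstration.
    let path = std::env::temp_dir().join("trust_invariant_demo.txt");
    fs::write(&path, b"original")?;

    // File::open asks for read-only access, so this descriptor is not writable.
    let mut file = File::open(&path)?;
    let result = file.write(b"overwrite");

    drop(file);
    let _ = fs::remove_file(&path); // best-effort cleanup
    result
}

fn main() {
    match write_to_read_only_handle() {
        Ok(n) => println!("unexpectedly wrote {n} bytes"),
        Err(e) => println!("kernel refused the write: {e}"),
    }
}
```

The request crosses the boundary, fails validation, and comes back as an `io::Error`. On Linux that error is typically "Bad file descriptor". Nothing on the kernel side was modified.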

Why Not Just Let Programs Talk to Hardware?

Early computers did exactly this. In the 1950s batch processing era, one program ran at a time with full control of the machine. No protection. No isolation. No system calls. The program was the OS.

This worked fine until someone wanted to run two programs. Without kernel mediation:

  • Two programs could write to the same disk sector simultaneously — data corruption
  • A buggy program could overwrite another program’s memory — silent state corruption
  • A crashed program could leave hardware in an undefined state — the next program inherits chaos
  • A malicious program could read anything in memory — passwords, keys, other users’ data
graph TB
    subgraph Without["Without Protection"]
        direction TB
        A1["Program A"] -->|"direct write"| D1["Disk Sector 42"]
        B1["Program B"] -->|"direct write"| D1
        D1 --> CORRUPT["Data Corruption"]
    end

    subgraph With["With Protection"]
        direction TB
        A2["Program A"] -->|"write()"| OS["OS Kernel"]
        B2["Program B"] -->|"write()"| OS
        OS --> D2["Disk"]
        D2 --> SAFE["Safe"]
    end

    style Without fill:#4a0000,stroke:#e94560,color:#fff
    style With fill:#003a1a,stroke:#16c79a,color:#fff
    style CORRUPT fill:#e94560,stroke:#e94560,color:#fff
    style SAFE fill:#16c79a,stroke:#16c79a,color:#fff

The privilege boundary exists because sharing hardware without mediation is chaos. The moment you have two programs on one machine, you need an arbiter. That arbiter needs authority the programs can’t override. Hardware privilege levels give the kernel that authority.

The Cost, and How to Minimize It

System calls are expensive. High-performance systems go to extraordinary lengths to reduce crossings:

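Buffered I/O: println! in Rust doesn’t call write() per character. Standard output is line-buffered, so the text collects in a user-space buffer and goes out in one syscall per line. Wrapping the stream in a BufWriter batches many lines into a single write(). One crossing instead of hundreds.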
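The effect of buffering is easy to count without touching a real file. In this sketch, `CountingWriter` is a hypothetical stand-in for a file descriptor: each call to its `write` method models one `write()` syscall. The 4096-byte capacity is an arbitrary choice for the example.

```rust
use std::io::{self, BufWriter, Write};

/// Stand-in for a file descriptor: counts how often `write` is called.
#[derive(Debug, Default)]
struct CountingWriter {
    write_calls: usize,
}

impl Write for CountingWriter {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.write_calls += 1; // one modelled boundary crossing
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

/// Write `n` single bytes straight to the "descriptor".
fn unbuffered_calls(n: usize) -> usize {
    let mut w = CountingWriter::default();
    for _ in 0..n {
        w.write_all(b"x").unwrap();
    }
    w.write_calls
}

/// Write `n` single bytes through a user-space buffer.
fn buffered_calls(n: usize) -> usize {
    let mut w = BufWriter::with_capacity(4096, CountingWriter::default());
    for _ in 0..n {
        w.write_all(b"x").unwrap();
    }
    // into_inner flushes the buffer and hands back the wrapped writer.
    w.into_inner().unwrap().write_calls
}

fn main() {
    println!("unbuffered: {} calls", unbuffered_calls(1000));
    println!("buffered:   {} calls", buffered_calls(1000));
}
```

With 1,000 one-byte writes, the unbuffered path makes 1,000 calls. The buffered path makes one, because all the data fits in the buffer and goes out in a single flush.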

Memory-mapped files: mmap a file into your address space once, then read and write through normal memory operations. The page fault handler does the I/O transparently. No explicit read()/write() syscalls at all.

io_uring: one of Linux’s newer I/O interfaces. You submit batches of I/O requests through a shared-memory ring buffer. The kernel processes them and posts completions to another ring. At steady state, with kernel-side submission polling enabled, you can do thousands of I/O operations with zero syscalls per operation.

// The evolution of "read from disk" in terms of syscall overhead:
//
// Traditional:    one read() syscall per operation
// Buffered:       one read() per buffer-full (e.g., 4KB at a time)
// mmap:           zero read() calls — page faults handle it
// io_uring:       zero syscalls at steady state — shared memory rings
//
// Each step reduces boundary crossings while respecting the boundary.

Every one of these optimizations works around the cost of crossing — while still respecting the invariant. You can minimize crossings. You cannot bypass the boundary.

The Invariant, Restated

Let’s be precise about what the kernel promises at the privilege boundary:

  1. User code cannot execute privileged instructions. Hardware enforces this — not software, not convention, not trust. The CPU faults on violations.

  2. The only way to enter the kernel is through a controlled trap. System calls, hardware interrupts, processor exceptions. All enter through entry points the kernel configured at boot. The user cannot redirect them.

  3. The kernel never trusts values from userspace. Every argument is validated. Every pointer is bounds-checked. Every file descriptor is verified. Violations return errors — they never cause kernel-side corruption.

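  4. After the kernel returns, user state is restored. Apart from the return value and the few registers the syscall ABI declares clobbered, the program sees the registers, flags, and stack it had before the trap. After a hardware interrupt, it cannot tell it was interrupted at all, except by elapsed time.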

When this invariant holds, a buggy program crashes itself but not the system. A malicious program wastes its own CPU time but cannot read another process’s data. The kernel is a fortress, and the system call is the guarded gate.

When this invariant breaks — through a kernel bug, a speculative execution side channel, or a misconfigured permission — the consequences are catastrophic. We’ll see exactly how catastrophic in the final post of this series, when we examine Spectre and Meltdown: attacks that violated the memory isolation invariant through the CPU itself.

Next up: the process — the kernel’s most fundamental abstraction. How the OS creates the illusion that every program owns the entire machine, and what data structures it maintains to keep that illusion intact.


This is Post 2 of the series Invariants the Kernel Keeps — operating systems through the guarantees the kernel makes, and what happens when they break.

This post is licensed under CC BY 4.0 by the author.