Crossing the Boundary: System Calls, Privilege, and the Trust Invariant
The kernel never trusts you. Every system call crosses a hardware-enforced privilege boundary — and that boundary is the most important invariant in the entire OS.
In the last post, we said an OS is a collection of invariants. Virtualization, concurrency, persistence — promises the kernel makes and must never break. But we skipped the obvious question: how does it enforce them?
The answer is a boundary. A hardware-enforced, CPU-checked, non-negotiable wall between your code and the kernel. Your program runs on one side. The OS runs on the other. And the only way to cross is through a controlled gate called a system call.
This boundary is the single most important mechanism in operating systems. Every other invariant depends on it.
Two Modes, One CPU
Every modern CPU operates in at least two privilege levels. x86 calls them rings. ARM calls them exception levels. The names differ. The principle doesn’t.
Ring 3 (user mode) — where your program lives. It can compute, branch, allocate from its own address space, call its own functions. It cannot touch hardware. It cannot modify page tables. It cannot disable interrupts. It cannot read kernel memory. The CPU checks every instruction against the current privilege level and faults on violations.
Ring 0 (kernel mode) — where the OS lives. It can do everything. All instructions. All memory. All devices. All I/O ports. There are no restrictions.
This is not a software convention. It’s a hardware gate. The current privilege level lives in CPU state (on x86, the low two bits of the %cs register), and the only way to raise it from user to kernel is through a trap: a controlled entry point that the kernel itself configured at boot time.
graph TB
subgraph Ring3["Ring 3 - User Mode"]
direction LR
P["Your Program"]
CAN1["Can: Compute, branch, loop"]
CANT1["Cannot: Touch hardware"]
CANT2["Cannot: Modify page tables"]
CANT3["Cannot: Read kernel memory"]
end
P -.->|"SYSCALL instruction"| TRAP["TRAP - Hardware Gate"]
subgraph Ring0["Ring 0 - Kernel Mode"]
direction LR
K["Kernel"]
CAN2["All instructions"]
CAN3["All memory"]
CAN4["All devices"]
end
TRAP -->|"controlled entry"| K
K -->|"SYSRET"| P
style Ring3 fill:#1a1a2e,stroke:#e94560,color:#fff
style Ring0 fill:#0f3460,stroke:#16213e,color:#fff
style TRAP fill:#e94560,stroke:#e94560,color:#fff
The privilege invariant: untrusted code cannot execute privileged instructions without kernel mediation. The CPU enforces this on every instruction, every cycle. No exceptions.
System Calls vs. Procedural Calls
Your program makes function calls constantly. calculate(x, y) pushes arguments, jumps to an address, runs code, returns a value. The entire thing happens in user mode. Same privilege level. Same address space. No boundary crossed. This is a procedural call — a standard interaction between functions within a program.
But when your program calls read(fd, buf, 4096) to read from a file, something fundamentally different happens. Your program can’t read from a disk — it doesn’t have permission to talk to hardware. It needs the kernel. So it has to ask.
// Procedural call: stays in user mode
fn add(a: u64, b: u64) -> u64 {
    a + b // ~2 nanoseconds. No mode switch. No validation.
}

// System call: crosses into kernel mode
use std::fs::File;
use std::io::Read;

fn read_file() -> std::io::Result<Vec<u8>> {
    let mut f = File::open("data.txt")?; // syscall: open()
    let mut buf = Vec::new();
    f.read_to_end(&mut buf)?; // syscall: read()
    Ok(buf)
    // Each of these triggered a trap into the kernel.
    // The CPU switched to ring 0, validated everything,
    // did the work, switched back to ring 3.
}
The cost difference is staggering. A procedural call takes a few nanoseconds. A system call takes hundreds of nanoseconds, sometimes microseconds. You’re paying for a mode switch, register save/restore, argument validation, the actual work, and a mode switch back. Depending on the CPU, the kernel’s mitigations, and the call itself, that’s easily 100–150x the cost of a regular function call.
This overhead is not waste. It’s the price of safety.
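If you want to see the gap on your own machine, here is a rough sketch of a measurement. It assumes a Unix-like system with /dev/null, and the names `time_function_calls` and `time_syscalls` are made up for this example. Treat the numbers as ballpark figures: they vary with the CPU, the kernel version, and the build profile.

```rust
use std::fs::OpenOptions;
use std::hint::black_box;
use std::io::Write;
use std::time::{Duration, Instant};

/// A plain function call. #[inline(never)] keeps the compiler from erasing it.
#[inline(never)]
fn add(a: u64, b: u64) -> u64 {
    a.wrapping_add(b)
}

/// Times `iters` ordinary function calls (hypothetical helper).
fn time_function_calls(iters: u32) -> Duration {
    let start = Instant::now();
    let mut acc = 0u64;
    for i in 0..iters {
        acc = add(black_box(acc), black_box(i as u64));
    }
    black_box(acc);
    start.elapsed()
}

/// Times `iters` one-byte writes to /dev/null (assumes a Unix-like OS).
/// `File` is unbuffered, so each one-byte write is one write() system call.
fn time_syscalls(iters: u32) -> std::io::Result<Duration> {
    let mut sink = OpenOptions::new().write(true).open("/dev/null")?;
    let start = Instant::now();
    for _ in 0..iters {
        sink.write_all(b"x")?;
    }
    Ok(start.elapsed())
}
```

Call both with the same iteration count (say, 1,000,000) and divide each duration by that count to get a per-call cost. Build with `--release`, or the function-call side will look slower than it really is.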
graph LR
subgraph Procedural["Procedural Call"]
direction TB
F1["caller()"] -->|"push args, jump"| F2["function()"]
F2 -->|"return"| F1
end
subgraph Syscall["System Call"]
direction TB
U["User code"] -->|"SYSCALL"| T["Hardware Trap"]
T -->|"ring 0"| K["Kernel: validate, execute"]
K -->|"SYSRET"| U
end
style Procedural fill:#16213e,stroke:#16213e,color:#fff
style Syscall fill:#1a1a2e,stroke:#e94560,color:#fff
style T fill:#e94560,stroke:#e94560,color:#fff
Procedural call: same mode, ~2 ns. System call: mode switch + validation, ~300 ns. The 150x overhead buys safety.
What Happens During a System Call
Let’s trace write(1, "hello\n", 6) — writing six bytes to stdout on x86-64 Linux.
Your C library (or Rust’s std) puts the syscall number in %rax (1 for write), the file descriptor in %rdi (1 for stdout), a pointer to the data in %rsi, and the byte count in %rdx. Then it executes the SYSCALL instruction.
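To make the register setup concrete, here is a minimal sketch that skips the library wrapper and issues the instruction directly. It assumes x86-64 Linux and will not build anywhere else. `raw_write` is a name invented for this example; real programs should go through std or libc.

```rust
use std::arch::asm;

/// Issues write(fd, buf, len) with a bare SYSCALL instruction.
/// x86-64 Linux only. Returns the raw kernel result:
/// the number of bytes written, or a negative errno value.
fn raw_write(fd: i32, buf: &[u8]) -> isize {
    let ret: isize;
    unsafe {
        asm!(
            "syscall",
            inlateout("rax") 1isize => ret, // syscall number 1 = write; result returns in rax
            in("rdi") fd as isize,          // first argument: file descriptor
            in("rsi") buf.as_ptr(),         // second argument: pointer to the data
            in("rdx") buf.len(),            // third argument: byte count
            lateout("rcx") _,               // SYSCALL overwrites rcx with the user RIP
            lateout("r11") _,               // and r11 with the user RFLAGS
            options(nostack),
        );
    }
    ret
}
```

`raw_write(1, b"hello\n")` performs the exact call traced in this section. A bad descriptor such as -1 comes back as a negative errno rather than a crash, which is the validation step described next doing its job.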
The CPU takes over. It saves the user’s instruction pointer in %rcx and the flags in %r11. It loads the kernel’s entry point from a model-specific register (configured at boot; the user can’t change it). It switches the privilege level to ring 0. Notably, SYSCALL does not switch stacks: the kernel’s entry code does that itself, before it touches anything else.
Now the kernel runs. Once on its own stack, it saves all user registers. It looks up syscall number 1 in the syscall table. It calls sys_write. And then comes the critical part: it validates everything.
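The table lookup is easy to picture as an array of function pointers indexed by syscall number. The sketch below is a toy model, not kernel code: `Handler`, `SYSCALL_TABLE`, and `dispatch` are names invented here, and the handlers are stubs. The numbers 0 and 1 do match read and write on x86-64 Linux, and 38 is Linux's ENOSYS.

```rust
/// A toy handler signature: three raw arguments in, a result or negative errno out.
type Handler = fn([u64; 3]) -> i64;

/// Linux errno for "function not implemented".
const ENOSYS: i64 = 38;

/// Stub handlers. A real kernel would validate the arguments and do the I/O here.
fn sys_read(_args: [u64; 3]) -> i64 {
    0 // pretend end-of-file
}

fn sys_write(args: [u64; 3]) -> i64 {
    args[2] as i64 // pretend every requested byte was written
}

/// Index = syscall number. On x86-64 Linux, 0 is read and 1 is write.
static SYSCALL_TABLE: [Option<Handler>; 2] = [
    Some(sys_read as Handler),
    Some(sys_write as Handler),
];

/// Looks up the handler for `nr`. Unknown numbers fail cleanly with -ENOSYS.
fn dispatch(nr: u64, args: [u64; 3]) -> i64 {
    match SYSCALL_TABLE.get(nr as usize) {
        Some(Some(handler)) => handler(args),
        _ => -ENOSYS,
    }
}
```

The bounds check in `dispatch` is itself a small instance of the trust invariant: the syscall number comes from user space, so even the table index is validated before use.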
Is fd=1 a valid, open file descriptor for this process? Is the pointer %rsi actually in user-space memory, not kernel memory? Is the byte count reasonable? Does the calling process have write permission on this descriptor?
The kernel checks all of this before it does any work. If any check fails, the syscall returns an error. The kernel never follows a user-provided pointer blindly. It never assumes a file descriptor is valid. It never trusts the byte count.
sequenceDiagram
participant U as User Program
participant CPU as CPU (SYSCALL)
participant K as Kernel
U->>U: Set registers rax=1, rdi=1, rsi=buf, rdx=6
U->>CPU: SYSCALL instruction
Note over CPU: Save user RIP + RFLAGS, switch to Ring 0, load kernel entry point
CPU->>K: Enter kernel at syscall handler
Note over K: Save all user registers, look up syscall table[1], call sys_write()
rect rgb(233, 69, 96)
Note over K: VALIDATE: Is fd valid? Is buf in userspace? Is count reasonable? Has write permission?
end
K->>K: Write bytes to device driver
K->>CPU: Return value in %rax
Note over CPU: Restore user registers, switch to Ring 3
CPU->>U: Continue execution
The trust invariant: the kernel never trusts data from userspace. Every pointer is bounds-checked. Every file descriptor is validated. Every argument is verified before use.
If the kernel skipped this — if it dereferenced a user-provided pointer without checking — a malicious program could pass a kernel address and trick the kernel into overwriting its own data structures. This single class of bugs (missing validation of userspace pointers) has been responsible for dozens of Linux privilege escalation exploits.
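Here is a rough sketch of what those checks look like for write(). This is a toy model under stated assumptions, not how Linux implements it: `Process`, `validate_write`, and `USER_SPACE_END` are hypothetical, and the policy for oversized counts is a simplification. The errno numbers are the Linux values.

```rust
/// Errno values as numbered on Linux.
const EBADF: i64 = 9;
const EFAULT: i64 = 14;
const EINVAL: i64 = 22;

/// Hypothetical exclusive upper bound of the user half of the address space.
const USER_SPACE_END: u64 = 0x0000_8000_0000_0000;

/// Toy per-process state: index = fd, Some(writable) if the descriptor is open.
struct Process {
    fds: Vec<Option<bool>>,
}

/// Checks the arguments of write(fd, buf, count) before any work is done.
/// Returns Ok(()) or a negative errno.
fn validate_write(p: &Process, fd: i64, buf: u64, count: u64) -> Result<(), i64> {
    // 1. The descriptor must be open in this process and opened for writing.
    if fd < 0 {
        return Err(-EBADF);
    }
    match p.fds.get(fd as usize) {
        Some(Some(true)) => {}
        _ => return Err(-EBADF),
    }
    // 2. Toy policy: the count must fit in a non-negative return value.
    if count > i64::MAX as u64 {
        return Err(-EINVAL);
    }
    // 3. The whole range [buf, buf + count) must lie in user space,
    //    and computing its end must not wrap around.
    let end = buf.checked_add(count).ok_or(-EFAULT)?;
    if end > USER_SPACE_END {
        return Err(-EFAULT);
    }
    Ok(())
}
```

The `checked_add` matters as much as the range comparison. Without it, a pointer near the top of the address space plus a large count could wrap around to a small number and slip past the bounds check.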
Why Not Just Let Programs Talk to Hardware?
Early computers did exactly this. In the 1950s batch processing era, one program ran at a time with full control of the machine. No protection. No isolation. No system calls. The program was the OS.
This worked fine until someone wanted to run two programs. Without kernel mediation:
- Two programs could write to the same disk sector simultaneously — data corruption
- A buggy program could overwrite another program’s memory — silent state corruption
- A crashed program could leave hardware in an undefined state — the next program inherits chaos
- A malicious program could read anything in memory — passwords, keys, other users’ data
graph TB
subgraph Without["Without Protection"]
direction TB
A1["Program A"] -->|"direct write"| D1["Disk Sector 42"]
B1["Program B"] -->|"direct write"| D1
D1 --> CORRUPT["Data Corruption"]
end
subgraph With["With Protection"]
direction TB
A2["Program A"] -->|"write()"| OS["OS Kernel"]
B2["Program B"] -->|"write()"| OS
OS --> D2["Disk"]
D2 --> SAFE["Safe"]
end
style Without fill:#4a0000,stroke:#e94560,color:#fff
style With fill:#003a1a,stroke:#16c79a,color:#fff
style CORRUPT fill:#e94560,stroke:#e94560,color:#fff
style SAFE fill:#16c79a,stroke:#16c79a,color:#fff
The privilege boundary exists because sharing hardware without mediation is chaos. The moment you have two programs on one machine, you need an arbiter. That arbiter needs authority they can’t override. Hardware privilege levels give the kernel that authority.
The Cost, and How to Minimize It
System calls are expensive. High-performance systems go to extraordinary lengths to reduce crossings:
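Buffered I/O: println! in Rust doesn’t call write() per character. Standard output is line-buffered, so output collects in a user-space buffer and is flushed with one syscall per line. To batch many lines into a single crossing, wrap the handle in a BufWriter. One crossing instead of hundreds.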
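You can watch buffering collapse the crossings without touching the kernel at all. In this sketch, `CountingWriter` is a hypothetical stand-in for a file descriptor: every call to its `write` method represents one write() syscall. The 8192-byte capacity is just a value chosen for the example.

```rust
use std::io::{self, BufWriter, Write};

/// Stand-in for an unbuffered file: each `write` call models one syscall.
struct CountingWriter {
    writes: usize,
}

impl Write for CountingWriter {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.writes += 1;
        Ok(buf.len()) // accept everything, as a healthy file usually would
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

/// Writes `lines` short lines straight to the sink: one "syscall" per line.
fn unbuffered_writes(lines: usize) -> io::Result<usize> {
    let mut out = CountingWriter { writes: 0 };
    for _ in 0..lines {
        out.write_all(b"line\n")?;
    }
    Ok(out.writes)
}

/// Writes the same lines through a BufWriter, which forwards data to the sink
/// only when its buffer fills up or when it is flushed.
fn buffered_writes(lines: usize) -> io::Result<usize> {
    let mut out = BufWriter::with_capacity(8192, CountingWriter { writes: 0 });
    for _ in 0..lines {
        out.write_all(b"line\n")?;
    }
    out.flush()?;
    Ok(out.get_ref().writes)
}
```

With 1,000 five-byte lines, the unbuffered path reaches the sink 1,000 times. The buffered path holds all 5,000 bytes in its 8 KiB buffer and hands them over once, at the flush.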
Memory-mapped files: mmap a file into your address space once, then read and write through normal memory operations. The page fault handler does the I/O transparently. A page fault still enters the kernel, but only on the first touch of each page, and there are no explicit read()/write() syscalls at all.
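io_uring: a newer Linux I/O interface, introduced in kernel 5.1. You submit batches of I/O requests through a shared-memory ring buffer. The kernel processes them and posts completions to another ring. By default, one io_uring_enter() syscall submits a whole batch. With submission-queue polling enabled (IORING_SETUP_SQPOLL), a kernel thread watches the ring, and at steady state you can do thousands of I/O operations with zero syscalls per operation.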
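The two-ring idea is easier to see in miniature. The sketch below is a toy model in plain Rust and has nothing to do with the real io_uring API: `Rings`, `Request`, and `Completion` are invented names, the queues are ordinary in-process structures rather than shared memory, and the "kernel" is just a method call.

```rust
use std::collections::VecDeque;

/// Toy request record (hypothetical, not a real io_uring structure).
#[derive(Debug, Clone, PartialEq)]
struct Request {
    id: u64,
    len: usize,
}

/// Toy completion record (hypothetical, not a real io_uring structure).
#[derive(Debug, Clone, PartialEq)]
struct Completion {
    id: u64,
    result: usize,
}

/// Two queues standing in for the submission and completion rings.
struct Rings {
    capacity: usize,
    submissions: VecDeque<Request>,
    completions: VecDeque<Completion>,
}

impl Rings {
    fn new(capacity: usize) -> Self {
        Rings {
            capacity,
            submissions: VecDeque::new(),
            completions: VecDeque::new(),
        }
    }

    /// "User side": queue a request. Returns false if the submission ring is full.
    fn submit(&mut self, req: Request) -> bool {
        if self.submissions.len() == self.capacity {
            return false;
        }
        self.submissions.push_back(req);
        true
    }

    /// "Kernel side": drain every pending request and post one completion each.
    /// The pretend I/O reports the requested length as its result.
    fn process(&mut self) -> usize {
        let mut handled = 0;
        while let Some(req) = self.submissions.pop_front() {
            self.completions.push_back(Completion { id: req.id, result: req.len });
            handled += 1;
        }
        handled
    }

    /// "User side": pick up the next finished request, if there is one.
    fn reap(&mut self) -> Option<Completion> {
        self.completions.pop_front()
    }
}
```

The point of the shape is that `submit` and `reap` are plain memory operations. Only `process` represents kernel work, and it handles a whole batch at once.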
// The evolution of "read from disk" in terms of syscall overhead:
//
// Traditional: one read() syscall per operation
// Buffered: one read() per buffer-full (e.g., 4KB at a time)
// mmap: zero read() calls — page faults handle it
// io_uring: zero syscalls at steady state — shared memory rings
//
// Each step reduces boundary crossings while respecting the boundary.
Every one of these optimizations works around the cost of crossing — while still respecting the invariant. You can minimize crossings. You cannot bypass the boundary.
The Invariant, Restated
Let’s be precise about what the kernel promises at the privilege boundary:
User code cannot execute privileged instructions. Hardware enforces this — not software, not convention, not trust. The CPU faults on violations.
The only way to enter the kernel is through a controlled trap. System calls, hardware interrupts, processor exceptions. All enter through entry points the kernel configured at boot. The user cannot redirect them.
The kernel never trusts values from userspace. Every argument is validated. Every pointer is bounds-checked. Every file descriptor is verified. Violations return errors — they never cause kernel-side corruption.
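After the kernel returns, user state is restored. Apart from the return value (and the few registers the ABI declares clobbered, such as %rcx and %r11 on x86-64), the program sees the same registers, flags, and stack it had before the call. After a hardware interrupt the restoration is exact: the program cannot tell it was interrupted, except by elapsed time.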
When this invariant holds, a buggy program crashes itself but not the system. A malicious program wastes its own CPU time but cannot read another process’s data. The kernel is a fortress, and the system call is the guarded gate.
When this invariant breaks — through a kernel bug, a speculative execution side channel, or a misconfigured permission — the consequences are catastrophic. We’ll see exactly how catastrophic in the final post of this series, when we examine Spectre and Meltdown: attacks that violated the memory isolation invariant through the CPU itself.
Next up: the process — the kernel’s most fundamental abstraction. How the OS creates the illusion that every program owns the entire machine, and what data structures it maintains to keep that illusion intact.
This is Post 2 of the series Invariants the Kernel Keeps — operating systems through the guarantees the kernel makes, and what happens when they break.