The Scheduler: Fairness on a Finite Machine
Sixty processes want the CPU. Four cores exist. The scheduler decides who runs, for how long, and what happens when everyone wants more. This post names the invariant it must keep.
In the last post, we built the process — the kernel’s most fundamental illusion. Private memory. Private registers. A CPU that seems dedicated. But we left the hardest question unanswered: when sixty processes want to run and four cores exist, who gets the CPU?
That’s the scheduler’s job. And it’s harder than it sounds, because the scheduler must satisfy goals that directly contradict each other.
The Scheduling Problem
At any moment, the system has some processes in the ready state — they could run but don’t have a CPU. The scheduler picks which ready process runs next, how long it runs before being preempted, and in what order others get their turn.
This seems like a simple queue problem. It isn’t. Different processes have different needs:
- An interactive shell needs low response time — the user typed a key and expects feedback in milliseconds
- A video encoder needs high throughput — process as many frames as possible per second
- A database query needs predictable latency — don’t spike from 1 ms to 100 ms randomly
- A background backup needs to run eventually without starving anyone else
graph TB
subgraph Ready["Ready Queue"]
direction LR
P1["Shell\n(wants: low latency)"]
P2["Compiler\n(wants: throughput)"]
P3["Database\n(wants: predictability)"]
P4["Backup\n(wants: any CPU time)"]
end
SCHED["Scheduler"] -->|"picks one"| CPU["CPU Core"]
Ready --> SCHED
style SCHED fill:#e94560,stroke:#e94560,color:#fff
style CPU fill:#16c79a,stroke:#16c79a,color:#fff
No scheduling algorithm can optimize all of these simultaneously. Every policy is a tradeoff. The scheduler’s job is to find the least-bad compromise — and to guarantee one absolute invariant.
The fairness invariant: every runnable process eventually gets CPU time. No process starves.
“Eventually” is doing heavy lifting in that sentence. It doesn’t mean “immediately.” It doesn’t mean “within 10 ms.” It means the scheduler will not forget about you. Given enough time, every ready process runs. Violate this and you get starvation — a process that is runnable but never scheduled. It sits in the ready queue forever, waiting for a turn that never comes.
Preemption: Taking the CPU Away
Early operating systems used cooperative scheduling. A process ran until it voluntarily yielded the CPU — by calling yield() or making a blocking system call. This works only if every process cooperates. One infinite loop and the entire system freezes.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
// Cooperative scheduling — the process decides when to stop
fn cooperative_task() {
loop {
do_some_work();
yield_cpu(); // "I'm done for now, someone else can run"
}
}
// What if a process never yields?
fn malicious_task() {
loop {
// I'm keeping the CPU forever.
// Nothing the OS can do about it.
// Every other process starves.
}
}
Modern systems use preemptive scheduling. A hardware timer fires periodically (every 1-10 ms), generating an interrupt. The interrupt transfers control to the kernel, which can forcibly stop the running process and switch to another. The process doesn’t get a vote.
sequenceDiagram
participant P as Process A (Running)
participant T as Timer Interrupt
participant K as Kernel Scheduler
participant Q as Process B (Ready)
Note over P: Running on CPU...
T->>K: Timer interrupt (every ~4 ms)
K->>K: Save Process A's registers
K->>K: Decision: A's time slice expired?
alt Time slice expired
K->>K: Move A to ready queue
K->>Q: Restore B's registers, switch to B
Note over Q: B is now Running
else Time slice remaining
K->>P: Resume A
end
The preemption invariant: no user process can hold the CPU indefinitely. The hardware timer guarantees the kernel regains control periodically, regardless of what user code does.
This is why your system stays responsive even when a program enters an infinite loop. The loop runs for a few milliseconds, the timer fires, the kernel takes over, and other processes get their turn. The looping process still runs — it just shares.
Scheduling Policies
Every scheduling algorithm answers two questions: who runs next, and for how long?
FIFO (First In, First Out)
The simplest policy. Processes run in the order they arrive. Each process runs to completion (or until it blocks).
1
2
3
4
5
6
7
8
9
10
11
12
// FIFO: run each process until it finishes
// Process A: needs 100 ms
// Process B: needs 10 ms (arrives at t=0)
// Process C: needs 5 ms (arrives at t=0)
//
// Order: A (0-100ms), B (100-110ms), C (110-115ms)
//
// Turnaround times: A=100, B=110, C=115
// Average: 108 ms
//
// Problem: B and C wait 100ms for A to finish.
// This is called the "convoy effect."
FIFO is fair in the sense that it processes in order. But it’s terrible for response time. A short job behind a long job waits forever. No production OS uses pure FIFO for general scheduling.
Round Robin
Give each process a fixed time slice (quantum) — say 10 ms. When the slice expires, preempt the process and put it at the back of the queue. Cycle through everyone.
graph LR
subgraph RR["Round Robin (10ms quantum)"]
direction LR
T0["0-10ms\nA runs"] --> T1["10-20ms\nB runs"]
T1 --> T2["20-30ms\nC runs"]
T2 --> T3["30-40ms\nA runs"]
T3 --> T4["40-50ms\nB finishes"]
T4 --> T5["50-55ms\nC finishes"]
T5 --> T6["55-145ms\nA finishes"]
end
style RR fill:#1a1a2e,stroke:#e94560,color:#fff
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
// Round Robin with 10ms quantum:
// Process A: needs 100 ms
// Process B: needs 10 ms
// Process C: needs 5 ms
//
// t=0: A runs (10ms slice)
// t=10: B runs (10ms slice — finishes at t=20)
// t=20: C runs (5ms — finishes at t=25)
// t=25: A runs (remaining 90ms, alone on CPU)
// t=115: A finishes
//
// Turnaround: A=115, B=20, C=25
// Average: 53 ms (vs 108 ms for FIFO)
//
// B and C finish much sooner. A finishes slightly later.
// That's the tradeoff: better responsiveness, slightly worse throughput.
The quantum size matters enormously:
- Too large (100 ms): degenerates into FIFO. Long waits between turns.
- Too small (0.1 ms): too many context switches. The CPU spends more time switching than computing.
- Sweet spot (1-10 ms): balance between responsiveness and overhead.
Multi-Level Feedback Queue (MLFQ)
The real world is more nuanced. Some processes are interactive (short CPU bursts, lots of I/O waits). Others are compute-bound (long CPU bursts, rare I/O). The scheduler should treat them differently — but it doesn’t know in advance which is which.
MLFQ solves this with multiple priority queues and feedback. New processes start at the highest priority. If a process uses its entire time slice (compute-bound behavior), it drops to a lower priority. If a process voluntarily yields before its slice expires (I/O-bound behavior), it stays at high priority.
graph TB
subgraph MLFQ["Multi-Level Feedback Queue"]
Q0["Queue 0 (highest priority)\n8ms quantum"]
Q1["Queue 1\n16ms quantum"]
Q2["Queue 2 (lowest priority)\n32ms quantum"]
end
NEW["New Process"] --> Q0
Q0 -->|"used full slice"| Q1
Q1 -->|"used full slice"| Q2
Q0 -->|"yielded early (I/O)"| Q0
Q1 -->|"yielded early (I/O)"| Q1
SCHED["Scheduler: always run highest non-empty queue"]
style Q0 fill:#16c79a,stroke:#16c79a,color:#fff
style Q1 fill:#c77a30,stroke:#c77a30,color:#fff
style Q2 fill:#e94560,stroke:#e94560,color:#fff
1
2
3
4
5
6
7
8
9
10
11
12
13
14
// MLFQ in action:
//
// Shell (interactive):
// - Runs for 1ms (processes keystroke), blocks on I/O
// - Stays in Queue 0 — always high priority
// - Gets scheduled immediately when I/O completes
//
// Compiler (CPU-bound):
// - Runs for full 8ms slice in Queue 0
// - Demoted to Queue 1, runs for 16ms
// - Demoted to Queue 2, runs with 32ms slices
// - Still gets CPU time — just lower priority
//
// Result: interactive tasks are snappy, batch tasks still make progress.
MLFQ has one remaining problem: a compute-bound process can get permanently stuck in the lowest queue. If enough interactive processes keep arriving, the compute-bound process starves. The fix: priority boost. Periodically (every second or so), move all processes back to the highest queue. This prevents starvation and lets the scheduler re-learn each process’s behavior.
CFS: Linux’s Completely Fair Scheduler
Linux doesn’t use MLFQ. Since 2007, it uses the Completely Fair Scheduler (CFS) — built on a different idea entirely. Instead of priority queues, CFS tracks how much CPU time each process has consumed and always runs the process that has received the least.
CFS maintains a red-black tree keyed by virtual runtime (vruntime) — a weighted measure of how much CPU time a process has used. The process with the smallest vruntime is always the leftmost node. That’s who runs next.
graph TB
subgraph RBTree["CFS Red-Black Tree (sorted by vruntime)"]
direction TB
N1["Process C\nvruntime: 50ms"]
N2["Process A\nvruntime: 80ms"]
N3["Process B\nvruntime: 120ms"]
N2 --> N1
N2 --> N3
end
PICK["Next to run: Process C (smallest vruntime)"]
N1 --> PICK
style N1 fill:#16c79a,stroke:#16c79a,color:#fff
style PICK fill:#16c79a,stroke:#16c79a,color:#fff
style N2 fill:#1a1a2e,stroke:#e94560,color:#fff
style N3 fill:#1a1a2e,stroke:#e94560,color:#fff
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
// CFS simplified:
struct CFSScheduler {
// Red-black tree of runnable processes, keyed by vruntime
tree: BTreeMap<u64, Process>,
}
impl CFSScheduler {
fn pick_next(&self) -> &Process {
// Always pick the process with the smallest vruntime
// O(log n) — but the leftmost node is cached, so O(1) in practice
self.tree.iter().next().unwrap().1
}
fn update_after_running(&mut self, proc: &mut Process, ran_for: u64) {
// Increase vruntime by how long the process ran
// Higher-priority processes accumulate vruntime slower (weighted)
proc.vruntime += ran_for * NICE_TO_WEIGHT[proc.nice];
}
}
The beauty of CFS: fairness falls out of the math. If two processes have equal weight, they’ll accumulate vruntime at the same rate and alternate on the CPU equally. If one has higher priority (lower nice value), its vruntime increases slower, so it gets scheduled more often. No queues. No quantum tuning. Just “run whoever has been served the least.”
CFS invariant: over any sufficiently long interval, each process receives CPU time proportional to its weight. A process with twice the weight gets twice the time. A process with equal weight gets equal time. And no process’s vruntime falls so far behind that it never catches up — which means no starvation.
Priority Inversion: When Fairness Breaks
The most famous scheduling bug in history happened on Mars.
In 1997, the Mars Pathfinder lander repeatedly reset itself — the watchdog timer kept firing because a high-priority task couldn’t run. The cause: priority inversion.
Three tasks were involved:
- High priority: the bus management task (must run frequently)
- Medium priority: a communication task
- Low priority: a meteorological data collector
The low-priority task held a shared mutex. The high-priority task needed that mutex. So the high-priority task blocked, waiting for the low-priority task to release the lock. But the medium-priority task — which didn’t need the lock at all — kept preempting the low-priority task. The low-priority task couldn’t run, so it couldn’t release the lock, so the high-priority task couldn’t run.
sequenceDiagram
participant H as High Priority (blocked)
participant M as Medium Priority (running)
participant L as Low Priority (holds lock)
L->>L: Acquires mutex
H->>L: Needs mutex — BLOCKS
Note over H: High priority is now waiting for Low
M->>M: Runs (preempts Low)
Note over L: Low can't release mutex — it's preempted by Medium
Note over H: High is starved — it has highest priority but can't run
Note over H,L: Solution: Priority Inheritance
Note over L: Temporarily boost Low to High's priority
L->>L: Runs at high priority, releases mutex
H->>H: Acquires mutex, runs
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
// Priority inversion in pseudocode:
//
// Low acquires lock(mutex)
// High tries to acquire lock(mutex) → BLOCKED (waiting for Low)
// Medium starts running (higher priority than Low)
// Medium runs... and runs... and runs...
// Low never gets CPU to release the lock
// High never gets the lock
// Watchdog fires → system reset
//
// Fix: priority inheritance
// When High blocks on a lock held by Low,
// temporarily boost Low's priority to High's level.
// Low preempts Medium, releases the lock, priority drops back.
// High acquires the lock and runs.
The Pathfinder team fixed it by enabling priority inheritance — a mechanism where a low-priority lock holder temporarily inherits the priority of the highest-priority waiter. The fix was uploaded to Mars.
The bounded inversion invariant: a high-priority process blocked on a resource held by a lower-priority process must not wait longer than the time it takes the holder to release the resource. Priority inheritance enforces this by preventing medium-priority processes from indefinitely preempting the lock holder.
Measuring the Scheduler
How do you know if the scheduler is doing its job? Three metrics:
Turnaround time — how long from process creation to completion. Lower is better for batch workloads.
Response time — how long from a process becoming ready to when it first runs. Lower is better for interactive workloads.
Fairness — how evenly CPU time is distributed. Jain’s fairness index ranges from 0 (completely unfair) to 1 (perfectly fair).
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
// The tradeoff triangle:
//
// Turnaround
// /\
// / \
// / \
// /______\
// Response Fairness
//
// FIFO: Great turnaround (no switching overhead)
// Terrible response time (convoy effect)
//
// Round Robin: Great response time (everyone gets turns)
// Worse turnaround (context switch overhead)
//
// CFS: Great fairness (proportional sharing)
// Good response time (low vruntime = scheduled soon)
// Decent turnaround (weight-based priorities)
No scheduler wins all three. CFS comes closest to a balanced compromise, which is why it’s been Linux’s default for nearly two decades.
Putting It Together
The scheduler is the kernel’s answer to an impossible problem: make finite CPUs feel infinite. Its tools are simple — queues, timers, context switches — but the tradeoffs are deep.
| Policy | Strength | Weakness |
|---|---|---|
| FIFO | Simple, no switching overhead | Convoy effect, poor interactivity |
| Round Robin | Fair, responsive | Context switch overhead, no priority |
| MLFQ | Adapts to behavior, prioritizes I/O | Complex, needs anti-starvation boost |
| CFS | Mathematically fair, simple model | Red-black tree overhead, less tunable |
The invariant that unifies them all:
Every runnable process eventually gets CPU time. The scheduler may delay you. It may give you less than you want. But it will never forget you exist. Starvation is the one failure mode that no correct scheduler permits.
Next up: virtual memory — how the kernel gives every process its own address space, translates virtual addresses to physical ones through page tables, and handles the moment when a process touches memory that isn’t there.
This is Post 4 of the series Invariants the Kernel Keeps — operating systems through the guarantees the kernel makes, and what happens when they break.
Stay in the loop
Subscribe via RSS to get new posts on systems, Rust, and cryptography.
Subscribe to RSS