Post

The Scheduler: Fairness on a Finite Machine

Sixty processes want the CPU. Four cores exist. The scheduler decides who runs, for how long, and what happens when everyone wants more. This post names the invariant it must keep.

The Scheduler: Fairness on a Finite Machine

In the last post, we built the process — the kernel’s most fundamental illusion. Private memory. Private registers. A CPU that seems dedicated. But we left the hardest question unanswered: when sixty processes want to run and four cores exist, who gets the CPU?

That’s the scheduler’s job. And it’s harder than it sounds, because the scheduler must satisfy goals that directly contradict each other.

The Scheduling Problem

At any moment, the system has some processes in the ready state — they could run but don’t have a CPU. The scheduler picks which ready process runs next, how long it runs before being preempted, and in what order others get their turn.

This seems like a simple queue problem. It isn’t. Different processes have different needs:

  • An interactive shell needs low response time — the user typed a key and expects feedback in milliseconds
  • A video encoder needs high throughput — process as many frames as possible per second
  • A database query needs predictable latency — don’t spike from 1 ms to 100 ms randomly
  • A background backup needs to run eventually without starving anyone else
graph TB
    subgraph Ready["Ready Queue"]
        direction LR
        P1["Shell\n(wants: low latency)"]
        P2["Compiler\n(wants: throughput)"]
        P3["Database\n(wants: predictability)"]
        P4["Backup\n(wants: any CPU time)"]
    end

    SCHED["Scheduler"] -->|"picks one"| CPU["CPU Core"]

    Ready --> SCHED

    style SCHED fill:#e94560,stroke:#e94560,color:#fff
    style CPU fill:#16c79a,stroke:#16c79a,color:#fff

No scheduling algorithm can optimize all of these simultaneously. Every policy is a tradeoff. The scheduler’s job is to find the least-bad compromise — and to guarantee one absolute invariant.

The fairness invariant: every runnable process eventually gets CPU time. No process starves.

“Eventually” is doing heavy lifting in that sentence. It doesn’t mean “immediately.” It doesn’t mean “within 10 ms.” It means the scheduler will not forget about you. Given enough time, every ready process runs. Violate this and you get starvation — a process that is runnable but never scheduled. It sits in the ready queue forever, waiting for a turn that never comes.

Preemption: Taking the CPU Away

Early operating systems used cooperative scheduling. A process ran until it voluntarily yielded the CPU — by calling yield() or making a blocking system call. This works only if every process cooperates. One infinite loop and the entire system freezes.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
// Cooperative scheduling — the process decides when to stop
fn cooperative_task() {
    loop {
        do_some_work();
        yield_cpu();  // "I'm done for now, someone else can run"
    }
}

// What if a process never yields?
fn malicious_task() {
    loop {
        // I'm keeping the CPU forever.
        // Nothing the OS can do about it.
        // Every other process starves.
    }
}

Modern systems use preemptive scheduling. A hardware timer fires periodically (every 1-10 ms), generating an interrupt. The interrupt transfers control to the kernel, which can forcibly stop the running process and switch to another. The process doesn’t get a vote.

sequenceDiagram
    participant P as Process A (Running)
    participant T as Timer Interrupt
    participant K as Kernel Scheduler
    participant Q as Process B (Ready)

    Note over P: Running on CPU...

    T->>K: Timer interrupt (every ~4 ms)
    K->>K: Save Process A's registers
    K->>K: Decision: A's time slice expired?

    alt Time slice expired
        K->>K: Move A to ready queue
        K->>Q: Restore B's registers, switch to B
        Note over Q: B is now Running
    else Time slice remaining
        K->>P: Resume A
    end

The preemption invariant: no user process can hold the CPU indefinitely. The hardware timer guarantees the kernel regains control periodically, regardless of what user code does.

This is why your system stays responsive even when a program enters an infinite loop. The loop runs for a few milliseconds, the timer fires, the kernel takes over, and other processes get their turn. The looping process still runs — it just shares.

Scheduling Policies

Every scheduling algorithm answers two questions: who runs next, and for how long?

FIFO (First In, First Out)

The simplest policy. Processes run in the order they arrive. Each process runs to completion (or until it blocks).

1
2
3
4
5
6
7
8
9
10
11
12
// FIFO: run each process until it finishes
// Process A: needs 100 ms
// Process B: needs 10 ms  (arrives at t=0)
// Process C: needs 5 ms   (arrives at t=0)
//
// Order: A (0-100ms), B (100-110ms), C (110-115ms)
//
// Turnaround times: A=100, B=110, C=115
// Average: 108 ms
//
// Problem: B and C wait 100ms for A to finish.
// This is called the "convoy effect."

FIFO is fair in the sense that it processes in order. But it’s terrible for response time. A short job behind a long job waits forever. No production OS uses pure FIFO for general scheduling.

Round Robin

Give each process a fixed time slice (quantum) — say 10 ms. When the slice expires, preempt the process and put it at the back of the queue. Cycle through everyone.

graph LR
    subgraph RR["Round Robin (10ms quantum)"]
        direction LR
        T0["0-10ms\nA runs"] --> T1["10-20ms\nB runs"]
        T1 --> T2["20-30ms\nC runs"]
        T2 --> T3["30-40ms\nA runs"]
        T3 --> T4["40-50ms\nB finishes"]
        T4 --> T5["50-55ms\nC finishes"]
        T5 --> T6["55-145ms\nA finishes"]
    end

    style RR fill:#1a1a2e,stroke:#e94560,color:#fff
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
// Round Robin with 10ms quantum:
// Process A: needs 100 ms
// Process B: needs 10 ms
// Process C: needs 5 ms
//
// t=0:   A runs (10ms slice)
// t=10:  B runs (10ms slice — finishes at t=20)
// t=20:  C runs (5ms — finishes at t=25)
// t=25:  A runs (remaining 90ms, alone on CPU)
// t=115: A finishes
//
// Turnaround: A=115, B=20, C=25
// Average: 53 ms (vs 108 ms for FIFO)
//
// B and C finish much sooner. A finishes slightly later.
// That's the tradeoff: better responsiveness, slightly worse throughput.

The quantum size matters enormously:

  • Too large (100 ms): degenerates into FIFO. Long waits between turns.
  • Too small (0.1 ms): too many context switches. The CPU spends more time switching than computing.
  • Sweet spot (1-10 ms): balance between responsiveness and overhead.

Multi-Level Feedback Queue (MLFQ)

The real world is more nuanced. Some processes are interactive (short CPU bursts, lots of I/O waits). Others are compute-bound (long CPU bursts, rare I/O). The scheduler should treat them differently — but it doesn’t know in advance which is which.

MLFQ solves this with multiple priority queues and feedback. New processes start at the highest priority. If a process uses its entire time slice (compute-bound behavior), it drops to a lower priority. If a process voluntarily yields before its slice expires (I/O-bound behavior), it stays at high priority.

graph TB
    subgraph MLFQ["Multi-Level Feedback Queue"]
        Q0["Queue 0 (highest priority)\n8ms quantum"]
        Q1["Queue 1\n16ms quantum"]
        Q2["Queue 2 (lowest priority)\n32ms quantum"]
    end

    NEW["New Process"] --> Q0

    Q0 -->|"used full slice"| Q1
    Q1 -->|"used full slice"| Q2
    Q0 -->|"yielded early (I/O)"| Q0
    Q1 -->|"yielded early (I/O)"| Q1

    SCHED["Scheduler: always run highest non-empty queue"]

    style Q0 fill:#16c79a,stroke:#16c79a,color:#fff
    style Q1 fill:#c77a30,stroke:#c77a30,color:#fff
    style Q2 fill:#e94560,stroke:#e94560,color:#fff
1
2
3
4
5
6
7
8
9
10
11
12
13
14
// MLFQ in action:
//
// Shell (interactive):
//   - Runs for 1ms (processes keystroke), blocks on I/O
//   - Stays in Queue 0 — always high priority
//   - Gets scheduled immediately when I/O completes
//
// Compiler (CPU-bound):
//   - Runs for full 8ms slice in Queue 0
//   - Demoted to Queue 1, runs for 16ms
//   - Demoted to Queue 2, runs with 32ms slices
//   - Still gets CPU time — just lower priority
//
// Result: interactive tasks are snappy, batch tasks still make progress.

MLFQ has one remaining problem: a compute-bound process can get permanently stuck in the lowest queue. If enough interactive processes keep arriving, the compute-bound process starves. The fix: priority boost. Periodically (every second or so), move all processes back to the highest queue. This prevents starvation and lets the scheduler re-learn each process’s behavior.

CFS: Linux’s Completely Fair Scheduler

Linux doesn’t use MLFQ. Since 2007, it uses the Completely Fair Scheduler (CFS) — built on a different idea entirely. Instead of priority queues, CFS tracks how much CPU time each process has consumed and always runs the process that has received the least.

CFS maintains a red-black tree keyed by virtual runtime (vruntime) — a weighted measure of how much CPU time a process has used. The process with the smallest vruntime is always the leftmost node. That’s who runs next.

graph TB
    subgraph RBTree["CFS Red-Black Tree (sorted by vruntime)"]
        direction TB
        N1["Process C\nvruntime: 50ms"]
        N2["Process A\nvruntime: 80ms"]
        N3["Process B\nvruntime: 120ms"]

        N2 --> N1
        N2 --> N3
    end

    PICK["Next to run: Process C (smallest vruntime)"]
    N1 --> PICK

    style N1 fill:#16c79a,stroke:#16c79a,color:#fff
    style PICK fill:#16c79a,stroke:#16c79a,color:#fff
    style N2 fill:#1a1a2e,stroke:#e94560,color:#fff
    style N3 fill:#1a1a2e,stroke:#e94560,color:#fff
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
// CFS simplified:
struct CFSScheduler {
    // Red-black tree of runnable processes, keyed by vruntime
    tree: BTreeMap<u64, Process>,
}

impl CFSScheduler {
    fn pick_next(&self) -> &Process {
        // Always pick the process with the smallest vruntime
        // O(log n) — but the leftmost node is cached, so O(1) in practice
        self.tree.iter().next().unwrap().1
    }

    fn update_after_running(&mut self, proc: &mut Process, ran_for: u64) {
        // Increase vruntime by how long the process ran
        // Higher-priority processes accumulate vruntime slower (weighted)
        proc.vruntime += ran_for * NICE_TO_WEIGHT[proc.nice];
    }
}

The beauty of CFS: fairness falls out of the math. If two processes have equal weight, they’ll accumulate vruntime at the same rate and alternate on the CPU equally. If one has higher priority (lower nice value), its vruntime increases slower, so it gets scheduled more often. No queues. No quantum tuning. Just “run whoever has been served the least.”

CFS invariant: over any sufficiently long interval, each process receives CPU time proportional to its weight. A process with twice the weight gets twice the time. A process with equal weight gets equal time. And no process’s vruntime falls so far behind that it never catches up — which means no starvation.

Priority Inversion: When Fairness Breaks

The most famous scheduling bug in history happened on Mars.

In 1997, the Mars Pathfinder lander repeatedly reset itself — the watchdog timer kept firing because a high-priority task couldn’t run. The cause: priority inversion.

Three tasks were involved:

  • High priority: the bus management task (must run frequently)
  • Medium priority: a communication task
  • Low priority: a meteorological data collector

The low-priority task held a shared mutex. The high-priority task needed that mutex. So the high-priority task blocked, waiting for the low-priority task to release the lock. But the medium-priority task — which didn’t need the lock at all — kept preempting the low-priority task. The low-priority task couldn’t run, so it couldn’t release the lock, so the high-priority task couldn’t run.

sequenceDiagram
    participant H as High Priority (blocked)
    participant M as Medium Priority (running)
    participant L as Low Priority (holds lock)

    L->>L: Acquires mutex
    H->>L: Needs mutex — BLOCKS

    Note over H: High priority is now waiting for Low

    M->>M: Runs (preempts Low)
    Note over L: Low can't release mutex — it's preempted by Medium
    Note over H: High is starved — it has highest priority but can't run

    Note over H,L: Solution: Priority Inheritance
    Note over L: Temporarily boost Low to High's priority
    L->>L: Runs at high priority, releases mutex
    H->>H: Acquires mutex, runs
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
// Priority inversion in pseudocode:
//
// Low acquires lock(mutex)
// High tries to acquire lock(mutex) → BLOCKED (waiting for Low)
// Medium starts running (higher priority than Low)
// Medium runs... and runs... and runs...
// Low never gets CPU to release the lock
// High never gets the lock
// Watchdog fires → system reset
//
// Fix: priority inheritance
// When High blocks on a lock held by Low,
// temporarily boost Low's priority to High's level.
// Low preempts Medium, releases the lock, priority drops back.
// High acquires the lock and runs.

The Pathfinder team fixed it by enabling priority inheritance — a mechanism where a low-priority lock holder temporarily inherits the priority of the highest-priority waiter. The fix was uploaded to Mars.

The bounded inversion invariant: a high-priority process blocked on a resource held by a lower-priority process must not wait longer than the time it takes the holder to release the resource. Priority inheritance enforces this by preventing medium-priority processes from indefinitely preempting the lock holder.

Measuring the Scheduler

How do you know if the scheduler is doing its job? Three metrics:

Turnaround time — how long from process creation to completion. Lower is better for batch workloads.

Response time — how long from a process becoming ready to when it first runs. Lower is better for interactive workloads.

Fairness — how evenly CPU time is distributed. Jain’s fairness index ranges from 0 (completely unfair) to 1 (perfectly fair).

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
// The tradeoff triangle:
//
//        Turnaround
//           /\
//          /  \
//         /    \
//        /______\
// Response     Fairness
//
// FIFO:         Great turnaround (no switching overhead)
//               Terrible response time (convoy effect)
//
// Round Robin:  Great response time (everyone gets turns)
//               Worse turnaround (context switch overhead)
//
// CFS:          Great fairness (proportional sharing)
//               Good response time (low vruntime = scheduled soon)
//               Decent turnaround (weight-based priorities)

No scheduler wins all three. CFS comes closest to a balanced compromise, which is why it’s been Linux’s default for nearly two decades.

Putting It Together

The scheduler is the kernel’s answer to an impossible problem: make finite CPUs feel infinite. Its tools are simple — queues, timers, context switches — but the tradeoffs are deep.

PolicyStrengthWeakness
FIFOSimple, no switching overheadConvoy effect, poor interactivity
Round RobinFair, responsiveContext switch overhead, no priority
MLFQAdapts to behavior, prioritizes I/OComplex, needs anti-starvation boost
CFSMathematically fair, simple modelRed-black tree overhead, less tunable

The invariant that unifies them all:

Every runnable process eventually gets CPU time. The scheduler may delay you. It may give you less than you want. But it will never forget you exist. Starvation is the one failure mode that no correct scheduler permits.

Next up: virtual memory — how the kernel gives every process its own address space, translates virtual addresses to physical ones through page tables, and handles the moment when a process touches memory that isn’t there.


This is Post 4 of the series Invariants the Kernel Keeps — operating systems through the guarantees the kernel makes, and what happens when they break.

This post is licensed under CC BY 4.0 by the author.