Post

The Buffer Pool: Why Databases Don't Trust the OS

The OS has a page cache. The database ignores it and builds its own. This post explains why — and how the buffer pool decides what stays in RAM.

The Buffer Pool: Why Databases Don't Trust the OS

In the last post, we established the page as the atom of database I/O. Every read loads a page. Every write flushes a page. The question now is: do you really want to go to disk on every query?

A single SELECT might touch dozens of pages — index pages to navigate the B-Tree, data pages to fetch rows. If each page read costs 0.1 ms on an SSD (or 10 ms on a spinning disk), a simple query could take milliseconds of pure I/O. Run a thousand queries per second and you’re spending all your time waiting for the disk.

The solution is obvious: keep hot pages in memory. The non-obvious part is who manages that memory. The OS already has a page cache. It already caches disk blocks in RAM. Why not just use it?

Because the OS doesn’t know what you’re about to do next. The database does.

Why Not the OS Page Cache?

The OS page cache is a general-purpose LRU cache. It caches recently accessed disk blocks and evicts the least recently used ones when memory is full. For most applications, this works well enough. For databases, it’s wrong in three specific ways.

The double-buffering problem. If the database reads a page through normal file I/O, the OS caches it in the kernel page cache. The database then copies it into its own buffer pool. Same page, two copies, twice the memory. The OS doesn’t know the database already has it cached. The database doesn’t know the OS has it cached. Both hold on to it. Memory wasted.

The eviction problem. The OS uses a global LRU policy across all files, all applications. A log file being tailed, a config file being read, a build artifact being written — all compete for the same cache. The OS might evict a hot index page to cache a log file you’ll never read again. The database knows which pages are hot. The OS doesn’t.

The flush control problem. The database needs precise control over when dirty pages are written to disk. The WAL protocol demands it: the WAL record must hit disk before the dirty data page. The OS page cache flushes dirty pages whenever it wants — in whatever order it wants. The database can’t enforce WAL ordering through the OS cache without fsync on every write, which kills performance.

graph TB
    subgraph OS_Way["Using OS Page Cache"]
        direction TB
        APP1["Database"] -->|"read()"| KERNEL["Kernel Page Cache"]
        APP1 -->|"copy"| BUF1["Database Buffer (redundant)"]
        KERNEL --> DISK1["Disk"]
        NOTE1["Two copies of same page in RAM"]
    end

    subgraph DB_Way["Using Direct I/O + Buffer Pool"]
        direction TB
        APP2["Database"] -->|"O_DIRECT"| BUF2["Buffer Pool"]
        BUF2 --> DISK2["Disk"]
        NOTE2["One copy. Database controls everything."]
    end

    style OS_Way fill:#4a0000,stroke:#e94560,color:#fff
    style DB_Way fill:#003a1a,stroke:#16c79a,color:#fff

Most serious databases bypass the OS page cache entirely using O_DIRECT (Linux) or equivalent flags. This tells the OS: don’t cache this file, I’ll handle it myself. PostgreSQL is a notable exception — it uses the OS page cache and its own shared buffer pool, accepting the double-buffering tradeoff for portability. But even PostgreSQL controls flush ordering through careful fsync calls.

The buffer pool invariant: the database manages its own page cache, with full control over what stays in memory, what gets evicted, and when dirty pages are flushed.

The Buffer Pool

The buffer pool is an array of page-sized slots in memory — called frames. Each frame can hold exactly one page. A hash table maps page IDs to frame numbers. When the query executor needs a page, it asks the buffer pool. If the page is already in a frame (a hit), it’s returned immediately — no I/O. If it’s not (a miss), the buffer pool reads it from disk into a frame and then returns it.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
const BUFFER_SIZE: usize = 1024; // number of frames (= pages in memory)
const PAGE_SIZE: usize = 8192;

struct BufferPool {
    frames: Vec<[u8; PAGE_SIZE]>,          // the actual page data
    page_table: HashMap<u32, usize>,       // page_id → frame index
    dirty: Vec<bool>,                      // has this frame been modified?
    pin_count: Vec<u32>,                   // how many users are reading this frame?
    usage: Vec<u32>,                       // access counter for eviction
}

impl BufferPool {
    fn fetch_page(&mut self, page_id: u32) -> &[u8; PAGE_SIZE] {
        // Hit: page is already in memory
        if let Some(&frame) = self.page_table.get(&page_id) {
            self.pin_count[frame] += 1;
            self.usage[frame] += 1;
            return &self.frames[frame];
        }

        // Miss: need to load from disk
        let frame = self.find_victim();        // pick a frame to evict
        self.evict(frame);                     // flush if dirty, remove mapping
        self.frames[frame] = disk_read(page_id); // load new page
        self.page_table.insert(page_id, frame);
        self.pin_count[frame] = 1;
        self.dirty[frame] = false;
        self.usage[frame] = 1;
        &self.frames[frame]
    }
}
sequenceDiagram
    participant Q as Query Executor
    participant BP as Buffer Pool
    participant D as Disk

    Q->>BP: fetch_page(42)

    alt Page 42 in buffer pool (HIT)
        BP->>BP: pin frame, bump usage
        BP->>Q: return page data
        Note over D: No disk I/O
    else Page 42 not in buffer pool (MISS)
        BP->>BP: find victim frame
        BP->>D: flush victim (if dirty)
        BP->>D: read page 42
        D->>BP: page data
        BP->>BP: store in frame, pin, map page_id
        BP->>Q: return page data
    end

The buffer pool hit rate is the single most important performance metric in a database. A hit rate of 99% means 1 in 100 page requests goes to disk. At 95%, it’s 1 in 20 — five times more I/O. Production databases aim for 99%+ hit rates, and the buffer pool size is the primary tuning knob.

Pin and Unpin

A page in the buffer pool can’t be evicted while someone is using it. If the query executor is reading page 42 and the buffer pool evicts it mid-read, the executor gets garbage data.

The solution: pin counting. When a component fetches a page, the pin count increments. When it’s done, it unpins the page (decrements). A page with pin_count > 0 is immune to eviction. A page with pin_count == 0 is eligible.

1
2
3
4
5
6
7
8
9
10
11
12
13
// Using a page safely:
fn process_query(pool: &mut BufferPool) {
    let page = pool.fetch_page(42);   // pin_count: 0 → 1
    // ... read data from page ...
    pool.unpin_page(42);              // pin_count: 1 → 0
    // Page 42 is now evictable again
}

// A page pinned by two concurrent queries:
// Query A: fetch_page(42)   → pin_count: 1
// Query B: fetch_page(42)   → pin_count: 2
// Query A: unpin_page(42)   → pin_count: 1 (still pinned!)
// Query B: unpin_page(42)   → pin_count: 0 (now evictable)

The pin invariant: a page with pin_count > 0 is never evicted. The buffer pool guarantees that a pinned page remains valid in memory until every user unpins it.

A pin leak — forgetting to unpin a page — is the buffer pool equivalent of a memory leak. The page stays pinned forever, consuming a frame that can never be reused. Enough pin leaks and the buffer pool runs out of frames, stalling the entire database. This is why Rust’s ownership model is such a good fit for database internals: an RAII guard can automatically unpin on drop.

Eviction: Who Gets Kicked Out?

When the buffer pool is full and a new page needs to come in, some existing page must leave. The eviction policy decides which one. The goal: evict the page least likely to be needed again soon. Guess wrong and you cause an unnecessary disk read.

LRU (Least Recently Used)

The simplest useful policy. Maintain a linked list of all unpinned frames. Every time a page is accessed, move it to the front. When you need a victim, take from the back — the page that hasn’t been touched the longest.

graph LR
    subgraph LRU["LRU List (most recent → least recent)"]
        direction LR
        A["Page 42\n(just accessed)"] --> B["Page 7"] --> C["Page 15"] --> D["Page 3\n(evict this)"]
    end

    style D fill:#e94560,stroke:#e94560,color:#fff
    style A fill:#16c79a,stroke:#16c79a,color:#fff

LRU works well for most workloads. But it has a fatal weakness: sequential flooding. A full table scan reads every page in a table, once, in order. Each page enters the LRU, pushes out a hot page, and is never accessed again. A single sequential scan can flush the entire buffer pool of useful pages.

1
2
3
4
5
6
7
8
9
10
// Sequential scan of a 10,000-page table on a 1,000-frame buffer pool:
//
// Read page 0 → evict hot page, cache page 0
// Read page 1 → evict hot page, cache page 1
// ...
// Read page 999 → buffer pool is now 100% scan pages
// Read page 1000 → evict page 0 (will never be reused anyway)
// ...
// After scan: every hot page is gone. Hit rate drops to zero.
// The scan gained nothing — it read each page exactly once.

Clock (Second-Chance)

An approximation of LRU that’s cheaper and resists sequential flooding. Each frame has a reference bit. When a page is accessed, set the bit to 1. When looking for a victim, sweep through frames in a circle (like a clock hand):

  • If the reference bit is 1: clear it to 0, move on (second chance)
  • If the reference bit is 0: evict this page
graph TB
    subgraph Clock["Clock Sweep"]
        direction TB
        F0["Frame 0\nref=1"] --> F1["Frame 1\nref=0 ← EVICT"]
        F1 --> F2["Frame 2\nref=1"]
        F2 --> F3["Frame 3\nref=1"]
        F3 --> F0
    end

    HAND["Clock Hand 🕐"] --> F1

    style F1 fill:#e94560,stroke:#e94560,color:#fff
    style F0 fill:#16c79a,stroke:#16c79a,color:#fff
    style F2 fill:#16c79a,stroke:#16c79a,color:#fff
    style F3 fill:#16c79a,stroke:#16c79a,color:#fff

PostgreSQL uses a clock-based eviction policy. Pages accessed repeatedly keep getting their reference bit set back to 1, so they survive multiple sweeps. Pages from a sequential scan are accessed once, get one reference bit, and are evicted on the next sweep. The hot pages survive. The scan pages don’t.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
fn find_victim(&mut self) -> usize {
    loop {
        let frame = self.clock_hand;
        self.clock_hand = (self.clock_hand + 1) % self.frames.len();

        if self.pin_count[frame] > 0 {
            continue; // pinned — skip
        }

        if self.usage[frame] > 0 {
            self.usage[frame] -= 1; // second chance
            continue;
        }

        return frame; // usage == 0, unpinned — evict
    }
}

LRU-K and Other Policies

LRU-K tracks the last K accesses to each page and evicts the page whose K-th most recent access is oldest. LRU-2 (K=2) effectively separates “accessed once” from “accessed multiple times,” solving the sequential scan problem more precisely than clock.

MySQL’s InnoDB uses a variant: the buffer pool is split into a “young” and “old” sublist. New pages enter the old sublist. If they’re accessed again, they promote to the young sublist. A sequential scan fills the old sublist and gets evicted without disturbing the young (hot) pages.

Dirty Pages and the Checkpoint

A dirty page is one that’s been modified in the buffer pool but not yet written to disk. The buffer pool tracks which frames are dirty. Dirty pages can’t be simply discarded on eviction — they must be flushed to disk first, or the modification is lost.

But you can’t flush dirty pages whenever you want. The WAL protocol demands:

A dirty page cannot be flushed to the data file until its WAL record is on stable storage.

This is the WAL-first rule we covered in the pages post. Each page carries a Log Sequence Number (LSN) — the position of the most recent WAL record that modified it. Before flushing a dirty page, the buffer pool checks: has the WAL been flushed up to this LSN? If not, flush the WAL first.

sequenceDiagram
    participant BP as Buffer Pool
    participant WAL as WAL (disk)
    participant DF as Data File (disk)

    Note over BP: Dirty page 42, LSN = 500

    BP->>WAL: Is WAL flushed to LSN 500?

    alt WAL flushed to LSN ≥ 500
        BP->>DF: Write page 42
        BP->>BP: Mark page 42 clean
    else WAL only flushed to LSN 450
        BP->>WAL: Flush WAL to LSN 500
        WAL-->>BP: Done
        BP->>DF: Write page 42
        BP->>BP: Mark page 42 clean
    end

A checkpoint is a periodic operation where the database flushes all dirty pages to disk and records the current WAL position. After a checkpoint, recovery only needs to replay WAL records after the checkpoint LSN — not the entire log. Without checkpoints, the WAL grows forever and recovery takes longer and longer.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
fn checkpoint(&mut self) {
    let checkpoint_lsn = self.wal.current_lsn();

    // Flush all dirty pages
    for frame in 0..self.frames.len() {
        if self.dirty[frame] {
            // Ensure WAL is flushed past this page's LSN
            let page_lsn = self.get_page_lsn(frame);
            self.wal.flush_to(page_lsn);

            self.flush_to_disk(frame);
            self.dirty[frame] = false;
        }
    }

    // Record checkpoint position
    self.wal.write_checkpoint_record(checkpoint_lsn);

    // WAL before checkpoint_lsn can now be recycled
}

The Numbers

Buffer pool sizing is the most impactful tuning decision in database administration:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
// PostgreSQL shared_buffers examples:
//
// Buffer pool: 256 MB (default-ish)
// Pages cached: 256 MB / 8 KB = 32,768 pages
//
// Buffer pool: 8 GB (production server)
// Pages cached: 8 GB / 8 KB = 1,048,576 pages (~1 million)
//
// If your working set (hot pages) fits in the buffer pool:
//   → 99%+ hit rate, most queries never touch disk
//
// If your working set exceeds the buffer pool:
//   → Eviction thrashing, constant disk I/O, slow queries
//
// Rule of thumb: set shared_buffers to 25% of system RAM.
// Let the OS page cache handle the rest as a second-level cache.

Putting It Together

The buffer pool sits between every query and the disk. It is the most performance-critical component in the database.

DecisionImpact
Buffer pool sizeDetermines hit rate — the biggest performance factor
Eviction policyDetermines which pages survive under pressure
Pin disciplinePrevents eviction of in-use pages
Dirty page managementEnsures WAL ordering, enables crash recovery
Checkpoint frequencyTradeoff between recovery time and I/O overhead

The buffer pool invariant: hot pages stay in memory, dirty pages are flushed in WAL order, and pinned pages are never evicted. The database controls all three — never the OS.

Next up: B-Trees — the data structure that makes indexed lookups possible. How the database organizes millions of keys into a balanced tree where every search is O(log n), and how it keeps the tree balanced through splits and merges without violating any invariant.


This is Post 3 of the series Invariants the Storage Engine Keeps — databases through the guarantees they make about your data, and what breaks when they can’t.

This post is licensed under CC BY 4.0 by the author.