Pages: The Fixed-Size Bet Every Database Makes
Databases don't read bytes. They read pages — fixed-size blocks that turn random I/O into something the disk can handle. This post explains why.
In the last post, we named the four promises every database makes: atomicity, consistency, isolation, durability. But promises are cheap. The question is how — and the answer starts at the very bottom. Before transactions. Before indexes. Before SQL. There’s a data structure so fundamental that every database — relational, document, key-value, graph — uses it.
The page.
The Disk Problem
Your program thinks in bytes. Read this byte. Write those four bytes. But the hardware underneath doesn’t work that way.
A hard disk reads and writes in sectors — typically 512 bytes or 4 KB. You can’t read a single byte from a spinning disk. You read an entire sector, get your byte, and ignore the rest. SSDs work in pages too — typically 4 KB reads, 16 KB or larger erase blocks. The hardware forces block-level access.
Databases could fight this. They could request individual bytes and let the OS figure it out. But that would mean trusting the OS page cache to make intelligent caching decisions about data the OS knows nothing about. Instead, databases embrace it. They align their own data structures to fixed-size blocks — pages — and manage I/O in page-sized chunks.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
// The database doesn't do this:
fn read_row(offset: u64, len: usize) -> Vec<u8> {
// Read arbitrary bytes from disk
// Let the OS cache handle it
// Hope for the best
}
// It does this:
const PAGE_SIZE: usize = 8192; // 8 KB — PostgreSQL's default
fn read_page(page_id: u32) -> [u8; PAGE_SIZE] {
let offset = page_id as u64 * PAGE_SIZE as u64;
// Read exactly one page — aligned, predictable, cacheable
}
graph LR
subgraph Wrong["Byte-level I/O"]
direction TB
R1["Read 37 bytes at offset 1042"]
R2["Read 8 bytes at offset 9999"]
R3["Read 200 bytes at offset 51200"]
end
subgraph Right["Page-level I/O"]
direction TB
P1["Read page 0 (bytes 0–8191)"]
P2["Read page 1 (bytes 8192–16383)"]
P3["Read page 6 (bytes 49152–57343)"]
end
style Wrong fill:#4a0000,stroke:#e94560,color:#fff
style Right fill:#003a1a,stroke:#16c79a,color:#fff
Databases read pages, not bytes. Every I/O operation moves exactly one page. This alignment makes caching predictable and I/O efficient.
The page invariant: all data is organized into fixed-size pages, and all disk I/O operates on whole pages. No partial reads. No partial writes. One page in, one page out.
Why Fixed Size?
Variable-size blocks sound more flexible. Store a 50-byte row in a 50-byte block, a 5 KB row in a 5 KB block. No wasted space. But variable sizes create three problems that fixed sizes solve instantly:
Addressing. With fixed-size pages, finding page N is multiplication: offset = N * PAGE_SIZE. O(1). No index needed. With variable sizes, you need a separate map from page ID to offset — another data structure to maintain, another thing that can corrupt.
Replacement. When you update a row and the new version is larger, a variable-size block must be relocated. That means updating every pointer to it. With fixed-size pages, you write the modified page back to the same location. Same offset. No pointers change.
Buffer management. The buffer pool (in-memory cache) keeps an array of page-sized slots. Every slot fits every page. No fragmentation. No compaction. A page evicted from slot 7 is replaced by a different page in slot 7. Simple and fast.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
// Fixed-size pages make the buffer pool trivial:
struct BufferPool {
pages: Vec<[u8; PAGE_SIZE]>, // every slot is the same size
page_table: HashMap<u32, usize>, // page_id → slot index
}
// Evict page from slot 3, load page 42 into slot 3:
fn replace(&mut self, slot: usize, new_page_id: u32) {
if self.is_dirty(slot) {
self.flush_to_disk(slot); // write old page back
}
self.pages[slot] = self.read_from_disk(new_page_id);
self.page_table.insert(new_page_id, slot);
}
// No resizing. No compaction. No fragmentation.
The tradeoff is internal fragmentation — a page with only 100 bytes of actual data wastes the other 8,092 bytes. Databases accept this cost because the alternative (variable-size management) is far more expensive in complexity, bugs, and performance.
Anatomy of a Page
A page isn’t just a bag of bytes. It has structure. The most common layout is the slotted page — used by PostgreSQL, SQLite, MySQL (InnoDB), and most others.
graph TB
subgraph Page["Page (8 KB)"]
direction TB
H["Page Header\n(page ID, checksum, free space pointers, LSN)"]
SA["Slot Array\n(grows →)"]
FREE["Free Space"]
DATA["Row Data / Tuples\n(← grows)"]
end
SA -->|"slot 0: offset 8120"| DATA
SA -->|"slot 1: offset 7980"| DATA
SA -->|"slot 2: offset 7840"| DATA
style Page fill:#1a1a2e,stroke:#e94560,color:#fff
style H fill:#0f3460,stroke:#16213e,color:#fff
style FREE fill:#16213e,stroke:#0f3460,color:#fff
Page Header — metadata about the page itself. The page ID, a checksum (to detect corruption), a pointer to the start of free space, and the Log Sequence Number (LSN) — the position in the WAL of the last modification to this page. The LSN is critical for crash recovery: it tells the database whether this page is up-to-date or needs WAL replay.
Slot Array — a list of fixed-size pointers, growing from the top of the page downward. Each slot contains the offset and length of one row within the page. Slot 0 points to the first row, slot 1 to the second, and so on.
Row Data (Tuples) — the actual rows, packed from the bottom of the page upward. Each row contains the column values, null bitmaps, and row-level metadata (transaction visibility info for MVCC).
Free Space — the gap between the slot array and the row data. As you insert rows, the slot array grows down and the row data grows up. When they meet, the page is full.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
// Simplified slotted page layout
struct PageHeader {
page_id: u32,
checksum: u32,
lsn: u64, // log sequence number
num_slots: u16,
free_space_start: u16, // end of slot array
free_space_end: u16, // start of row data
}
struct Slot {
offset: u16, // where the row starts (from page bottom)
length: u16, // row size in bytes
}
// To find row N:
// 1. Read slot N from the slot array → get (offset, length)
// 2. Read `length` bytes starting at `offset`
// O(1) access to any row on the page.
Why the Slot Array Matters
The slot array exists for one reason: indirection. External structures (indexes, other pages) reference rows as (page_id, slot_number) — never as (page_id, byte_offset). This means the database can move rows within a page — to compact free space or accommodate updates — without updating any external reference.
graph LR
subgraph Index["B-Tree Index"]
REF["(page 5, slot 2)"]
end
subgraph Before["Page 5 (before compaction)"]
S1B["Slot 2 → offset 7840"]
end
subgraph After["Page 5 (after compaction)"]
S1A["Slot 2 → offset 7900"]
end
REF --> S1B
REF --> S1A
style Index fill:#0f3460,stroke:#16213e,color:#fff
style Before fill:#1a1a2e,stroke:#e94560,color:#fff
style After fill:#1a1a2e,stroke:#16c79a,color:#fff
The index says “page 5, slot 2.” It doesn’t care where on the page the row lives. The slot array handles the translation. Rows move; references don’t break.
Without indirection, every row relocation would require updating every index entry and every foreign key reference that points to it. With a table that has five indexes, moving one row means five index updates. With the slot array, moving a row means updating one 4-byte slot entry. The difference scales catastrophically.
The indirection invariant: external references use (page_id, slot_number), never raw byte offsets. Rows can be relocated within a page without breaking any external pointer.
Page Types
Not every page holds rows. A database file is a sequence of pages, each with a type:
| Page Type | Contents | Purpose |
|---|---|---|
| Data page | Row tuples + slot array | Store table data |
| Index page | Key-pointer pairs | B-Tree nodes for fast lookup |
| Free space map | Bitmap of available space per page | Find pages with room for inserts |
| Visibility map | Bitmap of all-visible pages | Skip pages during VACUUM |
| TOAST page | Oversized column values | Store data that doesn’t fit in a row |
| Meta page | Database/table metadata | Root page pointers, statistics |
1
2
3
4
5
6
7
8
9
10
// A PostgreSQL table with one index has at minimum:
//
// Page 0: Meta page (table metadata)
// Page 1-N: Data pages (actual rows)
// Page N+1: Free space map (which pages have room?)
// Page N+2: Visibility map (which pages are fully visible?)
//
// Plus separate file for the index:
// Page 0: Meta page (index root pointer)
// Page 1-M: Index pages (B-Tree nodes)
Every page type uses the same fixed size. The header identifies the type. The buffer pool doesn’t care about types — it caches pages, period. This uniformity is what makes the system manageable.
The Heap File
Table data lives in a heap file — an unordered collection of pages. When you INSERT a row, the database finds a page with enough free space (consulting the free space map) and appends the row. There’s no inherent order. Row 1 might be on page 50. Row 2 might be on page 3.
graph TB
subgraph Heap["Heap File (unordered)"]
direction LR
P0["Page 0\n(rows 5, 12, 3)"]
P1["Page 1\n(rows 1, 8, 15)"]
P2["Page 2\n(rows 9, 2, 11)"]
P3["Page 3\n(rows 7, 14, 4)"]
end
FSM["Free Space Map"] -->|"page 2 has room"| P2
style Heap fill:#1a1a2e,stroke:#e94560,color:#fff
style FSM fill:#0f3460,stroke:#16213e,color:#fff
This seems chaotic, but it’s intentional. An ordered file would require shifting data on every insert to maintain sort order — O(n) per insert. A heap file appends in O(1). The cost is that a full table scan must read every page. But that’s what indexes are for — and indexes are just another set of pages with a different internal structure.
1
2
3
4
5
6
7
8
9
// Full table scan: read every page in the heap
// SELECT * FROM users;
// → Read page 0, page 1, page 2, ... page N
// → Cost: N page reads (sequential I/O — fast)
// Indexed lookup: read a few index pages, then one data page
// SELECT * FROM users WHERE id = 42;
// → Read index root page → internal page → leaf page → data page
// → Cost: 3-4 page reads (random I/O — but very few)
What Happens on a Write
When you UPDATE a row, the database doesn’t write directly to the data file on disk. That would violate the WAL protocol. Instead:
- Log the change — write a WAL record describing the modification
- Modify the in-memory page — update the page in the buffer pool, mark it dirty
- Eventually flush — the background writer or checkpoint process writes dirty pages to disk
The page on disk may be stale — it might not reflect the latest changes. That’s fine. The WAL has the truth. If the database crashes, it replays the WAL from the last checkpoint and brings every page up to date.
sequenceDiagram
participant C as Client
participant BP as Buffer Pool (RAM)
participant WAL as WAL (disk)
participant DF as Data File (disk)
C->>BP: UPDATE row on page 5
BP->>WAL: Log: "page 5, slot 2, old→new"
Note over WAL: fsync WAL
BP->>BP: Modify page 5 in memory (dirty)
BP->>C: OK
Note over BP,DF: Later (checkpoint)...
BP->>DF: Write dirty page 5 to disk
Note over DF: Page 5 now matches WAL
The WAL-first invariant: no modified page is written to the data file until its WAL record is on stable storage. This is called the WAL protocol, or write-ahead logging. It’s the reason databases can crash at any point and recover correctly. The WAL is the source of truth. The data files are a cached materialization.
Checksums: Detecting the Silent Killer
Disks lie. They report successful writes that never made it. They flip bits silently. They return stale data from a cached sector. These are silent data corruption events, and they happen more often than you’d think — studies show bit error rates of 1 in 10^14 to 10^15 for enterprise drives.
Databases defend against this with page-level checksums. Every time a page is written to disk, the database computes a checksum of the page contents and stores it in the header. Every time a page is read from disk, the checksum is verified.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
// Write path:
fn write_page(page: &mut Page) {
page.header.checksum = crc32(&page.data);
disk_write(page.header.page_id, page);
}
// Read path:
fn read_page(page_id: u32) -> Result<Page, CorruptionError> {
let page = disk_read(page_id);
let expected = crc32(&page.data);
if page.header.checksum != expected {
return Err(CorruptionError::ChecksumMismatch {
page_id,
expected,
actual: page.header.checksum,
});
// Don't serve corrupt data. Ever.
}
Ok(page)
}
The integrity invariant: every page read from disk is verified against its checksum. Corrupt pages are detected, never silently served.
PostgreSQL added full-page checksums in version 9.3. Before that, a bit flip on disk could silently corrupt a row and you’d find out weeks later — or never. The checksum doesn’t prevent corruption. It makes corruption visible, which is the difference between a recoverable incident and a silent catastrophe.
The Numbers
Understanding pages means understanding the scale of I/O:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
// PostgreSQL defaults: 8 KB pages
// A table with 1 million rows, ~200 bytes each:
//
// Rows per page: 8192 / 200 ≈ 40 rows
// Pages needed: 1,000,000 / 40 = 25,000 pages
// Disk space: 25,000 × 8 KB = ~200 MB
//
// Full table scan: read 25,000 pages sequentially
// At 1 GB/s sequential read: ~200 ms
//
// Indexed lookup (B-Tree, 3 levels):
// Read 3 index pages + 1 data page = 4 page reads
// At 0.1 ms per random read (SSD): ~0.4 ms
//
// Full scan: 200 ms
// Index lookup: 0.4 ms
// Ratio: 500x
//
// This is why indexes exist.
Every query the optimizer evaluates comes down to this question: how many pages do I need to read? A sequential scan reads all of them. An index lookup reads a handful. A bad query plan is one that reads more pages than necessary. A good query plan is one that reads the minimum.
Putting It Together
The page is the atom of the database. Everything above it — buffer pools, B-Trees, WAL, transactions — operates in terms of pages. Understanding pages means understanding:
- Why databases use 4 KB, 8 KB, or 16 KB pages (aligned with disk hardware)
- Why row size matters (rows that span pages are expensive — TOAST exists for this)
- Why sequential scans can be fast (sequential page reads, prefetchable)
- Why random lookups need indexes (random page reads, each one a separate I/O)
- Why checksums catch corruption (one checksum per page, verified on every read)
The page is the contract between the database and the disk. Fixed size. Aligned. Checksummed. Every invariant the database enforces — atomicity, durability, isolation — is ultimately enforced at the page level, because pages are the unit of I/O.
Next up: the buffer pool — the in-memory cache that stands between your queries and the disk. How the database decides which pages to keep in RAM, which to evict, and why it doesn’t trust the OS to do this job.
This is Post 2 of the series Invariants the Storage Engine Keeps — databases through the guarantees they make about your data, and what breaks when they can’t.
Stay in the loop
Subscribe via RSS to get new posts on systems, Rust, and cryptography.
Subscribe to RSS