Database Internals · module 2 of 8
Module 02 ~13 min · the storage layer

Pages & Extents
The storage floor plan

Disks don't know about rows, tables, or users. They know about bytes at offsets. The storage engine's whole job is to translate the messy world of variable-sized rows into the rigid world of fixed-size blocks.

📦 Hold this picture

A self-storage warehouse. Every unit is exactly the same size — a page, 4096 bytes. Units are grouped into blocks of 8 (an extent) because it's cheaper for the moving truck (the disk) to pick up a whole block than make 8 separate trips. A front-desk clipboard (the page cache) remembers which units were opened most recently, so you don't walk to the back for every lookup.

◆ The key insight

Databases use fixed-size pages because disks move data in fixed-size blocks. Matching the page size to the disk's native block size means every page read or write maps cleanly onto whole blocks, which is one big reason databases feel fast.

Why should you care?

When someone says "this query is slow because of too many page reads," or when you're picking a cacheSize, you need to know what a page is. Knowing the size also sanity-checks AI suggestions — "cache 1000 pages" means ~4 MB, not 4 GB.
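The sanity check above is plain multiplication, and a few lines make it hard to get wrong. This is a minimal sketch: the `CacheSizing` class and its method names are hypothetical, and only the 4096-byte page size comes from this module.

```csharp
using System;

// Hypothetical helper for estimating cache memory; not part of the module's code.
public static class CacheSizing
{
    public const int PageSize = 4096; // bytes per page, as defined in this module

    // Bytes of page data held by a cache of pageCount pages.
    // The cast to long avoids int overflow for very large caches.
    public static long CacheBytes(int pageCount) => (long)pageCount * PageSize;

    // The same figure in whole KiB (1 KiB = 1024 bytes).
    public static long CacheKiB(int pageCount) => CacheBytes(pageCount) / 1024;
}
```

With these definitions, `CacheSizing.CacheBytes(1000)` is 4,096,000 bytes, which is about 4 MB. The figure covers page data only; any real cache adds some per-entry overhead on top.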

01

Four ideas that run the floor

4096 bytes

Page

The atom of storage. Everything on disk is a page. Nothing smaller is ever read or written.

8 pages = 32 KB

Extent

A neighborhood of 8 adjacent pages, fetched as a group for locality — one truck trip, not eight.

one bit

Dirty flag

Says "this page has changes in memory that haven't hit disk yet." Lets writes be batched.

least recently used

LRU cache

Keeps hot pages in RAM. When full, evicts the one nobody has touched for the longest.

02

Find any page on disk

Every page has a global id. From it, the engine computes which extent the page lives in (floor(pageId / 8)) and its exact byte offset in the file (pageId × 4096). Click a page, or type one, and watch the math.

Storage floor plan: 4 extents · 8 pages each · 4 KB/page

03

A page, in code

Storage/Page.cs
public class Page
{
    public const int PageSize = 4096; // 4KB pages

    public int PageId { get; set; }
    public byte[] Data { get; set; }
    public bool IsDirty { get; set; }

    public Page(int pageId)
    {
        PageId = pageId;
        Data = new byte[PageSize];
        IsDirty = false;
    }
}
In plain English

PageSize = 4096 — a hard constant. Every page is exactly 4 KB, no exceptions.

PageId — the page's global number. Multiply by 4096 and you have its offset in the file.

Data — the raw bytes. Rows, tree nodes, anything — it's all just bytes in here.

IsDirty — the dirty flag. true means "modified in memory, needs writing to disk."
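To make the "multiply by 4096" rule concrete, here is a rough sketch of moving one page between a file and memory. `PageFile`, `OffsetOf`, `WritePage`, and `ReadPage` are hypothetical names, not part of the module's code, and the sketch assumes the file has no header, so page N starts at byte N × 4096.

```csharp
using System;
using System.IO;

// Hypothetical helper; assumes page N starts at byte N * 4096 (no file header).
public static class PageFile
{
    public const int PageSize = 4096;

    public static long OffsetOf(int pageId) => (long)pageId * PageSize;

    public static void WritePage(string path, int pageId, byte[] data)
    {
        if (data.Length != PageSize)
            throw new ArgumentException("A page must be exactly PageSize bytes.", nameof(data));

        using var file = new FileStream(path, FileMode.OpenOrCreate, FileAccess.Write);
        file.Seek(OffsetOf(pageId), SeekOrigin.Begin);   // jump straight to the page
        file.Write(data, 0, PageSize);
    }

    public static byte[] ReadPage(string path, int pageId)
    {
        var data = new byte[PageSize];
        using var file = new FileStream(path, FileMode.Open, FileAccess.Read);
        file.Seek(OffsetOf(pageId), SeekOrigin.Begin);

        // Stream.Read may return fewer bytes than requested, so loop until full.
        int filled = 0;
        while (filled < PageSize)
        {
            int n = file.Read(data, filled, PageSize - filled);
            if (n == 0)
                throw new EndOfStreamException($"Page {pageId} is not fully present in the file.");
            filled += n;
        }
        return data;
    }
}
```

Reopening the file on every call keeps the sketch short; a real engine would most likely keep one handle open and reuse it.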

04

An extent groups eight pages

Storage/Extent.cs
public class Extent
{
    public const int PagesPerExtent = 8;
    public int ExtentId { get; set; }
    public Page[] Pages { get; set; }

    public Extent(int extentId)
    {
        ExtentId = extentId;
        Pages = new Page[PagesPerExtent];

        for (int i = 0; i < PagesPerExtent; i++)
        {
            int pageId = extentId * PagesPerExtent + i;
            Pages[i] = new Page(pageId);
        }
    }
}
The address math

An extent holds an array of exactly 8 pages.

The line extentId * PagesPerExtent + i is the whole trick. It turns an extent number and a slot into a global page id.

Example: page 17 lives in extent floor(17/8) = 2, slot 17 % 8 = 1. Exactly what the explorer above showed you.
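The same math can be written in both directions. The sketch below is illustrative only: `PageAddress`, `Locate`, and `ToPageId` are hypothetical names, while the constants match `Page.PageSize` and `Extent.PagesPerExtent` from the listings above.

```csharp
using System;

// Hypothetical helper showing the address math in both directions.
public static class PageAddress
{
    public const int PageSize = 4096;
    public const int PagesPerExtent = 8;

    // Global page id -> (extent, slot inside the extent, byte offset in the file).
    public static (int ExtentId, int Slot, long ByteOffset) Locate(int pageId)
    {
        if (pageId < 0) throw new ArgumentOutOfRangeException(nameof(pageId));

        int extentId = pageId / PagesPerExtent;    // integer division floors for non-negative ids
        int slot = pageId % PagesPerExtent;
        long byteOffset = (long)pageId * PageSize; // assumes the file has no header
        return (extentId, slot, byteOffset);
    }

    // (extent, slot) -> global page id, the same expression the Extent constructor uses.
    public static int ToPageId(int extentId, int slot) => extentId * PagesPerExtent + slot;
}
```

For page 17, `PageAddress.Locate(17)` gives extent 2, slot 1, and byte offset 69,632, and `PageAddress.ToPageId(2, 1)` maps back to 17.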

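Why 4 KB? Many SSDs and most OS page caches work in 4 KB blocks. Matching the database page size to that block size means one database page maps onto exactly one aligned block, so reading a page never drags in a second block or part of a neighbor. Production engines often pick larger pages, such as 8 KB or 16 KB, but they keep them a multiple of 4 KB for the same reason.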

05

The LRU cache keeps pages hot

RAM is orders of magnitude faster than disk: roughly a thousand times faster than an SSD read, and far more than that against a spinning disk. The cache holds recently used pages so most lookups never touch the disk at all.

Storage/PageCache.cs — Put (LRU)
public void Put(int pageId, Page page)
{
    lock (_lockObject)
    {
        if (_cache.TryGetValue(pageId, out var node))
        {
            node.Page = page;
            MoveToHead(node);     // touched = most recent
            return;
        }

        var newNode = new CacheNode(page);
        _cache[pageId] = newNode;
        AddToHead(newNode);       // newest at the head

        if (_cache.Count > _capacity)
        {
            var removed = RemoveTail();   // evict the coldest
            _cache.TryRemove(removed.Page.PageId, out _);
        }
    }
}
The head/tail dance

Picture a doubly-linked list. The head is the most recently used page; the tail is the coldest.

Already cached? MoveToHead — touching a page makes it most-recent.

New page? AddToHead, then if we're over capacity, RemoveTail evicts whatever nobody has touched in the longest time.

That's LRU in eight lines — and it's why a well-sized cache makes the disk almost disappear.

Dirty pages are lazy. A page can be modified in memory dozens of times and written to disk only once — when a flush is requested. That batching is a big reason databases are fast: disk writes are expensive, so you do as few as possible.
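One way to picture that batching is the sketch below. It is not the module's flush code: `PageFlusher` is a hypothetical name, and the small `Page` class is a stand-in for the one shown earlier so the example compiles on its own.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Minimal stand-in for the module's Page class.
public class Page
{
    public const int PageSize = 4096;
    public int PageId { get; }
    public byte[] Data { get; } = new byte[PageSize];
    public bool IsDirty { get; set; }
    public Page(int pageId) { PageId = pageId; }
}

// Hypothetical flusher: writes only the pages whose dirty flag is set.
public static class PageFlusher
{
    // Returns how many pages were actually written.
    public static int Flush(IEnumerable<Page> pages, Stream file)
    {
        int written = 0;
        foreach (var page in pages)
        {
            if (!page.IsDirty) continue;   // clean pages cost no I/O

            file.Seek((long)page.PageId * Page.PageSize, SeekOrigin.Begin);
            file.Write(page.Data, 0, Page.PageSize);
            page.IsDirty = false;          // memory and disk agree again
            written++;
        }

        // Pushes buffered bytes to the underlying stream. With a FileStream,
        // Flush(true) is the overload that also asks the OS to write through to the device.
        file.Flush();
        return written;
    }
}
```

If a page is modified thirty times between two flushes, `Flush` still writes it once, and a second call with no new changes writes nothing.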

06

Drive the cache: hits, misses & evictions

Reading the code is one thing — feeling the cache behave is another. Below is a live LRU cache with room for just 4 pages (kept tiny so evictions happen fast). Request a page by id: a hit finds it and slides it to the head; a miss loads it from disk and, when the cache is full, evicts the coldest page from the tail. Watch the hit-rate climb when you re-request pages you've touched recently.

LRU page cache: capacity 4 · head = most recent · tail = next to be evicted

Try this: request 3, 7, 1, 4 (all hits, because they're pre-loaded), then request 9. The cache is full, so the page you haven't touched for the longest gets evicted to make room. Here that is page 3, which is sitting at the tail. That single decision, repeated on every miss, is what keeps a database's working set in fast memory.
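The same experiment can be replayed without the widget. The sketch below tracks page ids only, using the standard `LinkedList<int>` and `Dictionary` collections. `LruTracker` is a hypothetical class, not the module's `PageCache`, and it leaves out locking and the `Page` objects themselves.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical id-only LRU tracker with hit and miss counters.
public class LruTracker
{
    private readonly int _capacity;
    private readonly LinkedList<int> _order = new LinkedList<int>();   // head = most recent
    private readonly Dictionary<int, LinkedListNode<int>> _nodes =
        new Dictionary<int, LinkedListNode<int>>();

    public int Hits { get; private set; }
    public int Misses { get; private set; }

    public LruTracker(int capacity)
    {
        if (capacity < 1) throw new ArgumentOutOfRangeException(nameof(capacity));
        _capacity = capacity;
    }

    // Requests a page. Returns the evicted page id, or null if nothing was evicted.
    public int? Request(int pageId)
    {
        if (_nodes.TryGetValue(pageId, out var node))
        {
            Hits++;
            _order.Remove(node);       // O(1) because we hold the node itself
            _order.AddFirst(node);     // touched = most recent
            return null;
        }

        Misses++;                      // a real cache would load the page from disk here
        _nodes[pageId] = _order.AddFirst(pageId);
        if (_nodes.Count <= _capacity) return null;

        int coldest = _order.Last.Value;   // tail = least recently used
        _order.RemoveLast();
        _nodes.Remove(coldest);
        return coldest;
    }
}
```

Starting from an empty tracker with capacity 4, requesting 3, 7, 1, 4 produces four misses, requesting them again produces four hits, and `Request(9)` then returns 3, the page at the tail.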

07

Drag each page to its extent

Use the math: extent = floor(pageId / 8). Drag the chips into the right zone, then check.

Pages to place: Page 0, Page 7, Page 8, Page 15, Page 23

Zones: Extent 0 (pages 0–7) · Extent 1 (pages 8–15) · Extent 2 (pages 16–23)

08

Check yourself

Calculation
If the cache holds 100 pages at 4 KB each, how much RAM is that?
Correct. 100 × 4 KB = 400 KB. The common trap is misreading the units — pages are tiny, so even a generous cache is cheap.
Scenario
You've written 10,000 rows but never called Flush(). The power is cut. What survives?
Right. Dirty pages live in RAM until a flush, so the unflushed page changes themselves are gone. The WAL is the safety net: it's the one thing written durably as you go, so only what reached the WAL survives. That's the whole subject of Module 5.
Architecture
Why does the cache evict the least recently used page, not the least frequently used?
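Exactly. A move-to-head is O(1) and needs no extra state. Counting frequency means keeping a counter per page and keeping pages ordered by that count on every single access. LRU is a cheap, good-enough guess at which pages will be needed again, and cheap wins in a hot path.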

Up next: we know where bytes live. The next question is the one every database must answer billions of times a day — how do you find one specific row among millions, fast? Enter the B+ tree.