Physical Layer

The way data is organized at the physical layer drives the entire database workflow.

Page

A page is a fixed-size container (typically a few kilobytes, e.g., PostgreSQL defaults to 8kB) holding multiple data entries, which can be either rows or index records.

A page generally consists of three parts: Header, Tuples, and a Line Pointer Array (LPA).

Header

The header primarily holds metadata for data integrity, recovery, and other management purposes. For now, we don’t need to worry about its details.

PageHeader (Metadata)

Tuple

The actual data resides in the Tuples section as sequentially packed records.

We call them tuples rather than records because tuples are immutable. To update a tuple, the database creates a new version, essentially an overriding tuple.

For example, the third tuple is an updated version of the first one.

PageHeader (Metadata)Tuples (Data)(ID=1, Name="John")(ID=3, Name="Charlie")(ID=1, Name="John Doe")

There are two main reasons for the immutability:

  1. A new tuple might be larger or smaller than the existing one (for example, a resizable VARCHAR field). If we tried to update it in place, subsequent tuples would need to shift to accommodate size changes, much like deleting an element from an array, which can be costly.

    Header (Metadata)TuplesOld TuplesUpdated Tuples(ID=1, Name="John")(ID=3, Name="Charlie")(ID=1, Name="John Doe")(ID=3, Name="Charlie")MovedIncreased
  2. To support Multi-version Concurrency Control (MVCC), where transactions can access different versions of the same data concurrently. We’ll cover this more deeply in the Concurreny Control topic.

Line Pointer Array (LPA)

Between the header and the tuple data lies the Line Pointer Array (LPA), which holds fixed-size pointers to the actual tuples on the page. Each pointer also tracks the state of its tuple (e.g., Updated, Deleted).

PageHeader (Metadata)Line Pointer ArrayTuples (Data)(1, Updated)(2, Normal)(3, Deleted)(ID=1, Name="John")(ID=3, Name="Charlie")(ID=1, Name="John Doe")

Since tuple sizes can vary (primarily due to text fields), utilizing fixed-size pointers within the Line Pointer Array (LPA) guarantees fast, random access to any specific tuple. Retrieving a pointer in this manner is analogous to accessing an element in an array:

pointer[i] = page_address + pointer_size * (i - 1)

The primary function of the LPA is to maintain the stability of tuple references. Even if a row is relocated among different tuples within the page, its corresponding line pointer remains constant, ensuring consistent referencing.

Heap

A heap is simply a collection of pages representing a table. The term heap means essentially no structure.

When inserting a record into a heap, the database looks for the first available slot in the first available page. For example:

HeapINSERT INTO table VALUES (4, 'Tom')Page 1Page 2Header (Metadata)Line Pointer ArrayTuples (Data)Header (Metadata)Line Pointer ArrayTuples (Data)(1, Normal)(2, Normal)(ID=3, Name="John")(ID=1, Name="Charlie")(1, Normal)(ID=2, Name="Mary")

Tuple ID (TID)

External elements, such as other tables, reference data tuples using a Tuple ID (TID). A TID is a composite identifier consisting of: (page number, LPA index).

Fixed-size pages make random access efficient, for example, to retrieve the tuple at: TID (page = 2, index = 1); The database can jump straight to it via: Heap.Pages[2].LPA[1] -> Tuple.

HeapTID (page = 2, index = 1)Page 1Page 2Header (Metadata)Line Pointer ArrayTuples (Data)Header (Metadata)Line Pointer ArrayTuples (Data)(1, Normal)(2, Normal)(ID=3, Name="John")(ID=1, Name="Charlie")(1, Normal)(2, Normal)(ID=2, Name="Mary")(ID=4, Name="Tom")

Heap Search

However, a heap isn’t particularly helpful for searching. There are two effective ways to access a tuple in a heap:

  1. Using its TID.
  2. Scanning the entire heap (i.e., a full table scan) 🥲.

Index

Messy heaps aren’t ideal for fast tuple lookups, this is where the Index comes to the rescue. In short, an index is an auxiliary data structure built alongside a table to enable efficient data retrieval.

Unlike tables, indexes are organized using specific data structures tailored for quick lookups. In this section, we’ll focus on the most commonly used type: the B-Tree Index.

B-Tree Index

A B-Tree is an advanced form of a Binary Search Tree. Instead of storing one key per node like a binary tree, it groups multiple keys in a single node, increasing the branching factor and reducing the overall height of the tree.

For example, a B-Tree might store up to 5 elements per node. Each node either contains actual values or pointers to its child nodes.

3560102040508890100

In SQL , an index page corresponds to a B-Tree node. Each page can point either to child index pages or to actual table records via their TIDs. A lookup operation traverses this structure, starting from the root page, it moves down through child pages until it locates the desired tuplee.

IndexHeapPage 1 (Root)Page 2Page 3Page 1Page 2Page 3Id = 1, (Page = 1, Page Offset = 2)Page 2Id = 3, (Page = 1, Page Offset = 1)Page 3Id = 2, (Page = 2, Page Offset = 1)Id = 4, (Page = 3, Page Offset = 1)(ID=3, Name="John")(ID=1, Name="Charlie")(ID=2, Name="John")(ID=4, Name="Tom")

Notably, using indexes introduces a trade-off between read and write performance. To maintain a B-Tree’s self-balancing structure, operations such as inserts, deletes, or updates may trigger node splits or merges. This overhead ensures efficient reads but comes at the cost of slower writes, especially in tables with multiple indexes, where every change must be reflected across all relevant index structures.

HOT Update

Given the immutability of tuples, when an update occurs, a new tuple is created. However, we don’t need to necessarily update all associated indices for every update.

An update can be treated as a HOT (Heap-Only Tuple) and does not require index manipulation if:

  • The update doesn’t modify indexed columns.
  • And the new tuple remains on the same page as the original tuple.

Tuple Chaining

When multiple versions of a tuple reside on the same page, they are chained together. Each tuple carries metadata, such as its transaction state and a pointer to the next version. This chain ensures queries can resolve to the correct visible version of a tuple.

For example, deleted tuples may include a pointer to the next valid tuple, allowing the system to traverse and retrieve the most up-to-date version.

IndexTID (index = 1)LPATuplesPointer (index = 1)(State = DELETED, Id = 1, Name = Jnho)(State = DELETED, Id = 1, Name = John)(State = NORMAL, Id = 1, Name = Johnny)

As old tuple versions become obsolete (no longer visible to any active transaction), their space is reclaimed. The line pointer in the page is then updated to reference the latest valid tuple. This cleanup happens without requiring changes to the index itself, keeping lookup operations consistent.

IndexHeap(Offset = 1)Heap (version 1)Heap (version 2)Heap (version 3)LPATuplesLPATuplesLPATuplesPointer (Offset = 1)(State = DELETED, Id = 1, Name = Jnho)(State = DELETED, Id = 1, Name = John)(State = NORMAL, Id = 1, Name = Johnny)Pointer (Offset = 1)...Removed(State = DELETED, Id = 1, Name = John)(State = NORMAL, Id = 1, Name = Johnny)Pointer (Offset = 1)...Removed...Removed(State = NORMAL, Id = 1, Name = Johnny)CleanedCleaned

Secondary vs Clustered Index

In this explanation, we’ve assumed records are stored in a disorganized heap and that indexes exist outside the table data. These are known as Secondary Indexes.

However, in some database engines (like MySQL InnoDB), tables are physically organized based on their primary key, also called Clustered Index. Queries using the primary key access the table directly, no secondary index needed for these lookups.

Last updated on