Physical Layer
The way data is organized at the physical layer drives the entire database workflow.
Page
A page is a fixed-size container (typically a few kilobytes, e.g., PostgreSQL defaults to 8kB) holding multiple data entries, which can be either rows or index records.
A page generally consists of three parts: Header, Tuples, and a Line Pointer Array (LPA).
Header
The header primarily holds metadata for data integrity, recovery, and other management purposes. For now, we don’t need to worry about its details.
Tuple
The actual data resides in the Tuples section as sequentially packed records.
We call them tuples rather than records because tuples are immutable. To update a tuple, the database creates a new version, essentially an overriding tuple.
For example, the third tuple is an updated version of the first one.
There are two main reasons for the immutability:
A new tuple might be larger or smaller than the existing one (for example, a resizable VARCHAR field). If we tried to update it in place, subsequent tuples would need to shift to accommodate size changes, much like deleting an element from an array, which can be costly.
Header (Metadata) Tuples Old Tuples Updated Tuples (ID=1, Name="John") (ID=3, Name="Charlie") (ID=1, Name="John Doe") (ID=3, Name="Charlie") Moved Increased To support Multi-version Concurrency Control (MVCC), where transactions can access different versions of the same data concurrently. We’ll cover this more deeply in the Concurreny Control topic.
Line Pointer Array (LPA)
Between the header and the tuple data lies the Line Pointer Array (LPA), which holds fixed-size pointers to the actual tuples on the page. Each pointer also tracks the state of its tuple (e.g., Updated, Deleted).
Since tuple sizes can vary (primarily due to text fields), utilizing fixed-size pointers within the Line Pointer Array (LPA) guarantees fast, random access to any specific tuple. Retrieving a pointer in this manner is analogous to accessing an element in an array:
pointer[i] = page_address + pointer_size * (i - 1)
The primary function of the LPA is to maintain the stability of tuple references. Even if a row is relocated among different tuples within the page, its corresponding line pointer remains constant, ensuring consistent referencing.
Heap
A heap is simply a collection of pages representing a table. The term heap means essentially no structure.
When inserting a record into a heap, the database looks for the first available slot in the first available page. For example:
Tuple ID (TID)
External elements, such as other tables, reference data tuples using a Tuple ID (TID). A TID is a composite identifier consisting of: (page number, LPA index).
Fixed-size pages make random access efficient,
for example,
to retrieve the tuple at: TID (page = 2, index = 1)
;
The database can jump straight to it via: Heap.Pages[2].LPA[1] -> Tuple
.
Heap Search
However, a heap isn’t particularly helpful for searching. There are two effective ways to access a tuple in a heap:
- Using its TID.
- Scanning the entire heap (i.e., a full table scan) 🥲.
Index
Messy heaps aren’t ideal for fast tuple lookups, this is where the Index comes to the rescue. In short, an index is an auxiliary data structure built alongside a table to enable efficient data retrieval.
Unlike tables, indexes are organized using specific data structures tailored for quick lookups. In this section, we’ll focus on the most commonly used type: the B-Tree Index.
B-Tree Index
A B-Tree is an advanced form of a Binary Search Tree. Instead of storing one key per node like a binary tree, it groups multiple keys in a single node, increasing the branching factor and reducing the overall height of the tree.
For example, a B-Tree might store up to 5 elements per node. Each node either contains actual values or pointers to its child nodes.
In SQL , an index page corresponds to a B-Tree node. Each page can point either to child index pages or to actual table records via their TIDs. A lookup operation traverses this structure, starting from the root page, it moves down through child pages until it locates the desired tuplee.
Notably, using indexes introduces a trade-off between read and write performance. To maintain a B-Tree’s self-balancing structure, operations such as inserts, deletes, or updates may trigger node splits or merges. This overhead ensures efficient reads but comes at the cost of slower writes, especially in tables with multiple indexes, where every change must be reflected across all relevant index structures.
HOT Update
Given the immutability of tuples, when an update occurs, a new tuple is created. However, we don’t need to necessarily update all associated indices for every update.
An update can be treated as a HOT (Heap-Only Tuple) and does not require index manipulation if:
- The update doesn’t modify indexed columns.
- And the new tuple remains on the same page as the original tuple.
Tuple Chaining
When multiple versions of a tuple reside on the same page, they are chained together. Each tuple carries metadata, such as its transaction state and a pointer to the next version. This chain ensures queries can resolve to the correct visible version of a tuple.
For example, deleted tuples may include a pointer to the next valid tuple, allowing the system to traverse and retrieve the most up-to-date version.
As old tuple versions become obsolete (no longer visible to any active transaction), their space is reclaimed. The line pointer in the page is then updated to reference the latest valid tuple. This cleanup happens without requiring changes to the index itself, keeping lookup operations consistent.
Secondary vs Clustered Index
In this explanation, we’ve assumed records are stored in a disorganized heap and that indexes exist outside the table data. These are known as Secondary Indexes.
However, in some database engines (like MySQL InnoDB), tables are physically organized based on their primary key, also called Clustered Index. Queries using the primary key access the table directly, no secondary index needed for these lookups.