NoSQL Database

SQL has long been the primary choice for data storage. Its general-purpose design makes it suitable for many use cases, but not all. To address the rapid growth of data and its diverse requirements, many alternative systems have been developed that outperform SQL in specific scenarios. These are collectively referred to as NoSQL (Not Only SQL).

SQL relies on the unified Relational Model, while NoSQL encompasses various types of databases with different data models tailored to specific purposes.

NoSQL Characteristics

Despite having diverse data models, NoSQL databases often share several common characteristics.

Schemaless

NoSQL databases are typically schemaless, meaning they don’t require predefined data schemas. This flexibility is possible because:

  • Some systems treat data abstractly as a sequence of binary values, without enforcing any structure.
  • Others infer schema dynamically from the data as it’s inserted, rather than requiring it upfront.

This makes NoSQL especially useful for handling flexible or irregularly structured data, something that SQL struggles with.

For example, consider customer records with varying attributes.

  • In SQL, we need to define a wide table with many columns, which often results in storing numerous NULL values (NULL still takes up storage space).
  • Adding a new attribute requires altering and restructuring the table, which is inefficient.

In a schemaless store, by contrast, each record keeps only the attributes it actually has:

  id: 1, name: Mike
  id: 2, income: 10000, job: developer
  id: 3, name: Steve, school: ABC University, gpa: 3.5

Horizontal Scalability

NoSQL databases are designed for horizontal scalability. They avoid complex relationships and instead rely on sharding to distribute data across multiple servers.

For example, user records can be split across different servers:

  Server 1: {id: 1, name: Mike}
  Server 2: {id: 2, income: 10000, job: developer}
  Server 3: {id: 3, name: Steve, school: ABC University, gpa: 3.5}
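
The routing behind sharding can be sketched as hashing each record's key to pick a server. The server names and the simple modulo scheme below are illustrative assumptions; production systems typically use consistent hashing to reduce data movement when nodes are added or removed.

```python
import hashlib

SERVERS = ["server1", "server2", "server3"]  # hypothetical node names

def shard_for(key: str) -> str:
    """Map a record key to a server by hashing, spreading data evenly."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return SERVERS[int(digest, 16) % len(SERVERS)]
```

Because every client computes the same mapping, no central lookup table is needed to locate a record.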

No Relationships

NoSQL databases generally do not support relational features such as foreign keys or joins.

This design choice stems from their distributed architecture: data is often spread across multiple servers, and attempting to join records stored on different nodes can lead to performance bottlenecks and reduced availability.

When using NoSQL, we need to shift our data modeling mindset from normalization to denormalization: the goal is to create self-contained records that are fully queryable from a single server. We’ll explore this concept further in the Document Store section.

Eventual Consistency

Many NoSQL databases prefer Eventual Consistency, trading off strict consistency for better availability and performance. They often maintain multiple replicas that converge over time.
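
One common convergence strategy is last-write-wins, sketched below with each replica modeled as a map of key to (timestamp, value). This is a deliberate simplification: real systems often use vector clocks or CRDTs to resolve concurrent writes safely.

```python
def merge_lww(local: dict, remote: dict) -> dict:
    """Merge two replicas of {key: (timestamp, value)}, keeping the newest write per key."""
    merged = dict(local)
    for key, (ts, value) in remote.items():
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, value)
    return merged

replica_a = {"name": (1, "Mike")}
replica_b = {"name": (2, "Michael"), "job": (1, "developer")}
```

Exchanging state in either direction yields the same result, which is exactly the convergence property eventual consistency relies on.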

No ACID

NoSQL databases are designed for high availability and fault tolerance but generally do not fully support ACID transactions.

In particular, maintaining the Isolation property across multiple nodes is prohibitively expensive. Ensuring strict transactional behavior would require constant coordination and message exchange over the network, leading to a complex, tightly coupled system that contradicts NoSQL’s design goals.

As a result, NoSQL is generally not well-suited for applications that demand strong consistency and strict transactional guarantees.

Key-value Store

We start with the simplest model: the Key-Value Store.

In this model, each record is identified by a unique key, and accessed via two basic operations:

  • Put(key, value)
  • Get(key) => value

This is similar to building a Hash Table that runs as a shared, standalone service.
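
A minimal in-process sketch of that idea (real stores add networking, persistence, and eviction policies):

```python
class KeyValueStore:
    """A toy key-value store: a hash table behind Put/Get."""

    def __init__(self):
        self._table = {}

    def put(self, key, value):
        self._table[key] = value

    def get(self, key):
        return self._table.get(key)

store = KeyValueStore()
store.put("cache_page_0", "<html>This is page 0</html>")
```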

Example:

Key-value Store contents:

  cache_page_0: "<html>This is page 0</html>"
  user_0_devices: "['apple_998', 'android_110']"

A service issues Get('cache_page_0') and receives '<html>This is page 0</html>' in response.

Use Cases

Key-value stores are ideal when data naturally fits the key-value model. Conversely, they should not be used for non-key queries such as aggregations.

Because they are typically backed by in-memory hash tables, they perform extremely fast key-based lookups. That makes them perfect for use cases like distributed caching.

Document Store

A Document Store organizes data around documents, each one representing a single record, typically in JSON format.

Student Document 1:

{
  "id": "student_a",
  "name": "Student A"
}

Student Document 2:

{
  "id": "student_b",
  "name": "Student B",
  "class_id": "class_a",
  "gpa": 8
}

Similar documents are grouped into collections, making management and retrieval simpler. For example, student_collection and room_collection:

student_collection:

{
  "id": "student_a",
  "name": "Student A"
}
{
  "id": "student_b",
  "name": "Student B",
  "class_id": "class_a",
  "gpa": 8
}

room_collection:

{
  "id": "class_a",
  "position": "4th floor"
}

At a high level, documents are conceptually similar to SQL rows, and collections resemble tables. However, there are some key differences:

  • Documents within the same collection can have different schemas. For example, student_b has more fields than student_a. This schema flexibility means structural changes to one document do not impact the rest of the collection.

  • There’s no native concept of relationship. Instead, references are made using plain values, e.g., student_b refers to class_a by storing its id as a plain string.

Data Denormalization

Document Stores are typically distributed: documents within the same collection may reside on different nodes. In such an environment, joining records would require querying multiple servers, which can severely impact performance and availability.

For example, imagine we want to track student registrations across multiple classes. In a relational model, this data is usually normalized into two separate tables: student and registration. To obtain a student’s class registrations, we would join these tables using a shared key. However, executing such joins often requires scanning multiple servers to gather all relevant records, making complex queries resource-intensive and costly.

  Server 1: student (id: stu123, name: Mike)
  Server 2: registration (student_id: stu123, class_id: class123)

In a Document Store, however, we often embed related data to keep records self-contained. Instead of splitting into two collections, we can include a list of class IDs directly within the student document:

student:

{
  "id": "stu123",
  "name": "Mike",
  "classIds": ["class123", "class234"]
}

This design makes reads more efficient in distributed environments by ensuring that read requests are directed to a single server. However, a single value may appear repeatedly across multiple documents, so updates must be applied to each occurrence individually, an operation that can be both costly and error-prone.
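
The trade-off is visible in code. Below, an in-memory list stands in for a student collection with embedded class IDs (the records and helper functions are illustrative): reading is a single-document lookup, while updating a duplicated value must visit every document that embeds it.

```python
# Embedded (denormalized) student documents; class IDs are duplicated across students.
students = [
    {"id": "stu123", "name": "Mike", "classIds": ["class123", "class234"]},
    {"id": "stu124", "name": "Anna", "classIds": ["class123"]},
]

def classes_of(student_id: str):
    """One-record read: everything needed lives in a single document."""
    doc = next(s for s in students if s["id"] == student_id)
    return doc["classIds"]

def rename_class(old_id: str, new_id: str):
    """An update must visit every document that embeds the value."""
    for doc in students:
        doc["classIds"] = [new_id if c == old_id else c for c in doc["classIds"]]
```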

Value-Based Search

Under the hood, many Document Stores use storage structures similar to those in SQL, such as Heap and B-tree. This enables indexing not just on document ids, but also on arbitrary fields.

For example, consider the following document. We can perform fast queries on any indexed field, such as name or gpa.

{
  "id": "stu123",
  "name": "Mike",
  "gpa": 8.4
}

Still, there’s no magic involved: when querying by a non-key field, the system needs to scan multiple servers internally, because it does not know in advance where the matching records are stored.

Use Cases

In a way, a Document Store feels like a more flexible, schema-relaxed version of SQL. Document Stores are particularly effective when:

  • Records have complex and variable structures, such as product catalogs with diverse attributes.
  • We need searchable fields beyond primary keys, enabling rich queries based on values.

Column-Oriented Store

Row-Oriented Model

In SQL databases, data is typically stored in a row-oriented layout, where each complete row is stored contiguously on disk. For example, a simple student table in an SQL database might look like this:

student1, Steve, 3.5
student2, Mike, 3.0
student3, John, 2.5

This format is ideal for scenarios where most or all columns of a row are frequently accessed together.

However, it’s inefficient for analytical queries that require only a subset of columns. For example, calculating the average GPA involves reading entire rows into memory, even though we only need the third column. This results in excessive I/O and wasted memory due to unnecessary data loading.

ℹ️
You can refer to the Physical Layer of SQL for more insights on this.

Column-Oriented Model

A Column-Oriented Store takes the opposite approach: it groups and stores data by column instead of by row. Rewriting the student data in columnar format:

student1, student2, student3
Steve, Mike, John
3.5, 3.0, 2.5

Each column is stored as a contiguous block of values. So, when calculating the average GPA, the database only reads the third block, skipping irrelevant data and reducing I/O overhead and memory usage.
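
The difference can be demonstrated directly. In the row layout, computing the average GPA loads whole records; in the columnar layout, it reads only the gpa block:

```python
# Row-oriented: each record stored together.
rows = [
    ("student1", "Steve", 3.5),
    ("student2", "Mike", 3.0),
    ("student3", "John", 2.5),
]

# Column-oriented: each column stored as a contiguous block.
columns = {
    "id": ["student1", "student2", "student3"],
    "name": ["Steve", "Mike", "John"],
    "gpa": [3.5, 3.0, 2.5],
}

# Row layout: must touch every full row to reach the third field.
avg_row = sum(r[2] for r in rows) / len(rows)

# Column layout: reads only the gpa block.
avg_col = sum(columns["gpa"]) / len(columns["gpa"])
```

On disk, the row version forces the database to pull ids and names through the I/O path even though the query never uses them.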

Use Cases

Column-Oriented Stores are well-suited for analytical workloads, particularly those in the Online Analytical Processing (OLAP) domain. These workloads typically involve reading a few columns across a large number of rows, perfect for columnar optimization.

However, this design is not ideal for Online Transaction Processing (OLTP) use cases, where full row access is frequent. For example, updating a single record in a table with 100 columns might require 100 separate I/O operations, since each column is stored in a different location.

Due to this overhead, column stores generally:

  • Perform best on read-heavy, analytical use cases.
  • Recommend batch imports instead of row-by-row inserts.

Column-Family Store

A Column-Family (CF) store organizes data by grouping related columns together. Each data row is represented as a set of columns, called a column family.

Let’s clarify the concept with a comparison.

In SQL, data is stored row by row according to a strict schema. To access a student’s GPA, we refer to the value in the third column of the row:

student1, Steve, 3.5, England
student2, NULL, 3, Vietnam

In contrast, a Column-Family Store treats each row as a flexible map of key-value pairs. Column names are explicitly stored with each row:

row-1:
  id: student1
  name: Steve
  gpa: 3.5
  country: England

row-2:
  id: student2
  gpa: 3
  country: Vietnam
  dob: 09-12-2000

Benefits of Column Families

  • No NULLs: There is no need to store NULL values.
  • Schema Flexibility: Adding or removing columns affects only the relevant rows, not the whole table.

Log-Structured Merge Trees (LSM)

Column-Family Stores are typically backed by a Log-Structured Merge Tree (LSM) architecture, optimized for high write throughput and large-scale storage.

Memory Layer

In memory, LSM combines two principles:

  1. Write-Behind Caching: Changes are temporarily stored in an in-memory MemTable. When the MemTable reaches a size threshold, it is flushed to disk to save data.
  2. Write-Ahead Logging (WAL): Writes are immediately logged to disk via a WAL for durability. If the system crashes before flushing, the WAL ensures no data is lost.
  Client → MemTable (1. update in memory)
  Client → WAL on the hard drive (2. log the operation)
  MemTable → disk (flushed periodically)
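
The write path above can be sketched as follows. The WAL and the flushed SSTables are modeled as in-memory lists purely for illustration; a real engine appends to a log file and syncs it before acknowledging the write.

```python
class MemTableWithWAL:
    """Write-path sketch: log first (durability), then buffer in memory."""

    def __init__(self, flush_threshold=2):
        self.wal = []            # stand-in for an append-only log file
        self.memtable = {}
        self.flushed = []        # stand-in for SSTables on disk
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.wal.append((key, value))      # log the operation for crash recovery
        self.memtable[key] = value         # update in memory
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Persist the buffered writes as one sorted run, then reset.
        self.flushed.append(dict(sorted(self.memtable.items())))
        self.memtable.clear()
        self.wal.clear()                   # log no longer needed after a flush
```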

Storage Layer

In the storage layer, LSM structures data into multiple levels:

  • Each level consists of several immutable Sorted String Tables (SSTables).
  • An SSTable stores sorted key-value pairs, allowing fast binary search for efficient lookups.
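
Since an SSTable’s keys are sorted, a lookup is a binary search over the key array; a sketch using Python’s bisect module:

```python
import bisect

class SSTable:
    """Immutable sorted run of key-value pairs, searchable in O(log n)."""

    def __init__(self, pairs):
        pairs = sorted(pairs)
        self.keys = [k for k, _ in pairs]
        self.values = [v for _, v in pairs]

    def get(self, key):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None

table = SSTable([("c", 50), ("a", 100), ("b", 2000)])
```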

Consider an example of an LSM tree with two levels, where records are stored as key=value pairs. In this structure, it’s possible for the same key to appear across multiple levels, each representing a different version of the record.

Level 1:
  SSTable 0: a=100, b=2000, c=50

Level 2:
  SSTable 0: a=120, b=2000, c=100
  SSTable 1: d=50, e=60

Compaction & Merging

In the background, periodic compaction pushes data down to deeper levels. This design enables extremely fast writes: updates are first written to the memory layer, then flushed and reorganized asynchronously in the storage layer.

Let’s walk through an example to understand how this process works in practice:

Step 1: Flushing MemTable

Once the MemTable fills up, it’s flushed as a new SSTable into Level 1.

MemTable: a=110, d=70  (flushed)

Level 1:
  SSTable 0 (existing): a=100, b=2000, c=50
  SSTable 1 (newly flushed): a=110, d=70

Level 2:
  SSTable 0: a=120, c=2000

Step 2: Merging Within Level 1

Level 1’s SSTables are compacted and merged to discard duplicates.

For instance, the previous insertion triggers merging at Level 1.

  • Level 1 merges between SSTable 0 and SSTable 1.
  • SSTable 1 is newer and overwrites SSTable 0.
Level 1:
  SSTable 0 (created 00:01): a=100, b=2000, c=50
  SSTable 1 (created 00:02): a=110, d=70
  → Merged SSTable: a=110, b=2000, c=50, d=70

Level 2:
  SSTable 0: a=120, c=2000

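The merge step itself amounts to a union in which the newer table’s entries win; a sketch (a real engine streams two sorted files rather than materializing dictionaries):

```python
def merge_sstables(older: dict, newer: dict) -> dict:
    """Compact two SSTables: keep each key once, newer version wins."""
    merged = dict(older)
    merged.update(newer)   # duplicate keys overwritten by the newer values
    return dict(sorted(merged.items()))

sstable_0 = {"a": 100, "b": 2000, "c": 50}   # created 00:01
sstable_1 = {"a": 110, "d": 70}              # created 00:02
```
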
Step 3: Level Promotion

If Level 1 becomes full, its data is promoted to Level 2. Level 2 similarly merges its SSTables to discard duplicates.

Level 1: (emptied, its SSTable moved to Level 2)

Level 2:
  SSTable 0: a=120, c=2000
  SSTable 1 (newer, moved from Level 1): a=110, b=2000, c=50, d=70
  → Merged SSTable: a=110, b=2000, c=50, d=70

The final result looks clean: Level 2 is the only level containing data.

MemTable: (empty)

Level 1: (empty)

Level 2:
  SSTable 0: a=110, b=2000, c=50, d=70

Why do we need levelling?

Each level holds increasingly larger and older data sets. By prioritizing searches in shallower levels, LSM achieves cache-like behavior, where recently written data is accessed faster.

  Level 1 → Level 2 → ... → Level N (older and larger)
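
That search order can be sketched as a read path that checks the MemTable first, then each level from shallowest to deepest, returning the first hit, which is by construction the newest version:

```python
def lsm_get(key, memtable: dict, levels: list):
    """Search newest-to-oldest: MemTable, then Level 1..N."""
    if key in memtable:
        return memtable[key]
    for level in levels:                 # shallower levels hold newer data
        for sstable in reversed(level):  # newer SSTables within a level first
            if key in sstable:
                return sstable[key]
    return None

memtable = {"a": 110}
levels = [
    [{"b": 2000}],             # Level 1 (newer, smaller)
    [{"a": 100, "c": 50}],     # Level 2 (older, larger)
]
```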

Use Cases

Column-Family Stores are highly suitable for:

  • Write-heavy workloads.
  • Key-based access patterns.

However, they’re not ideal for:

  • Write-once workloads: Constantly unique data leads to deep LSM trees with minimal compaction benefits.
  • Value-based queries: Since data is indexed by keys, querying non-key fields is inefficient.

Search Engine

We now arrive at the final type of data store in this discussion: the Search Engine. As its name implies, a Search Engine is purpose-built to support search operations, especially text-based searches.

Inverted Index

At its core, a search engine relies on the Inverted Index data structure.

Traditional indices map from keys to values. For instance, in a student database:

fromIdToName:
  student01: Mike
  student02: John
  student03: Mike

An Inverted Index reverses the usual direction of key-value mapping. Instead of mapping from a unique key to a value, it maps from a value (which is often not unique) back to one or more keys:

fromNameToIds:
  Mike: ["student01", "student03"]
  John: ["student02"]

This reversal is especially powerful for text search, as it allows us to quickly find where a given word or value appears.
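
Building an inverted index is simply grouping keys by value; a sketch:

```python
from collections import defaultdict

def invert(from_id_to_name: dict) -> dict:
    """Flip a key-to-value map into a value-to-[keys] map."""
    index = defaultdict(list)
    for record_id, name in from_id_to_name.items():
        index[name].append(record_id)
    return dict(index)

from_id_to_name = {"student01": "Mike", "student02": "John", "student03": "Mike"}
```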

Full-Text Search

In a conventional database, search operations are performed against raw fields as inserted.

Take the example of an inverted index of books:

fromTitleToIds:
  Little Prince: ["book01"]
  Little Women: ["book02"]

To search books by title, we’d need to scan through all the title indices with basic matching tools like SQL’s LIKE, which is inefficient.

Full-Text Store

To enable fine-grained search, search engines tokenize text fields into individual terms, indexing each one separately. These terms and their mappings form the Full-text Store, a type of inverted index.

terms:
  Little: ["book01", "book02"]
  Prince: ["book01"]
  Women: ["book02"]

The underlying data:

  book01: { title: "Little Prince" }
  book02: { title: "Little Women" }

Now, searching for a term like Prince instantly returns all associated records via the inverted index, no scanning required.
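
Building that term index and answering a search can be sketched as follows, using the book data from the example above (whitespace tokenization is a simplifying assumption):

```python
from collections import defaultdict

def build_fulltext_index(docs: dict) -> dict:
    """Map each term to the set of documents whose title contains it."""
    terms = defaultdict(set)
    for doc_id, title in docs.items():
        for term in title.split():
            terms[term].add(doc_id)
    return terms

docs = {"book01": "Little Prince", "book02": "Little Women"}
index = build_fulltext_index(docs)
```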

Text Analysis

Search engines include a component called an Analyzer, which splits and processes text into consistent, searchable terms.

Insertion Analysis

When data is inserted, it is passed through an analyzer which tokenizes and transforms it (e.g., lowercasing, removing punctuation):

  Insert: "Little Prince" → Analyzer → [little, prince]

Little Prince becomes [little, prince], ready for indexing.

Query Analysis

The same analyzer is applied to search queries to ensure consistency.

For example, if a user searches for litTLe, which may not exactly match the version stored in the index, the analyzer needs to process both the query and the stored data in the same way, producing a consistent value.

  Search: "litTLe" → Analyzer → "little"
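
A minimal analyzer, applied identically at insert time and at query time, can be sketched as follows (lowercasing plus alphanumeric tokenization is an illustrative choice):

```python
import re

def analyze(text: str) -> list:
    """Normalize text into searchable terms: lowercase, strip punctuation, split."""
    return re.findall(r"[a-z0-9]+", text.lower())

def index_title(index: dict, doc_id: str, title: str):
    for term in analyze(title):          # same analyzer at insert time...
        index.setdefault(term, set()).add(doc_id)

def search(index: dict, query: str) -> set:
    results = set()
    for term in analyze(query):          # ...and at query time
        results |= index.get(term, set())
    return results

index = {}
index_title(index, "book01", "Little Prince")
```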

Search engines can also use advanced analyzers to handle synonyms, stemming, or multiple languages, further enriching the search experience.

Use Cases

Search engines are typically not used as the primary database. Reasons include:

  • High storage overhead: Indexing every term from raw text can consume significant space, often exceeding the size of the data itself.
  • Limited non-text capabilities: Operations like aggregations, transactional updates, or structured queries are often better handled by traditional databases.

Instead, they shine as search satellites: query-optimized systems layered on top of primary databases like SQL or Document Stores:

  Primary Database → (replicate) → Search Engine

In this architecture, changes are replicated to the search engine, enabling full-text search without compromising the performance or integrity of the main database.

Other NoSQL Databases

Beyond the stores we’ve covered, there are still many other types of NoSQL databases, such as Graph, Time Series, and more.

While we won’t explore all of them here, here’s a useful framework you can use when learning a new type of database:

  • What problem does it aim to solve? Every database emerges to address specific challenges. Understanding the motivation behind its design is essential.

  • What is its data model? Is it graph-based, time-based, or something else? How is the model implemented in practice?

  • What are its typical use cases? Consider what kinds of applications benefit most from this database. That often reveals its strengths and limitations.
