Database Optimization: Sharding

Database sharding is the process of horizontally partitioning a dataset across multiple independent database instances. The fundamental objective is to eliminate hardware utilization caps on a single node.

  1. Pain Point: Contention and Resource Exhaustion
    The essence of why sharding is required lies in resource saturation:
  • B-Tree Depth: As row counts increase, the index depth increases, leading to logarithmic growth in I/O operations per query.
  • Lock Contention: Concurrent transactions competing for the same page-level or table-level locks create serialization bottlenecks, reducing throughput.
  • Monolithic I/O: All write operations must eventually pass through a single write-ahead log (WAL), creating a physical I/O limit.
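To make the logarithmic-growth claim concrete, here is a small sketch of how B-tree depth, and therefore page reads per lookup, grows with row count. The fan-out of 200 keys per index page and the `btree_height` helper are illustrative assumptions, not any engine's actual numbers:

```python
import math

def btree_height(rows: int, fanout: int = 200) -> int:
    """Approximate B-tree height: ceil(log_fanout(rows)).

    Each level costs roughly one page read, so lookup I/O grows with
    the logarithm of the row count. The fan-out of 200 keys per page
    is an assumed, illustrative value.
    """
    return max(1, math.ceil(math.log(rows, fanout)))

# 10_000 rows      -> height 2
# 1_000_000 rows   -> height 3
# 100_000_000 rows -> height 4
```

Note how a 10,000x increase in rows only adds two levels: the growth is slow, but on a single saturated node even one extra page read per query compounds into real I/O pressure.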
  2. Sharding

a. Sharding key
The sharding key is the partitioning descriptor that determines the physical location of a data row.

  • It is a mapping function in which a high-entropy attribute is used to distribute load evenly. If the key is not chosen based on access patterns, hotspots are created.
  • The sharding key transforms a global O(N) search problem into a localized O(1) routing problem. Without the sharding key in the predicate (the WHERE clause), the system loses its distributed advantage and reverts to a broadcast model.
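The routing step can be sketched in a few lines (a minimal hash-based scheme; the shard count of 8 and the `shard_for` helper are illustrative assumptions, not any particular system's API):

```python
import hashlib

NUM_SHARDS = 8  # assumed fixed shard count for this sketch

def shard_for(key: str) -> int:
    """Route a row to a shard by hashing its sharding key.

    A deterministic hash spreads high-entropy keys evenly across the
    shards, turning lookup-by-key into O(1) routing: hash, modulo, done.
    """
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS
```

A query whose WHERE clause supplies the key calls this function once and touches exactly one node; a query without the key must be broadcast to all `NUM_SHARDS` nodes.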

b. Shared-Nothing Architecture: The Resource Logic
“Shared-nothing” means each node in the cluster is self-sufficient and independent.

  • No two nodes share the same RAM or disk. There is no central “master” disk that all nodes access.
  • This eliminates single points of contention: when a new node is added, 100% of its CPU, RAM, and disk I/O capacity is added to the cluster, so resources scale linearly.

c. Two-Phase Commit (2PC): The Atomic Logic
2PC is essential when a transaction needs to update data on multiple shards simultaneously.

  • Why: if Node A succeeds in updating a balance but Node B fails, the database is corrupted.
  • How:
    • Prepare Phase: the coordinator asks all shards, “Can you commit?” Shards lock the resources and answer “Yes”.
    • Commit Phase: only if everyone said yes, the coordinator sends the “Commit” command.
  • The Catch: unlike sharding-key routing, 2PC is a performance killer. It requires multiple network round-trips and holds locks on all involved shards until the very end.
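The two phases can be sketched with a toy in-memory model (the `Shard` class and `two_phase_commit` function are hypothetical names, not a real coordinator):

```python
class Shard:
    """A toy participant that can tentatively apply ("prepare") a change."""

    def __init__(self, name: str, will_succeed: bool = True):
        self.name = name
        self.will_succeed = will_succeed
        self.locked = False
        self.committed = False

    def prepare(self) -> bool:
        # Phase 1: acquire locks, do the work tentatively, vote yes/no.
        self.locked = True
        return self.will_succeed

    def commit(self) -> None:
        # Phase 2: make the change durable, then release the locks.
        self.committed = True
        self.locked = False

    def rollback(self) -> None:
        self.committed = False
        self.locked = False


def two_phase_commit(shards) -> bool:
    """Coordinator: commit only if every shard voted yes in prepare."""
    votes = [s.prepare() for s in shards]  # locks are held from here on...
    if all(votes):
        for s in shards:
            s.commit()                     # ...until the global commit
        return True
    for s in shards:
        s.rollback()                       # ...or the global abort
    return False


ok = two_phase_commit([Shard("A"), Shard("B")])
failed = two_phase_commit([Shard("A"), Shard("B", will_succeed=False)])
```

Notice the cost baked into the structure: locks taken in `prepare` stay held across a network round-trip until the final `commit` or `rollback`, which is exactly why 2PC hurts throughput.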
  3. Mechanics of Distributed Query Execution (on a Sharded System)
    When a query is issued, the execution engine performs Query Decomposition:
  • Plan Fragmentation: the engine breaks the SQL into “fragments” that can be executed independently by shards.
  • Predicate Pushdown: the engine pushes filters and aggregations down to the shard level to minimize network data transfer.
  • Result Set Merging: The coordinator node performs final operations (like SUM or SORT) on the partial results returned by shards.
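The three steps above can be simulated with in-memory “shards” (a toy sketch; real engines ship serialized plan fragments over the network, not Python functions):

```python
# Two toy "shards", each holding (region, amount) rows locally.
shards = [
    [("eu", 10), ("us", 5)],
    [("eu", 7), ("apac", 3)],
]

def fragment(shard, region):
    """The plan fragment a shard runs locally: the filter (predicate
    pushdown) plus a partial SUM, so only one number crosses the network."""
    return sum(amount for r, amount in shard if r == region)

def distributed_sum(region):
    partials = [fragment(s, region) for s in shards]  # scatter fragments
    return sum(partials)                              # coordinator merges
```

For `distributed_sum("eu")`, each shard filters and pre-aggregates its own rows and ships back a single partial sum, which the coordinator merges with a final SUM.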
  4. Distinction Between Sharding and Partitioning
  • Partitioning (Logical): the division of a dataset within a single database instance. It improves manageability (e.g., dropping an old month’s data) but does not solve hardware resource exhaustion (CPU/RAM).
  • Sharding (Physical): the division of a dataset across multiple independent nodes. It is a shared-nothing architecture designed to scale hardware resources linearly.
  5. PostgreSQL Implementation: Citus and FDW
    PostgreSQL approaches sharding through extensibility.
  • postgres_fdw (Foreign Data Wrapper): provides the interface for cross-node communication. It maps local tables to remote instances using the SQL/MED standard.
  • Citus: an extension that implements a distributed query planner. It intercepts SQL commands, decomposes them into fragments, and distributes them to worker nodes. Its essence is metadata management: keeping a synchronized map of where every shard resides.
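That metadata map can be pictured as a simple catalog (Citus keeps the real thing in its own catalog tables and keeps it synchronized across the cluster; the dict, shard count, and node names below are purely illustrative):

```python
# Hypothetical metadata catalog: (table, shard id) -> worker node.
shard_map = {
    ("events", 0): "worker-1:5432",
    ("events", 1): "worker-2:5432",
}

NUM_SHARDS = 2  # assumed shard count for this sketch

def locate(table: str, key_hash: int) -> str:
    """Resolve which worker holds the row for a hashed sharding key."""
    return shard_map[(table, key_hash % NUM_SHARDS)]
```

The planner consults this map for every fragment it ships, which is why keeping the map consistent across nodes is the heart of the design.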
  6. Industry-Leading Implementations:
  • TiDB: its essence is the abstraction layer. It separates compute (TiDB) from storage (TiKV) and uses the Raft consensus algorithm to manage data “Regions”, providing automatic re-sharding and high availability without manual intervention.
  • CockroachDB: its core is strict serializability. It uses a monolithic sorted key-value map as the underlying storage, partitioned into 64 MB “ranges”.
  • Azure Cosmos DB: the Global Distribution Model.
    Cosmos DB utilizes a specific implementation of sharding called physical partitions.
    • It decouples logical partitions (defined by the partition key) from physical partitions (the underlying compute/storage buckets).
    • Request Units (RU): Cosmos DB abstracts hardware into throughput units. The sharding logic is entirely managed by the platform, which dynamically splits physical partitions as data grows or throughput demand increases.
  7. Constraints Inherent to Sharding:
  • The join problem: a hash join between two tables stored on different physical shards cannot be performed without pulling all the data to a single coordinator (which overwhelms the coordinator).

    • Best practice: table co-location. Ensure that related tables use the exact same sharding key and sharding algorithm so that related rows always land on the same physical node.
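Co-location amounts to both tables sharing one routing function (a minimal sketch; the helper and shard count are illustrative assumptions, not a specific system's API):

```python
import hashlib

NUM_SHARDS = 8  # assumed shard count; must be identical for both tables

def shard_for(key: str) -> int:
    """The single routing function shared by BOTH tables."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# customers is sharded by customer_id, and orders is sharded by its
# customer_id column as well (not by order_id). A customer's row and
# all of that customer's orders therefore land on the same node, so
# the join between them never crosses the network.
customers_node = shard_for("1001")  # customers row, customer_id = 1001
orders_node = shard_for("1001")     # orders rows, customer_id = 1001
```

Had orders been sharded by order_id instead, the two tables would route independently and most joins would become cross-shard.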
  • Global Consistency vs Latency.
    This is the most difficult technical constraint in distributed systems.

    • The Problem: the “Snapshot” challenge
      Imagine a bank with 100 shards; we want a snapshot of the total balance of all accounts at exactly 12:00:00 PM.

      • The technical conflict: because of network jitter, “12:00:00” might arrive at Shard 1 a few milliseconds earlier than at Shard 100. If a transfer is happening between them during those milliseconds, the snapshot will be wrong (money will either be doubled or missing).
      • The global timestamp: to fix this, a distributed system needs a Global Transaction Manager or a global clock (like Google Spanner’s TrueTime) to give every transaction a strictly ordered timestamp.
      • The latency cost: every shard must “check in” with the global clock or coordinator before it can finish a write. This check-in adds network delay (latency) to every single transaction.
    • Best practice: use eventual consistency for non-critical data. If the business logic allows, use asynchronous replication for secondary indexes to keep the write path fast. This trades “perfect accuracy at any given millisecond” for “high-speed writes”, removing the need for the global-timestamp bottleneck and thereby reducing latency.
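The snapshot hazard can be simulated in a few lines: two toy shards, one in-flight transfer, and a snapshot that reaches the shards a moment apart:

```python
# Balances before the snapshot: account A on shard 1, B on shard 100.
shard_1 = {"A": 100}
shard_100 = {"B": 100}

# The snapshot request reaches shard 1 first and records its state...
snap = dict(shard_1)

# ...then a transfer of 50 from A to B lands on both shards...
shard_1["A"] -= 50
shard_100["B"] += 50

# ...and only now does the snapshot reach shard 100.
snap.update(shard_100)

# The real total never changed (200), but the skewed snapshot reports
# 250: the 50 was counted on shard 1 before the debit and counted
# again on shard 100 after the credit.
snapshot_total = sum(snap.values())
true_total = sum(shard_1.values()) + sum(shard_100.values())
```

A globally ordered timestamp prevents this by forcing both shards to report their state "as of" the same transaction boundary, at the latency cost described above.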

  8. Read Replicas

Sharding is a “last resort” due to its operational complexity.

If the bottleneck is purely read-heavy, asynchronous replication (read replicas) is the superior alternative. It avoids the complexity of data partitioning and distributed transactions while scaling read throughput linearly.

Read replicas provide an exact copy of the data. Since reads do not change state, they can be distributed across N nodes without the need for complex sharding keys or distributed transaction coordination. This scales linearly until the write volume exceeds the capacity of the primary node.
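The read/write split can be sketched as a tiny router (the `Router` class and node names are hypothetical; in practice this logic lives in a proxy or the application's database driver):

```python
import itertools

class Router:
    """Toy router: writes go to the primary, reads round-robin over replicas."""

    def __init__(self, primary: str, replicas: list[str]):
        self.primary = primary
        self._reads = itertools.cycle(replicas)  # round-robin over N replicas

    def node_for(self, sql: str) -> str:
        # Crude classification by leading SQL verb, enough for a sketch.
        verb = sql.lstrip().split()[0].upper()
        if verb in {"INSERT", "UPDATE", "DELETE"}:
            return self.primary
        return next(self._reads)
```

Adding a replica grows read capacity with no change to the data model, which is why this is tried before sharding; it does nothing for write throughput, since every write still funnels to the single primary.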