Distributed databases

Distributed databases in a nutshell

Hey I just met you, the network’s laggy

but here’s my data, so store it maybe

A distributed database is a distributed system designed to provide read/write access to data, using the relational or other format.

Splitting the data among nodes

We will examine 3 topics in distributed databases:

Replication: Keep a copy of the same data on several different nodes.
Partitioning: Split the database into smaller subsets and distributed the partitions to different nodes.
Transactions: Mechanisms to ensure that data is kept consistent in the database

Q: Usually, distributed databases employ both replication and partitioning at the same time. Why?

Replication

MySQL async replication

Why replicate?

With replication, we keep identical copies of the data on different nodes.

D: Why do we need to replicate data?

To allow the system to work, even if parts of it are down
To have the data (geographically) close to the clients
To increase read throughput, by allowing more machines to serve read-only requests

Replication Architectures

In a replicated system, we have two node roles:

Leaders or Masters: Nodes that accept writes from clients
Followers, Slaves or replicas: Nodes that provide read-only access to data

Depending on how replication is configured, we can see the following architectures

Single leader or master-slave: A single leader accepts writes, which are distributed to followers
Multi leader or master-master: Multiple leaders accept writes, keep themselves in sync, then update followers
Leaderless replication All nodes are peers in the replication network

How does replication work?

The general idea in replicated systems is that when a write occurs in node, this write is distributed to other nodes in either of the two following modes:

Synchronously: Writes need to be confirmed by a configurable number of followers before the leader reports success.
Asynchronously: The leader reports success immediately after a write was committed to its own disk; followers apply the changes in their own pace.

D: What are the advantages/disadvantages of each replication type?

Statement based replication in databases

The leader ships all write statements to the followers, e.g. all INSERT or UPDATE queries. However, this is problematic:

UPDATE foo
SET updated_at=NOW()
WHERE id = 10

D: What is the issue with the code above in a replicated context?

Statement based replication is non-deterministic and mostly abandoned.

Write-ahead log based replication

All databases write their data to data structures, such as \(B^+\)-trees. However, before actually modifying the data structure, they write the intended change to an append-only write-ahead log (WAL).

WAL-based replication writes all changes to the leader WAL and also to the followers. The followers apply the WAL entries to get consistent data.

The main problem with WAL-based replication is that it is bound to the implementation of the underlying data structure; if this changes in the leader, the followers stop working.

Postgres and Oracle use WAL-based replication.

Logical-based replication

The database generates a stream of logical updates for each update to the WAL. Logical updates can be:

For new records, the values that where inserted
For deleted records, their unique id
For updated records, their id and the updated values

MongoDB and MySQL use this replication mechanism.

Process for creating a replica

Take a consistent snapshot from the leader
Ship it to the replica
Get an id to the state of the leader’s replication log at the time the snapshot was created
Initialize the replication function to the latest leader id
The replica must retrieve and apply the replication log until it catches up with the leader

A real example: MySQL

Leader

> SHOW MASTER STATUS;
+--------------------+----------+
| File               | Position |
+--------------------+----------+
| mariadb-bin.004252 | 30591477 |
+--------------------+----------+
1 row in set (0.00 sec)

Follower

>CHANGE MASTER TO
  MASTER_HOST='10.0.0.7',
  MASTER_USER='replicator',
  MASTER_PORT=3306,
  MASTER_CONNECT_RETRY=20,
  MASTER_LOG_FILE='mariadb-bin.452',
  MASTER_LOG_POS= 30591477;

> SHOW SLAVE STATUS\G

Master_Host: 10.0.0.7
Master_User: replicator
Master_Port: 3306
Master_Log_File: mariadb-bin.452
Read_Master_Log_Pos: 34791477
Relay_Log_File: relay-bin.000032
Relay_Log_Pos: 1332
Slave_IO_Running: Yes
Slave_SQL_Running: Yes

Complications with async replication

Read-after-write: A client must always be able to read their own writes from any part of the system. Practically, the client can:
- Read recent writes from the leader
- Read session data from the leader
(non-) Monotonic reads: When reading from multiple replicas concurrently, a stale replica might not return records that the client read from an up to date one.
Causality violations: Violations of the happened-before property.

Async replication is fundamentally a consensus problem.

Multi-leader replication

The biggest problem with multi-leader replication are write conflicts. To demonstrate this, let’s think in terms of git:

                                # Clock
# User 1
git clone git://....            # t+1
git add foo.c                   # t+2
##hack hack hack
git commit -a -m 'Hacked v1'    # t+3
git push                        # t+5

# User 2
git clone git://....            # t+2
git add foo.c                   # t+5
##hack hack hack
git commit -m 'Hacked new file' # t+6
git push # fails                # t+7
git pull # CONFLICT             # t+7

If we replace user with leader node, we have exactly the same problem

How to avoid write conflicts?

One leader per session If session writes do not interfere (e.g., data are only stored per user), this will avoid issues altogether.
Converge to consistent state Apply a last-write-wins policy, ordering writes by timestamp (may loose data) or report conflict to the application and let it resolve it (same as git or Google docs)
Use version vectors Modelled after version clocks, they encode happens before relationships at an object level.

Partitioning

Types of partitioning

Why partitioning?

With partitioning, each host contains a fraction of the whole dataset.

The main reason is scalability:

Queries can be run in parallel, on parts of the dataset
Reads and writes are spread on multiple machines

Partitioning is always combined with replication. The reason is that with partitioning, a node failure will result in irreversible data corruption.

How to partition?

The following 3 strategies are

Range partitioning Takes into account the natural order of keys to split the dataset in the required number of partitions. Requires keys to be naturally ordered and keys to be equally distributed across the value range.
Hash partitioning Calculates a hash over the each item key and then produces the modulo of this hash to determine the new partition.
Custom partitioning Exploits locality or uniqueness properties of the data to calculate the appropriate partition to store the data to. An example would be pre-hashed data (e.g. git commits) or location specific data (e.g. all records from Europe).

Request routing

On partitioned datasets, clients need to be aware of the partitioning scheme in order to direct queries and writes to the appropriate nodes.

To hide partitioning details from the client, most partitioned systems feature a query router component siting between the client and the partitions.

The query router knows the employed partitioning scheme and directs requests to the appropriate partitions.

Partition and replication example: MongoDB

Sharding and replication in MongoDB

Transactions

Optional reading (not part of the exam material)

Many clients on one database

What could potentially go wrong?

Many clients try to update the same data store at the same time, when….

the network fails, and then…

the database leader server cannot reach its network-mounted disk, so…

the database tries to fail over to a follower, but it is unreachable, and then…

the application writes timeout.

What is the state of the data after this scenario?

As programmers, we are mostly concerned about the code’s happy path. Systems use transactions to guard against catastrophic scenarios.

What are transactions?

Transactions[1] are blocks of operations (reads and writes), that either succeed or fail, as a whole.

Transactions are defined in the context of an application that wants to modify some data and a system that guards data consistency.

The application initiates a transaction, and when done, it commits it.
If the system deems the transaction successful, it guarantees that the modifications are safely stored
If the system deems the transaction unsuccessful, the transaction is rollbacked; the application can safely retry.

Transactions in relational databases

The concept of the transaction started with the first SQL database, System R; while the mechanics changed, the semantics are stable for the last 40 years.

Transactions provide the following guarantees[2]:

Atomicity: The transaction either succeeds or fails; in case of failure, outstanding writes are ignored
Consistency: any transaction will bring the database from one valid state to another
Isolation: Concurrent execution of transactions results in a system state that would be obtained if transactions were executed sequentially
Durability: once a transaction has been committed, it will remain so

Isolation

Isolation guards data consistency in the face of concurrency.

Databases try to hide concurrency issues from applications by providing transaction isolation.

Serializability

By default, isolation entails executing transactions as if they were serially executed. To implement serializability, databases:

Only execute in a single core (e.g. Redis)
Use 2 phase locking: all database objects (e.g. rows) have a shared lock and an exclusive lock
- All readers acquire a shared lock
- A writer needs to acquire the exclusive lock
- Deadlocks can happen when a transaction \(A\) waits for \(B\) to release their exclusive locks and vice versa
Use MVCC and Copy-on-Write data structures

Multi-Version Concurency Control

With MVCC, the database works like Git:

Each transaction \(A\) sees the most recent copy of the data (git checkout -b A)
When \(A\) commits (git commit -a):
- If \(A\) touched no object that was updated before by another transaction the database will create a new version (git checkout master; git merge A)
- If \(A\) changed an object that was updated before, the database with report a conflict
The current state is the result of the application of all intermediate transactions on an initial state
Occasionally, we need to clean up ignored intermediate states (git gc)

Weaker isolation levels

While serialiazable isolation works it may be slow. Consequently, historically, databases made compromises in how they implement isolation.

Dirty reads: A transaction reads data written by a concurrent uncommitted transaction
Non Repeatable Reads: A transaction re-reads data it has previously read and finds that data modified
Phantom reads: Results to queries change due to other transactions being committed

Isolation	DR	NRR	PR
Read uncommitted	X	X	X
Read committed		X	X
Repeatable read			X
Serializable

Distributed transactions

Atomic commits

In a distributed database, a transaction spanning multiple nodes must either succeed on all nodes or fail (to maintain atomicity).

Transactions may also span multiple systems; for example, we may try to remove a record from a database and add it to a queue service in an atomic way.

In contrast to the distributed systems consensus problem, all nodes must agree on whether a transaction has been successfully committed.

The most common mechanism used to dead with distributed atomic commits is the two-phase commit (2PC) protocol.

2-phase commits

In 2PC, we find the following roles:

A coordinator or transaction manager
Participants or cohorts

The 2PC protocol makes the following assumptions:

There is stable, reliable storage on the cohorts
No participant will crash forever
Participants use (indestructible) write-ahead log
Any two nodes can communicate with each other

A transaction in 2PC

A client starts a 2PC transaction by acquiring a globally unique transaction id (GTID)
The client attaches the GTID to each transaction it begins with individual nodes
When the client wants to commit, the coordinator sends a PREPARE message to all nodes
If at least one node replies with ‘NO’, the transaction is aborted
If all nodes reply with ‘YES’, the coordinator writes the decision to disk and tries forever to commit individual transactions to the cohort

D: What problems do you see with 2PC?

Problems with 2PC

Holding locks in nodes: While a decision is being made, faster nodes must lock objects; consequently, a slow or crashing node will kill the performance of the whole system
Exaclty-once semantics: All nodes participating in a distributed transaction must support transactions to guarantee exactly once semantics
Co-ordinator failures: Even though the co-ordinator must be able to resist corruption in case of crashes, this is not always the case and semi-aborted transactions affect performance due to locking

Image credits

Jespen, by K. Kingsbury
MySQL async replication, MySQL documentation
Database partitioning from the Oracle 11g documentation
MongoDB sharding by StackExchange user Jerry

Bibliography

[1]

J. Gray, “The transaction concept: Virtues and limitations (invited paper),” in Proceedings of the seventh international conference on very large data bases - volume 7, 1981, pp. 144–154.

[2]

T. Haerder and A. Reuter, “Principles of transaction-oriented database recovery,” ACM Comput. Surv., vol. 15, no. 4, pp. 287–317, Dec. 1983.

[3]

K. Kingsbury, “Jespen: On the perils of network partitions,” 2013. [Online]. Available: https://aphyr.com/tags/jepsen.

Distributed databases

Georgios Gousios

09 March 2022

Distributed databases