If you’re a developer looking to pick a highly available database or datastore for your application, you have to weigh various parameters to make the right choice. Some of the questions you need to hash out include “What query model do I want and what does the datastore offer?”, “What are its consistency guarantees?”, “What is its reliability and availability SLA?” and “What are its scaling dimensions?”. While the process of picking the right datastore warrants a post of its own, I wanted to focus on one specific parameter of this question for this post.
The focus of this post is: “How does a datastore replicate data? Is replication done lazily or synchronously?”
One might ask: why should an application developer (or any datastore consumer) care how a datastore replicates data? The reality is that how a system replicates data has a great impact on the datastore’s durability and availability.
Traditionally there are two kinds of replication techniques: synchronous replication and asynchronous replication. A datastore using synchronous replication does not acknowledge a write until the write is acknowledged by its replicas (usually a majority of them). Examples of practical datastores that do synchronous replication include Amazon Dynamo (and many of its variants like Cassandra, Voldemort and Riak), Amazon SimpleDB, Amazon RDS Multi-AZ MySQL and Google AppEngine’s High Replication Datastore. On the other hand, a datastore that replicates data asynchronously propagates data lazily: when one of the replicas receives a write, it acknowledges the client right away and propagates the write to the other replicas in the background. Examples of systems that use this model include MySQL replication, MongoDB and Google AppEngine’s Master-Slave Datastore.
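To make the distinction concrete, here is a minimal Python sketch of the two write paths. It is a toy in-process model (the `Replica`, `sync_write` and `async_write` names are mine, not any real datastore’s API): the synchronous path returns only once a quorum of nodes has acknowledged, while the asynchronous path returns after the local write and replicates in a background thread.

```python
import threading

class Replica:
    """A toy in-memory replica."""
    def __init__(self, name):
        self.name = name
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value
        return True  # acknowledge the write

def sync_write(primary, replicas, key, value, quorum):
    """Synchronous replication: return success only once a quorum of
    nodes (primary included) has acknowledged the write."""
    acks = 1 if primary.apply(key, value) else 0
    for r in replicas:
        if r.apply(key, value):
            acks += 1
    return acks >= quorum

def async_write(primary, replicas, key, value):
    """Asynchronous replication: acknowledge after the local write alone,
    and propagate to the replicas in a background thread."""
    primary.apply(key, value)
    t = threading.Thread(
        target=lambda: [r.apply(key, value) for r in replicas])
    t.start()
    return t  # the client already has its ack at this point
```

In the synchronous path the acknowledgement is gated on the replicas; in the asynchronous path the client never waits on them at all, which is exactly the durability-versus-latency trade-off this post is about.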
Which one is better? Well, it really depends on what you’re planning to use it for. Datastores that synchronously replicate data provide higher durability and availability. Why? Consider the failure scenario: even if the replica that took a write dies as soon as it acknowledges the client, the data is not lost; clients can read their last acknowledged writes from the other available replicas. In an asynchronously replicated system, however, clients cannot know whether their writes were propagated to the other replicas before the failure of the replica that coordinated the writes. If that replica has failed permanently (due to a permanent node failure or so), then you can experience data loss, where the latest updates are gone (a durability issue). If the replica has failed temporarily, clients can still read an old version of the data from the other available replicas. However, they will not be able to perform any writes unless they are willing to deal with merging conflicting updates when the replica comes back online.
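The data-loss scenario can be sketched in a few lines of Python. This is a toy model with hypothetical names (real systems replicate over the network, of course); it just shows that an acknowledged write can vanish when the primary dies before the background propagation runs:

```python
class Replica:
    """A toy replica: a key/value map plus a liveness flag."""
    def __init__(self):
        self.data = {}
        self.alive = True

primary, backup = Replica(), Replica()

# Asynchronous write: the primary stores the value and acks the client...
primary.data["balance"] = 100
client_got_ack = True

# ...but crashes permanently before the background replication runs.
primary.alive = False

# The client's acknowledged write now exists on no surviving replica.
survivors = [r for r in (primary, backup) if r.alive]
lost = all("balance" not in r.data for r in survivors)
```

Under synchronous replication this cannot happen: the client would not have received its acknowledgement until `backup` also held the value.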
So, why would anyone run the risk of picking a less durable, asynchronously replicated datastore? Because asynchronous replication provides lower write latency (a write is acknowledged locally by a single replica before being propagated to the others) and better read scaling (adding more replicas does not increase your write latency the way it does in a synchronously replicated system).
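A back-of-the-envelope model makes the latency argument concrete. The numbers below are assumptions for illustration (1 ms local write, ~10 ms round trips to three replicas), not measurements from any particular system:

```python
# Assumed figures, for illustration only.
local_write_ms = 1.0
replica_rtt_ms = [10.0, 12.0, 15.0]  # round-trip time to each of 3 replicas

# Synchronous: the ack waits for the quorum-th fastest replica round trip.
quorum = 2  # replica acks needed (besides the primary's own write)
sync_latency_ms = local_write_ms + sorted(replica_rtt_ms)[quorum - 1]

# Asynchronous: the ack covers only the local write.
async_latency_ms = local_write_ms

print(sync_latency_ms, async_latency_ms)  # 13.0 1.0
```

Note that adding more replicas grows the list the synchronous write has to wait on, while the asynchronous acknowledgement stays at the local write cost regardless — which is the read-scaling advantage in a nutshell.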
So, in a nutshell, the rule of thumb I tend to use in picking between these two techniques is: if your application requires high durability and availability, pick a synchronously replicated datastore. If your application can tolerate losing a small percentage of writes due to a failed replica (usually the master replica), e.g., a soft-state or caching system, then an asynchronously replicated datastore is fine. A great example is how MySQL users employ read replicas for read scaling. Another good use of asynchronous replication is keeping the asynchronous replica as a backup copy for disaster recovery.