Synchronous vs. Asynchronous replication strategy: Which one is better?

Posted in Uncategorized on September 22, 2011 by swaminathans

If you’re a developer looking to pick a highly available database or datastore for your application, you have to weigh various parameters to make the right choice. Some of the questions you need to hash out include: “What query model do I want, and what does the datastore offer?”, “What are its consistency guarantees?”, “What is its reliability and availability SLA?” and “What are its scaling dimensions?”. While the process of picking the right datastore warrants a post of its own, I wanted to focus on one specific parameter in this post.

The focus of this post is: “How does a datastore replicate data? Is replication done lazily or synchronously?”

One might ask: why should an application developer (or any datastore consumer) care how a datastore replicates data? The reality is that how a system replicates data has a great impact on its durability and availability.

Traditionally, there are two kinds of replication techniques: synchronous replication and asynchronous replication. A datastore using synchronous replication does not acknowledge a write until the write is acknowledged by its replicas (usually a majority of them). Examples of practical datastores that replicate synchronously include Amazon Dynamo (and many of its variants, such as Cassandra, Voldemort and Riak), Amazon SimpleDB, Amazon RDS Multi-AZ MySQL and Google App Engine’s High Replication Datastore. On the other hand, a datastore that replicates data asynchronously propagates data to its replicas lazily: when one of the replicas gets a write, it acknowledges the client right away and propagates the write to the other replicas in the background. Examples of systems that use this model include MySQL replication, MongoDB and Google App Engine’s Master/Slave Datastore.
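To make the difference concrete, here is a minimal, illustrative Python sketch of the two write paths. All class and method names are my own (not from any of the datastores above), and the “replicas” are just in-process objects:

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}

class ReplicatedStore:
    """Toy sketch of synchronous vs. asynchronous replication."""

    def __init__(self, replicas, synchronous=True):
        self.replicas = replicas
        self.synchronous = synchronous
        self.quorum = len(replicas) // 2 + 1  # majority quorum
        self.pending = []  # writes not yet propagated (async mode)

    def write(self, key, value):
        primary, followers = self.replicas[0], self.replicas[1:]
        primary.data[key] = value
        if self.synchronous:
            # Acknowledge only after a majority of replicas hold the write.
            acks = 1
            for r in followers:
                r.data[key] = value
                acks += 1
                if acks >= self.quorum:
                    return "ack"
            return "ack"
        # Async: acknowledge immediately; propagation happens later.
        self.pending.append((key, value))
        return "ack"

    def propagate(self):
        # Background replication step. If the primary dies before this
        # runs, the pending writes are lost -- the durability risk
        # discussed below.
        for key, value in self.pending:
            for r in self.replicas[1:]:
                r.data[key] = value
        self.pending.clear()
```

In the synchronous mode the client’s acknowledged write is already on a majority of replicas; in the asynchronous mode the window between `write` and `propagate` is exactly where data loss can happen.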

Which one is better? Well, it really depends on what you’re planning to use it for. Datastores that replicate data synchronously provide higher durability and availability. Why? Consider the failure scenario where the replica that took a write dies right after acknowledging it to the client: the data is not lost, and clients can read their last acknowledged writes from the other available replicas. In an asynchronously replicated system, however, clients cannot know whether their writes were propagated to the other replicas before the failure of the replica that coordinated them. If that replica has failed permanently (due to a permanent node failure or the like), you can experience data loss, where the latest updates are gone (a durability issue). If the replica has failed only temporarily, clients can still read an old version of the data from the other available replicas. However, they will not be able to perform any writes unless they are willing to deal with merging conflicting updates when the failed replica comes back online.

So, why would anyone run the risk of picking a less durable, asynchronously replicated datastore? Because an asynchronously replicated datastore provides lower write latency (a write is acknowledged locally by a single replica before being propagated to the others) and better read scaling (adding more replicas does not increase your write latency the way it does in synchronously replicated systems).

So, in a nutshell, the rule of thumb I tend to use in picking between these two techniques is: if your application requires high durability and availability, pick a synchronously replicated datastore. If your application can tolerate losing a small percentage of writes (e.g., a soft-state or caching system) when a replica fails (usually the master replica), then an asynchronously replicated system will do. A great example is how MySQL users employ read replicas for read scaling. Another great use of asynchronous replication is keeping the asynchronous replica as a backup copy for disaster recovery.

Reading List for Distributed Systems

Posted in Distributed Systems on September 7, 2011 by swaminathans

I quite often get asked by friends and colleagues interested in learning about distributed systems: “Please tell me the top papers and books to read to learn more about distributed systems.” I used to write one-off emails giving a few pointers. Now that I’ve been asked enough times, I thought it a worthwhile exercise to put these together in a single post.

Please feel free to comment if you think there are more papers that need to be added.


Distributed systems Theory:

I believe these are some of the foundational theory papers you must read before you go on to build large scale systems.

Distributed Consensus:

Paxos is the gold standard of distributed consensus protocols. Amazingly, it is also the simplest of them. There is a huge story behind how the Paxos paper was delayed in getting published, as the original was written by Leslie Lamport in a non-obvious fashion. Paxos became more approachable to the general masses after he wrote an abridged version of the paper.

The original paper: The Part-Time Parliament.

If that paper is too abstract for you, I recommend reading the lighter version: Paxos Made Simple.

I always believe a theory is well understood only if you understand how others put it into practice. Google has documented their experience with Paxos here:

Paxos Made Live talks about Google’s experience using Paxos in Chubby, Google’s Paxos-based lock service.
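For readers who want to see the shape of the protocol before diving into the papers, here is a hedged sketch of single-decree Paxos. The acceptors are in-process objects and all names are my own; a real deployment adds networking, durable state and retry logic:

```python
class Acceptor:
    def __init__(self):
        self.promised = -1      # highest ballot promised so far
        self.accepted = None    # (ballot, value) accepted, if any

    def prepare(self, ballot):
        # Phase 1: promise not to accept lower-numbered ballots.
        if ballot > self.promised:
            self.promised = ballot
            return ("promise", self.accepted)
        return ("nack", None)

    def accept(self, ballot, value):
        # Phase 2: accept unless a higher ballot was promised meanwhile.
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return "accepted"
        return "nack"

def propose(acceptors, ballot, value):
    """One proposer round of single-decree Paxos over a majority quorum."""
    majority = len(acceptors) // 2 + 1
    promises = [a.prepare(ballot) for a in acceptors]
    granted = [acc for verdict, acc in promises if verdict == "promise"]
    if len(granted) < majority:
        return None  # this ballot cannot proceed
    # The key safety rule: if any acceptor already accepted a value,
    # we must propose the highest-ballot accepted value, not our own.
    prior = [acc for acc in granted if acc is not None]
    if prior:
        value = max(prior)[1]
    acks = sum(1 for a in acceptors if a.accept(ballot, value) == "accepted")
    return value if acks >= majority else None
```

The second assertion below is the whole point of the protocol: a later proposer with a higher ballot discovers and re-proposes the already-chosen value rather than overwriting it.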

Consistency and Ordering in Distributed Systems:

Lamport, L. Time, clocks, and the ordering of events in a distributed system. ACM Communications, 21(7), pp. 558-565, 1978.

L. Lamport, R. Shostak, and M. Pease, The Byzantine Generals Problem, ACM Transactions on Programming Languages and Systems, July 1982, pages 382-401.
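The clock rules from the 1978 Time, Clocks paper are short enough to sketch directly (illustrative Python, my own naming):

```python
class LamportClock:
    """Minimal sketch of Lamport's logical clock rules."""

    def __init__(self):
        self.time = 0

    def local_event(self):
        # Each local event ticks the clock.
        self.time += 1
        return self.time

    def send(self):
        # Sending is an event; the returned timestamp rides on the message.
        self.time += 1
        return self.time

    def receive(self, msg_time):
        # On receipt, advance past the message's timestamp so that
        # a send is always ordered before its matching receive.
        self.time = max(self.time, msg_time) + 1
        return self.time
```

This gives the partial "happened-before" ordering the paper defines: causally related events always get increasing timestamps, while concurrent events may not.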

Replication in Distributed Databases:

Demers et al., Epidemic algorithms for replicated database maintenance, PODC 1987.

Jerome H. Saltzer and M. Frans Kaashoek, Principles of Computer System Design, Chapter 10: Consistency.

Lindsay, B.G., et al., “Notes on Distributed Databases”, Research Report RJ2571(33471), IBM Research, July 1979.

Distributed Hash Tables:

Distributed Hash Tables are distributed systems that use hashing techniques for routing and membership.  Most of them use consistent hashing as the foundation for routing.

Seminal work on consistent hashing techniques: Consistent Hashing and Random Trees. Application of consistent hashing to caching: Web caching with consistent hashing.
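The core idea of consistent hashing is simple enough to sketch. Below is an illustrative ring with virtual nodes (my own naming; MD5 is an arbitrary hash choice here, not something the papers mandate):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Illustrative consistent hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        # Each physical node owns many points ("virtual nodes") on the
        # ring, which smooths out the key distribution.
        self.ring = []  # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):
                self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def lookup(self, key):
        # Walk clockwise to the first virtual node at or after the key.
        h = self._hash(key)
        i = bisect.bisect(self.ring, (h,))
        return self.ring[i % len(self.ring)][1]
```

The property that made this the foundation of DHTs: removing a node only remaps the keys that node owned, leaving everything else in place.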

What followed over the last decade was researchers using consistent hashing to build P2P systems and routing techniques on top of it. Examples of such systems include Chord, Pastry and Tapestry.

Real-life Distributed Systems and Data Stores:

The following papers on distributed databases are seminal and great examples of distributed systems. A must-read for people interested in building distributed systems:

Amazon Dynamo: Amazon’s own key-value store

Google File System: Google’s very own distributed file system.

Google BigTable: Google’s distributed datastore.

Map-Reduce: A seminal piece of work that has powered the Hadoop ecosystem

Autopilot: Automatic Data Center Management Michael Isard April 2007.

Update [9/13/2011]: Alex Feinberg pointed out his reading list, which is equally impressive and, not so surprisingly, has a great number of papers in common with this one. Thanks, Alex!

Filtering effects in cache hierarchies

Posted in Uncategorized on August 31, 2011 by swaminathans

A friend of mine was building a multi-tiered application with a cache fronting a database, and was investigating why his database memory hit ratio was not as high as he expected and how to optimize it. This matters mostly when you are paranoid about latency percentiles at TP99 or TP99.9. After a deeper investigation, he found what he called “cherry picking” by the caching tier: the next level never saw the requests with high cache locality. A similar observation was made in Web cache research, where it is referred to as filtering effects.

One of the important works in this area is by Carey Williamson: “On Filter Effects in Web Caching Hierarchies”. The paper uses trace-driven simulation to study what happens to workload locality as requests travel through various cache hierarchies. Important results to call out:
– Proxy caches filter out the most popular documents from a workload. The filtering effect typically changes the Zipf-like document popularity profile of the input stream into a two-part piecewise-linear document popularity profile for the output stream. The upper leftmost part of the plot is significantly flattened.
– The percentage of unique documents referenced and the percentage of one-timer documents tend to increase as the request stream passes through a cache.
– Among the caches in a caching hierarchy, the first-level cache has the most pronounced filtering effect on the workload.
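These filtering effects are easy to reproduce in a small simulation. The sketch below is illustrative (not from the paper): it pushes a Zipf-like request stream through an LRU cache and inspects the miss stream that the next tier would see:

```python
import collections
import random

def zipf_workload(n_docs, n_reqs, s=1.0, seed=42):
    # Zipf-like popularity: document i is requested with weight 1/(i+1)^s.
    rng = random.Random(seed)
    weights = [1.0 / (i + 1) ** s for i in range(n_docs)]
    return rng.choices(range(n_docs), weights=weights, k=n_reqs)

def filter_through_lru(requests, capacity):
    """Return the miss stream an LRU cache passes to the next tier."""
    cache = collections.OrderedDict()
    misses = []
    for doc in requests:
        if doc in cache:
            cache.move_to_end(doc)  # hit: absorbed by this tier
        else:
            misses.append(doc)      # miss: forwarded downstream
            cache[doc] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return misses

def top_doc_share(stream):
    # Fraction of the stream taken by the single most popular document.
    counts = collections.Counter(stream)
    return counts.most_common(1)[0][1] / len(stream)
```

Running this, the most popular documents almost never appear in the miss stream: the first-level cache has flattened the popularity profile, exactly the filtering effect the paper describes.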

Even though this work is ten years old, it is a highly recommended read for folks trying to understand the effects of caching on their systems.

Amazon ElastiCache

Posted in Uncategorized on August 23, 2011 by swaminathans

Last night, my sister team in AWS launched a service I’m very excited about: Amazon ElastiCache. Historically, caching has been one of the most widely used techniques for building scalable web applications: caches store the most often accessed computation results that take longer (or are harder) to re-compute at the source. In-memory caches are normally used to front databases so that frequently accessed results can be retrieved from memory quickly (see examples of how to use MySQL and memcached together, here). However, to ensure that the in-memory cache does not become a scalability bottleneck itself, distributed cache clusters use techniques like distributed hash tables (DHTs) so that the cache cluster can be “scaled out”. As caching systems grow, managing them in a large-scale environment becomes a challenge.
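The usual way applications combine a cache like memcached with a database is the cache-aside pattern; here is a minimal sketch (all names are illustrative, and the `db_fetch` callable stands in for the slow source query):

```python
import time

class CacheAside:
    """Illustrative cache-aside pattern for fronting a database.

    In practice the dict below would be a memcached (or ElastiCache)
    cluster; a plain dict keeps the sketch self-contained.
    """

    def __init__(self, db_fetch, ttl=300):
        self.db_fetch = db_fetch
        self.ttl = ttl
        self.cache = {}   # key -> (value, expiry timestamp)
        self.db_hits = 0  # how often we had to go to the source

    def get(self, key):
        entry = self.cache.get(key)
        if entry and entry[1] > time.time():
            return entry[0]            # served from memory
        value = self.db_fetch(key)     # cache miss: query the database
        self.db_hits += 1
        self.cache[key] = (value, time.time() + self.ttl)
        return value
```

Repeated reads of the same key hit memory, so the database only sees the first lookup (until the TTL expires).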

Today, AWS has made the process of running a cache cluster easier with a new managed cache offering called ElastiCache. A quote from the detail page sums it up well:

“Amazon ElastiCache is a web service that makes it easy to deploy, operate, and scale an in-memory cache in the cloud. Amazon ElastiCache is protocol-compliant with Memcached, a widely adopted memory object caching system, so code, applications, and popular tools that you use today with existing Memcached environments will work seamlessly with the service. Amazon ElastiCache simplifies and offloads the management, monitoring, and operation of in-memory cache environments, enabling you to focus on the differentiating parts of your applications.”

Congratulations team!

Growth of AWS

Posted in Uncategorized on March 4, 2011 by swaminathans

This week, BusinessWeek posted a great article on cloud computing and AWS. The one statement that really caught my eye, and which highlights our growth, is:

“Keeping up with the demand requires frantic expansion: Each day, Jassy’s operation (AWS) adds enough computing muscle to power one whole Amazon.com circa 2000, when it was a $2.8 billion business.”

Please take a look at the article here.

Shameless hiring plug: do you want to be part of the team that builds such disruptive technologies? Email me:

IEEE Network Magazine: Cloud Computing Special Issue

Posted in Cloud Computing on January 9, 2011 by swaminathans

I’m co-editing a special issue on Cloud computing for IEEE network magazine. For folks interested, take a look at the CFP below. Submission deadline is Jan 15, 2011.

Call For Papers

Final submissions due 15 January 2011

Cloud Computing is a recent trend in information technology and scientific computing that moves computing and data away from desktop and portable PCs into large Data Centers. Cloud computing is based on a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. Cloud Computing opens new perspectives in internetworking technologies, raising new issues in the architecture, design and implementation of existing networks and data centers. The relevant research has just recently gained momentum and the space of potential ideas and solutions is still far from being widely explored.

This special issue of the IEEE Network Magazine will feature articles that discuss networking aspects of cloud computing. Specifically, it aims at delivering the state-of-the-art research on current cloud computing networking topics, and at promoting the networking discipline by bringing to the attention of the community novel problems that must be investigated. Areas of interest include, but are not limited to:

  • Data center architecture
    • Server interconnection
    • Routers (switching technology) for data center deployment
    • Centralization of network administration/routing
  • Energy-efficient cloud networking
    • Low energy routing
    • Green data centers
  • Measurement-based network management
    • Traffic engineering
    • Network anomaly detection
    • Usage-based pricing
  • Security issues in clouds
    • Secure routing
    • Security threats and countermeasures
    • Virtual network security
  • Virtual Networking
    • Virtualized network resource management
    • Virtual cloud storage
  • Futuristic topics
    • Interclouds
    • Cloud computing support for mobile users

Guest Editors

Swami Sivasubramanian
Amazon, USA

Dimitrios Katsaros
University of Thessaly, Greece

George Pallis
University of Cyprus, Cyprus

Athena Vakali
Aristotle University of Thessaloniki, Greece

Datacenter Networks and move towards scalable network architectures

Posted in Cloud Computing on November 2, 2010 by swaminathans

Last week, I was super excited to attend James’s talk “Datacenter Networks Are in My Way” in our internal talk series, Principals of Amazon. As always, James’s talk was illuminating. I highly encourage everyone to read James’s post and the slides.

A few takeaways from James’s talk worth calling out:

– Contrary to popular belief, power is not the leading driver of datacenter operational cost. It is actually the server cost (about 57%).

– The above leads to the conclusion that techniques like shutting down servers when they are not being used, while interesting, do not offer a big return on investment. Instead, you are better off doing the exact opposite: utilizing your existing server investment to the fullest.

– Traditional DC networks are usually oversubscribed and live in a very vertical world where all network components come from a single vendor. They are also built more like mainframes, on a “scale up” (get bigger boxes) model instead of a “scale out” model. This is bad for sustainability and reliability.

– To enable higher server utilization, you need your datacenter networks to support full connectivity between hosts and not be oversubscribed.
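Oversubscription is just arithmetic on link capacities. A small illustrative example (the switch numbers below are hypothetical, not from the talk):

```python
def oversubscription(downlinks, downlink_gbps, uplinks, uplink_gbps):
    """Ratio of host-facing bandwidth to upstream bandwidth on a switch.

    A ratio of 1.0 means hosts can all talk out of the rack at full
    rate; anything above 1.0 means the uplinks are a bottleneck.
    """
    return (downlinks * downlink_gbps) / (uplinks * uplink_gbps)

# A hypothetical top-of-rack switch: 48 servers at 1 Gbps each, with
# 4 x 10 Gbps uplinks. Hosts can offer 48 Gbps, but only 40 Gbps can
# leave the rack, so the switch is oversubscribed 1.2:1.
```

A non-oversubscribed design keeps this ratio at (or below) 1.0 at every tier, which is what full host-to-host connectivity requires.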

The above takeaways tell us that we need to build DC networks that can be easily scaled (moving from oversubscribed to non-oversubscribed). To scale DC networks, we need a scale-out network architecture, and systems like OpenFlow enable that. It is interesting to see that what we learned in distributed systems and datastores applies to datacenter networks as well: scale-out (horizontal scaling) beats scale-up (vertical scaling) in the long run.