Archive for the Distributed Systems Category

Distributed Locking: When to use it? How?

Posted in Distributed Systems on May 7, 2013 by swaminathans

One of the common problems faced by anyone building large scale distributed systems is: how do you ensure that only one process (read: worker, server, or agent) across a fleet of servers acts on a resource? Before jumping into how to solve this, let us look at the problem in more detail and see where it is applicable:

What are the examples of where we need a single entity to act on a resource?

  • Ensure only one server can write to a database or write to a file.
  • Ensure that only one server can perform a particular action.
  • Ensure that there is a single master that processes all writes (e.g., Google’s use of Chubby for leader election in GFS, MapReduce, etc.).

Coordination across distributed servers

All these scenarios need a simple way to coordinate the execution of processes and workflows to ensure that only one entity is performing an action. This is one of the classic problems in distributed systems, as you have to not only elect a leader across a set of distributed processes (leader election) but also detect its failure (failure detection), change the group membership appropriately, and elect a new leader.

A classic way to solve this problem is to use some form of system that can bring consensus. This can be a single server which acts as the source of truth (not a good idea, for obvious reasons of availability and durability), a system that implements a consensus algorithm like Paxos, or a form of leader election built on top of a strongly consistent datastore (I will get to this later).

What is Paxos?

Paxos is the gold standard in consensus algorithms. It is a distributed consensus protocol (or a family of protocols, if you include all its derivatives) designed to reach agreement across a family of unreliable distributed processes. The reason this is hard is that the participating processes can be unreliable due to failures (e.g., server failure, datacenter failure, or network partitions) or malicious intent (Byzantine failures).

You can read more about Paxos here:

How can I use Paxos to solve the problem of coordination?

At Amazon, we use consensus algorithms heavily in various parts of the company. A few ways we have solved this problem:

  • Build a consensus library: This is a library that implements Paxos (or one of its derivatives) that the application can consume. Any decision that requires coordination, such as electing a leader to execute a workflow, is passed as a transaction to this library, and upon successful completion of the transaction the application proceeds. While this approach worked well in the beginning, we quickly realized that it required many of our application developers to be experts in distributed systems. They had to understand the operational characteristics of failure modes and how to debug them. If you think this is easy, I suggest reading this paper.
  • Lock service: A lock service is a form of consensus service that converts the problem of reaching consensus into one of handing out locks. Basically, a distributed set of processes compete to acquire a lock, and whoever gets the lock holds the “baton” to execute a workflow, write to a database, etc. This is a very popular model. The advantage of this approach over a library is twofold:
  1. It is a service rather than a product, so application developers do not have to worry about operations, debugging various failure modes, etc.
  2. The abstraction of locks is a lot simpler than trying to model the problem as consensus or transactions.

The lock service has been a widely popular means of solving this problem. Amazon has built its own lock service for internal use, Google built Chubby, and there are open source systems like ZooKeeper.

Using a strongly consistent datastore to build lock service

One of the common tricks we have seen used in the past is to emulate lock service behavior (or APIs) using a strongly consistent datastore. Basically, a strongly consistent datastore that is replicated, highly available, and durable (e.g., DynamoDB) can be used to persist the information on who is holding the lock, for how long, and so on.

Now, the issue this solution does not solve is failure detection. Traditionally, a lock service also provides a simple library that clients can use to heartbeat with the service. This ensures that when a lock holder is partitioned away, the lock service revokes the lock and the library notifies the lock holder to abort its critical section.

When using a datastore to build lock service APIs, you have to solve this problem yourself. On this front, I came across an interesting post on how Ryan has built a lock service API on DynamoDB. He has outlined the problem, solution, and limitations quite well, and it is a very interesting read.
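To make the trick concrete, here is a minimal sketch of a lease-based lock built on a store that offers an atomic conditional write. All names here (FakeStrongStore, try_acquire, heartbeat) are hypothetical, and the in-memory store merely stands in for a real strongly consistent datastore; this is not Ryan’s actual design.

```python
class FakeStrongStore:
    """In-memory stand-in for a strongly consistent, durable datastore
    (think DynamoDB) that offers an atomic compare-and-set write."""
    def __init__(self):
        self.items = {}

    def put_if(self, key, new, expected):
        # A real store performs this check-and-write atomically on the server.
        if self.items.get(key) == expected:
            self.items[key] = new
            return True
        return False

def try_acquire(store, lock, owner, lease, now):
    """Take the lock if it is free or its lease has expired."""
    current = store.items.get(lock)
    if current is not None and current["expires"] > now:
        return False  # held by a live owner
    record = {"owner": owner, "expires": now + lease}
    return store.put_if(lock, record, current)  # fails if someone raced us

def heartbeat(store, lock, owner, lease, now):
    """Extend the lease; only the current holder may do so."""
    current = store.items.get(lock)
    if current is None or current["owner"] != owner:
        return False
    return store.put_if(lock, {"owner": owner, "expires": now + lease}, current)
```

The lease is what substitutes for failure detection: a holder that stops heartbeating simply lets its lease lapse, after which another process can steal the lock. The hard part, as noted above, is making the old holder abort its critical section before the lease expires.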

Do you need lock service at all?

An obvious question to ask yourself before venturing into any of the approaches above is: do you need a lock service at all? Sometimes the answer is that you don’t, as executing the same workflow twice may not be an issue. In many other cases, though, you do need such an abstraction. So, please make this decision carefully.

Interested in building large scale systems like a distributed lock service? If so, email me: swami at amazon dot com.

Achieving Rapid Response Times in Large Scale Systems

Posted in Distributed Systems on June 19, 2012 by swaminathans

[Editor’s note: I know I am writing a blog after ages. Hoping to be more regular going forward and maintain a regular schedule]

Recently, I came across this slide deck from Google’s Jeff Dean about achieving rapid response times in large scale systems. The talk describes various load management and balancing techniques Google uses to decrease response times, both on average and at high percentiles.

Some salient points of this talk:

– Increased fan-out (aka larger scale) makes it more challenging to keep latencies low at high percentiles, as more things can fail or be slow. Example: you are trying to serve a request by querying across multiple partitions, and you are only as fast as the slowest partition. This becomes a serious problem at the high percentiles (99th or 99.9th).
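The fan-out effect is easy to quantify with a back-of-the-envelope sketch (assuming independent backends, which is optimistic):

```python
def p_request_slow(p_server_slow, fanout):
    """Probability that at least one of `fanout` parallel backends lands
    in its own slow tail; the request is as slow as its slowest backend."""
    return 1 - (1 - p_server_slow) ** fanout

# A backend that is slow only 1% of the time still drags down most
# requests once the fan-out reaches 100:
for n in (1, 10, 100):
    print(n, round(p_request_slow(0.01, n), 3))  # 0.01, 0.096, 0.634
```

In other words, a per-server TP99 problem becomes a median-latency problem at 100-way fan-out, which is why the techniques below target the tail.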

What intrigued me about this talk, and its conclusion which I will get to below, is the emphasis on optimizing for TP99 latency. As we called out in the original Amazon Dynamo paper, engineering systems to optimize for TP99 latency is hard and requires careful optimization at various levels, such as good engineering practices (prioritized queues, careful management of background tasks). Here Jeff talks about similar practices and a few more techniques they used to address variability at large scale:

  • General good engineering practice to build individual components right (priority queues, better management of background tasks)
  • Better load management at the server/service side: by spreading the load across servers, overload protection and migration of hot spots
  • Better load balancing at the client side: sending backup requests to another replica dynamically when you find the original replica is slow (an interesting question is when to trigger the backup request, but one can imagine picking a reasonable value like the TP50 latency), better cancellation schemes for backup requests, and finally returning degraded results.
  • He mentioned that more than 2 backup requests provide diminishing returns. I think this totally makes sense, probably due to the power of two choices.
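The backup-request idea above can be sketched in a few lines. This is a minimal illustration, not Google’s implementation: `hedged_call` and the replica callables are made-up names, the “RPC” is just a function call, and a real system would also cancel the losing request, which this sketch omits.

```python
import queue
import threading
import time

def hedged_call(primary, backup, hedge_after):
    """Issue the request to `primary`; if no answer arrives within
    `hedge_after` seconds (e.g., the observed TP50), fire a backup
    request to another replica and return whichever answers first."""
    answers = queue.Queue()

    def call(replica):
        answers.put(replica())  # replica() stands in for an RPC

    threading.Thread(target=call, args=(primary,), daemon=True).start()
    try:
        return answers.get(timeout=hedge_after)  # primary was fast enough
    except queue.Empty:
        threading.Thread(target=call, args=(backup,), daemon=True).start()
        return answers.get()  # first of the two outstanding replies wins

# Hypothetical replicas: one stalled, one healthy.
stalled = lambda: (time.sleep(0.5), "stalled reply")[1]
healthy = lambda: "healthy reply"
print(hedged_call(stalled, healthy, hedge_after=0.05))  # -> healthy reply
```

Triggering at TP50 means roughly half of all requests spawn a backup, which doubles load on the slow half only; triggering later (say TP95) trades a longer wait for far fewer duplicate requests.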

Overall, a great talk; I’m excited to have come across it.

Reading List for Distributed Systems

Posted in Distributed Systems on September 7, 2011 by swaminathans

I quite often get asked by friends and colleagues who are interested in learning about distributed systems: “Please tell me the top papers and books to read to learn more about distributed systems.” I used to write one-off emails giving a few pointers. Now that I’ve been asked enough times, I thought it a worthwhile exercise to put these together in a single post.

Please feel free to comment if you think there are more papers that need to be added.


Distributed Systems Theory:

I believe these are some of the foundational theory papers you must read before you go on to build large scale systems.

Distributed Consensus:

Paxos is the gold standard of distributed consensus protocols. Amazingly, it is also the simplest of them all. There is a long story behind how publication of the Paxos paper was delayed, as the original paper was written by Leslie Lamport in a non-obvious fashion. Paxos became more approachable to the general masses after he wrote an abridged version of the paper.

Original part-time parliament:

If that paper is too abstract for you, I recommend reading the lighter version: Paxos Made Simple.

I always believe a theory is well understood only if you understand how others put it into practice. Google has documented their experience with Paxos here:

Paxos Made Live talks about Google’s experience using Paxos in Chubby, Google’s Paxos-based lock service.

Consistency and Ordering in Distributed Systems:

Lamport, L. Time, clocks, and the ordering of events in a distributed system. ACM Communications, 21(7), pp. 558-565, 1978.

L. Lamport, R. Shostak, and M. Pease, The Byzantine Generals Problem, ACM Transactions on Programming Languages and Systems, July 1982, pages 382-401.

Replication in Distributed Databases:

Demers et al., Epidemic algorithms for replicated database maintenance, PODC 1987.

Jerome H. Saltzer and M. Frans Kaashoek, Principles of Computer System Design, Chapter 10: Consistency.

Lindsay, B.G., et. al., “Notes on Distributed Databases”, Research Report RJ2571(33471), IBM Research, July 1979

Distributed Hash Tables:

Distributed Hash Tables are distributed systems that use hashing techniques for routing and membership.  Most of them use consistent hashing as the foundation for routing.

Seminal work on consistent hashing techniques: Consistent Hashing and Random Trees. Application of consistent hashing to caching: Web caching with consistent hashing.

What followed in the last decade was researchers using consistent hashing to build P2P systems and routing techniques on top of it. Examples of such systems include Chord, Pastry, and Tapestry.
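The core idea behind consistent hashing can be shown in a toy sketch (the `HashRing` class and node/key names are made up for illustration; real systems like Chord add virtual nodes and finger tables on top of this):

```python
import bisect
import hashlib

def point(key):
    """Map a string to a position on a 2^32-sized hash ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** 32)

class HashRing:
    """Toy consistent-hash ring: a key belongs to the first node clockwise
    from its position, so adding a node only remaps keys on one arc."""
    def __init__(self, nodes):
        self.points = sorted((point(n), n) for n in nodes)

    def lookup(self, key):
        i = bisect.bisect(self.points, (point(key), ""))
        return self.points[i % len(self.points)][1]  # wrap past the end

ring3 = HashRing(["node-a", "node-b", "node-c"])
ring4 = HashRing(["node-a", "node-b", "node-c", "node-d"])
keys = [f"session-{i}" for i in range(1000)]
moved = sum(ring3.lookup(k) != ring4.lookup(k) for k in keys)
print(f"{moved} of {len(keys)} keys moved after adding node-d")
```

Contrast this with naive `hash(key) % n` partitioning, where changing `n` from 3 to 4 remaps almost every key; on the ring, the only keys that move are those that now fall on the new node’s arc.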

Real life Distributed systems and data Stores:

The following distributed databases papers are seminal and great examples of distributed systems. A must read for people interested in building distributed systems:

Amazon Dynamo: Amazon’s own key-value store

Google File System: Google’s very own distributed file system.

Google BigTable: Google’s distributed datastore.

Map-Reduce: A seminal piece of work that has powered the Hadoop ecosystem

Autopilot: Automatic Data Center Management Michael Isard April 2007.

Update [9/13/2011]: Alex Feinberg pointed out his reading list, which is equally impressive and, not so surprisingly, has a great number of papers in common. Thanks, Alex!

WebApps 11

Posted in Distributed Systems on October 6, 2010 by swaminathans

Like many of you know, I’ve spent the past 4 years at Amazon building out various parts of the Amazon and AWS infrastructure that enables Amazon and other AWS users to build highly available and scalable applications.

It gives me immense pleasure to be part of the program committee for USENIX WebApps 2011. This conference is primarily focused on building highly scalable Web applications. I’m looking forward to reviewing some great submissions and to the conference itself.

I highly recommend submitting your ideas.

CFP details below:

WebApps ’11 Call for Papers

2nd USENIX Conference on Web Application Development (WebApps ’11)

June 15–16, 2011
Portland, OR

WebApps ’11 will take place during USENIX Federated Conferences Week, June 12–17, 2011.

Important Dates

  • Submissions due: January 21, 2011, 11:59 p.m. PST (hard deadline, no extensions, no exceptions, so don’t ask)
  • Notification to authors: March 17, 2011
  • Final files due: May 3, 2011

Conference Organizers

Program Chair
Armando Fox, University of California, Berkeley

Program Committee
Adam Barth, Google Inc.
Abdur Chowdhury, Twitter
Jon Howell, Microsoft Research
Collin Jackson, Carnegie Mellon University
Bobby Johnson, Facebook
Emre Kıcıman, Microsoft Research
Michael E. (“Max”) Maximilien, IBM Research
Owen O’Malley, Yahoo! Research
John Ousterhout, Stanford University
Swami Sivasubramanian, Amazon Web Services
Geoffrey M. Voelker, University of California, San Diego
Nickolai Zeldovich, Massachusetts Institute of Technology


The Web is now the dominant platform for delivering interactive applications to hundreds of millions of users. Such applications are now expected to scale effortlessly from tens of users to tens of millions of users in a single day while providing a responsive “always-on” experience. These demands, as well as the new possibilities opened by the proliferation of Web-capable mobile devices, require that Web apps’ design and operation be elastic, failure-tolerant, and seamlessly scalable, supporting multiple devices and access methods.

Like the inaugural WebApps ’10, WebApps ’11 seeks to attract cutting-edge research that advances the state of the art, not only on novel Web applications but also on infrastructure, tools, and techniques that support the development, analysis/testing, operation, or deployment of those applications.


Possible topics include but are not limited to:

  • Storage for Web-scale applications
  • Techniques for testing and debugging
  • Novel strategies for fault tolerance or high availability in Web apps
  • The Web as an emerging platform in new application areas
  • HCI techniques related specifically to Web apps
  • Measurement, modeling, workload generation, and other tools to aid experimental research on Web apps
  • New and unusual app features or implementation techniques
  • Media delivery applications and infrastructure
  • Client-side libraries, toolkits, plug-ins
  • Server-side frameworks
  • Languages and language engineering advances relevant to Web app development
  • Deployment substrates and technologies (cloud computing, infrastructure as a service, testing as a service, etc.)

More details:


CFP online and in PDF format:

Just Launched: Amazon RDS Multi-AZ

Posted in Cloud Computing, Distributed Systems on May 18, 2010 by swaminathans

Today, my colleagues and I launched a major feature in Amazon RDS called Multi-AZ DBInstances. A Multi-AZ DB Instance performs synchronous replication of data across replicas located in different availability zones.

What benefits does Multi-AZ DB Instance provide?

–          Higher Durability: Your data is synchronously replicated across different availability zones. This guarantees higher levels of durability even in the wake of disk, instance, or availability zone failures.

–          Higher Availability: With Multi-AZ instances, we perform master-slave replication, and when we detect that the master is unavailable, we automatically fail over to the slave replica. This guarantees higher levels of availability. In contrast to MySQL’s asynchronous replication, since the data is synchronously replicated, you will experience no data loss when we fail over to the secondary replica.

You can learn more about all the resilient goodness of RDS Multi-AZ deployments here.

Building Scalable Social Gaming Platform using Clouds

Posted in Cloud Computing, Distributed Systems on February 18, 2010 by swaminathans

This week, I ran into many social gaming related posts. The first thing that surprised me was that social games are not played predominantly by teens; the average social gamer is around 40. This helps explain why these social games attract tens of millions of players every day.

When I read the high scalability article about How Zynga scaled Farmville to Harvest 75 Million Players a month, I was intrigued by their scaling challenges and requirements:

* Read-write ratio: Interactive games are write heavy, unlike traditional web applications. This seems intuitive when you consider that every move is recorded in a datastore.
* Users are disturbed by high latencies and by variability in latencies: note that I called out high latencies and latency variations as two different things. As we noted in the Dynamo paper, Amazon also cares about variability in latencies (percentiles) and builds its services to keep that variability constantly under control.
* Dealing with the latency and failure characteristics of external dependencies: these applications need to deal with external platforms like Facebook, which may or may not be available all the time.

I like the way Zynga approached this problem:

* For handling heavy writes, it looks like they have partitioned heavily. Seems reasonable; however, I’m curious to see how they handled “hot spots” (wherein the most active users are the ones constantly generating more data) and whether simple hash-based partitioning is good enough to spread the hot spots.
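To make the hot-spot concern concrete, here is a toy sketch (the partition count and key names are made up; I have no insight into Zynga’s actual scheme): hashing spreads distinct users evenly, but every write from one hyperactive user still lands on the same partition.

```python
import hashlib
from collections import Counter

def partition(key, n_partitions):
    """Simple hash partitioning, assuming writes are keyed by user id."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % n_partitions

# Ten thousand distinct users spread nicely over 8 partitions...
spread = Counter(partition(f"user-{i}", 8) for i in range(10_000))

# ...but 10,000 writes from a single very active farmer all hash to the
# same place: hashing alone cannot dissolve a per-key hot spot.
hot = Counter(partition("user-42", 8) for _ in range(10_000))

print(len(spread), "partitions used by the crowd")  # -> 8
print(len(hot), "partition used by the hot user")   # -> 1
```

This is why hot keys usually need something beyond hashing, such as splitting a hot user’s data further or caching in front of the hot partition.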

* For handling latency variations, they went for isolating each component and built graceful degradation at each layer. This is a common practice in building large scale systems. What I would be curious to see is how “gracefully” their datastore degrades, and also which datastore they are using.

* Finally, to deal with failures of external dependencies and still meet latency SLAs, it looks like they cache the responses of external dependencies.

These are very good lessons for building scalable systems.

As an interesting followup to this, I saw that RightScale (which is apparently helping Zynga run on top of AWS) is using its expertise to offer a social gaming platform for other aspiring Zyngas out there! It seems like an exciting internet industry at a really nascent stage.

Amazon RDS

Posted in Cloud Computing, Distributed Systems on October 27, 2009 by swaminathans

Today, my team launched Amazon RDS, a new AWS service that offers a relational database in the AWS Cloud. Amazon RDS provides a managed relational database in the cloud, and the service does the heavy lifting usually done by DBAs, such as provisioning, monitoring database health, managing automated backups, point-in-time recovery, creating/restoring snapshots, adding storage space on the fly, and changing the compute power of your database, all through simple Web service calls. Get started in a few minutes!

A Word about Persistence: One size does not fit all.

The scalability, reliability, and performance of a platform are heavily dependent on how it manages its data. At Amazon, we believe that when it comes to persistence, one size does not fit all: pick the right data management platform based on your application’s needs. For people who require a highly available and highly durable key-value store, we have Amazon S3. For people requiring raw disk in the cloud, we have Amazon EBS. For people who want to store and query structured data in the cloud and don’t want to actively manage its scalability, we have Amazon SimpleDB.

One of the biggest features our customers have been asking us for has been a managed relational database service in the cloud. The need for a relational database can arise for various reasons: programming familiarity, existing applications already built for relational databases, or the need for complex transactions and joins, which are relational databases’ forte. If you fall into this category, you will be happy with Amazon RDS.

For more details on Amazon RDS, take a look at:

GFS Evolution: Few thoughts…

Posted in Distributed Systems on August 20, 2009 by swaminathans

I finally caught up with ACM Queue’s interview of Sean Quinlan on GFS evolution. A small recap for folks who haven’t read the article: GFS, Google’s File System, has been used extensively in Google for more than 10 years (see the GFS paper).

In this interview, Sean talks about some of the shortcomings of the original GFS design and the challenges they faced when many more applications started using GFS. One of the biggest issues they ran into was having a single master in charge of an FS cluster: they ran up against how much metadata a single master could keep in its memory, which put an inherent limit on the number of files a GFS cluster could store.

Later in the interview, he talks about a new version of GFS they are building that uses a distributed master model, where they can add more machines to a GFS cluster and the machines will distribute the load of replication and chunking automatically. Clearly, this will handle more load and more files, and will provide higher availability and better performance.

A few things that intrigued me about this interview:

(i) GFS, which was originally built for batch file processing, has evolved to support more online applications like Gmail. This introduced new performance, availability, and durability requirements.

(ii) How changing application patterns have driven the design of GFS toward a more distributed model that meets the demanding availability and performance needs of online applications.

(iii) The experience with the “loose” (eventual) consistency model in GFS and how they handled different failure modes. It looks like their biggest issue was dealing with client failures, as clients were in charge of data reconciliation (which is one of the biggest challenges with eventual consistency). To avoid these issues, it looks like they are moving to a single-writer-per-file model, basically serializing all writes. This seems like a reasonable approach to provide a tighter bound on consistency (at the expense of possibly reduced “write availability”).

Overall, this was a very insightful interview for me, and it is interesting to see how similar some of these problems are to ones Amazon has seen and solved in the past.

I am really looking forward to reading a new SOSP/OSDI paper on GFS v2.