One of the common problems faced by anyone building large-scale distributed systems is: how do you ensure that only one process (or worker, server, or agent) across a fleet of servers acts on a resource? Before jumping into how to solve this, let us look at the problem in more detail and see where it is applicable:
What are some examples where we need a single entity to act on a resource?
- Ensure only one server can write to a database or write to a file.
- Ensure that only one server can perform a particular action.
- Ensure that there is a single master that processes all writes (e.g., Google’s use of Chubby for leader election in GFS, MapReduce, etc.).
Coordination across distributed servers
All these scenarios need a simple way to coordinate the execution of processes and workflows so that only one entity is performing an action. This is one of the classic problems in distributed systems: you have to not only elect a leader across a set of distributed processes (leader election) but also detect its failure (failure detection), update the group membership accordingly, and elect a new leader.
A classic way to solve this problem is to use some form of system that can achieve consensus. This can be a single server that acts as the source of truth (not a good idea, for obvious reasons of availability and durability), a system that implements a consensus algorithm like Paxos, or a form of leader election built on top of a strongly consistent datastore (we will get to this later).
What is Paxos?
Paxos is the gold standard in consensus algorithms. It is a distributed consensus protocol (or a family of protocols, if you include all its derivatives) designed to reach agreement across a group of unreliable distributed processes. The reason this is hard is that the participating processes can be unreliable due to failures (e.g., server failure, datacenter failure, or network partitions) or malicious intent (Byzantine failures).
You can read more about Paxos in Leslie Lamport’s paper “Paxos Made Simple.”
How can I use Paxos to solve the problem of coordination?
At Amazon, we use consensus algorithms heavily in various parts of the company. A few ways we have solved this problem:
- Build a consensus library: This is a library that implements Paxos (or one of its derivatives) that the application can consume. Any decision that requires coordination, such as electing a leader to execute a workflow, is passed as a transaction to this library, and upon successful completion of the transaction the application proceeds. While this approach worked well in the beginning, we quickly realized that it required many of our application developers to be experts in distributed systems. They had to understand the operational characteristics of failure modes and how to debug them. If you think this is easy, I suggest reading this paper.
- Lock service: A lock service is a form of consensus service that converts the problem of reaching consensus into handing out locks. Basically, a distributed set of processes compete to acquire a lock, and whoever gets the lock holds the “baton” to execute a workflow, write to a database, etc. This is a very popular model. The advantage of this approach over a library is twofold:
- It is a service rather than a product, so application developers do not have to worry about operations, debugging various failure modes, etc.
- The abstraction of locks is much simpler than trying to model the problem as consensus or transactions.
Lock services have been a widely popular means of solving this problem: Amazon has built its own lock service for internal use, Google built Chubby, and there are open-source systems like ZooKeeper.
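From an application’s point of view, the lock abstraction boils down to “try to acquire, run the critical section if you won, release when done.” The sketch below illustrates that pattern with a toy, single-process stand-in for the lock service; the class and names are illustrative, and a real lock service such as Chubby or ZooKeeper is a replicated service backed by a consensus protocol rather than an in-memory map.

```python
import threading

class ToyLockService:
    """A toy, in-process stand-in for a lock service. A real lock
    service (e.g., Chubby or ZooKeeper) keeps this state replicated
    across servers via a consensus protocol."""
    def __init__(self):
        self._mutex = threading.Lock()
        self._holders = {}  # lock name -> owner

    def try_acquire(self, name, owner):
        # Atomically grant the lock only if nobody holds it.
        with self._mutex:
            if name not in self._holders:
                self._holders[name] = owner
                return True
            return False

    def release(self, name, owner):
        with self._mutex:
            if self._holders.get(name) == owner:
                del self._holders[name]

service = ToyLockService()
ran_workflow = []

def worker(worker_id):
    # Each worker races to grab the "baton"; only one succeeds.
    if service.try_acquire("nightly-billing", worker_id):
        ran_workflow.append(worker_id)  # the critical section
        # (the winner releases the lock once the workflow is done)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert len(ran_workflow) == 1  # exactly one worker ran the workflow
```

However many workers race, the atomic check-and-set inside `try_acquire` guarantees that exactly one of them enters the critical section.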
Using a strongly consistent datastore to build lock service
One of the common tricks we have seen used in the past is to emulate lock service behavior (or APIs) using a strongly consistent datastore. Basically, a strongly consistent datastore that is replicated, highly available, and durable (e.g., DynamoDB) can be used to persist the information on who is holding the lock, for how long, etc.
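A minimal sketch of that idea, using an in-memory stand-in for the datastore; the schema and names are illustrative, and with DynamoDB the atomic check-and-set would be a conditional `PutItem` (e.g., with an `attribute_not_exists` condition expression):

```python
import time

class FakeConsistentStore:
    """In-memory stand-in for a strongly consistent datastore.
    put_if_absent mimics a conditional write: it either installs
    the item atomically or fails because the key already exists."""
    def __init__(self):
        self._items = {}

    def put_if_absent(self, key, value):
        if key in self._items:
            return False
        self._items[key] = value
        return True

def acquire_lock(store, lock_name, owner, lease_seconds=30):
    # Persist who holds the lock and until when (illustrative schema).
    record = {"owner": owner, "expires_at": time.time() + lease_seconds}
    return store.put_if_absent(lock_name, record)

store = FakeConsistentStore()
print(acquire_lock(store, "primary-writer", "host-a"))  # True: host-a wins
print(acquire_lock(store, "primary-writer", "host-b"))  # False: already held
```

Because the datastore is strongly consistent, two concurrent conditional writes cannot both succeed, which is exactly the property a lock grant needs.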
One issue this solution does not solve by itself is failure detection. Traditionally, a lock service also provides a simple client library that lock holders use to heartbeat with the service. This ensures that when a lock holder is partitioned away, the lock service revokes the lock and the library notifies the holder so that it aborts its critical section.
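A common substitute for heartbeating against a service is a lease: the lock record carries an expiry time, the holder must renew it periodically, and other clients treat an expired lease as revoked. Below is a minimal sketch; it assumes a single time source (real deployments must tolerate clock skew) and it ignores the need for fencing to guard against a stale holder resurfacing after its lease expired.

```python
import time

class Lease:
    """A time-bounded claim on a lock. The holder keeps it alive by
    calling heartbeat(); if renewals stop (crash, partition), the
    lease expires and other clients may take over the lock."""
    def __init__(self, owner, duration):
        self.owner = owner
        self.duration = duration
        self.renewed_at = time.monotonic()

    def heartbeat(self):
        # Called periodically by the holder's client library.
        self.renewed_at = time.monotonic()

    def expired(self):
        return time.monotonic() - self.renewed_at > self.duration

lease = Lease(owner="host-a", duration=0.05)  # 50 ms lease for the demo
assert not lease.expired()
time.sleep(0.1)           # host-a stops heartbeating (partitioned away)
assert lease.expired()    # other clients may now treat the lock as free
lease.heartbeat()
assert not lease.expired()
```

The lease duration is the trade-off knob: shorter leases detect failures faster but force more frequent renewals and are more sensitive to transient delays.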
When using a datastore to build lock service APIs, you have to solve this problem yourself. On this front, I came across an interesting post on how Ryan built a lock service API on DynamoDB. He has outlined the problem, the solution, and its limitations quite well, and it is a very interesting read.
Do you need lock service at all?
An obvious question to ask yourself before venturing into any of the approaches above is: do you need a lock service at all? Sometimes the answer is that you do not, because executing the same workflow twice may not be an issue. In many other cases, though, you do need such an abstraction. So please make this decision carefully.
Interested in building large-scale systems like a distributed lock service? If so, email me: swami at amazon dot com.