AWS re:Invent

Posted in Uncategorized on November 10, 2014 by swaminathans


This week is the busiest week for many of us in AWS – this is the re:Invent time. Our annual cloud conference that will be packed with series of announcements and some of the best talks.  I just wanted to highlight some of the talks:

  • Keynotes: Needless to say, Wednesday and Thursday keynotes by Andy and Werner will be filled with customer highlights, announcements etc.
  • Spotlight talks: Spotlight talks are an exciting set of talks. These talks are meant to be technically deep and informative for attendees. Some of the speakers are our Distinguished engineers like James Hamilton, Allan Vermuellen (Creator of S3). A few spotlight talks to watch out (note all are live streamed, please register here):
    • SPOT 301: James Hamilton’s talk on AWS Pace of Innovation. James is an awesome speaker and he is going to talk some of the recent innovations in Networking and databases.
    • SPOT 302:  (Shameless plug) This is a talk I am giving with Allan Vermuellen (VP/Distinguished Engineer). Allan built Amazon S3 and is a good friend. We will be talking about core distributed systems primitives that form the basis for various AWS services. Al and I have worked together now for 8 years and built various core distributed systems together and will be fun to talk with him on distributed systems – if you are interested in large scale systems stop by!
    • SPOT 303: This is a conversation talk where we talk to some of our top AWS customers about their best database practices. If you want to hear from how some of our top AWS customers are building stateful services on AWS, their best practices on how they choose databases, their reliability, scalability, and operational best practices. I expect this to be a good learning experience for both experts and beginning customers.
    • SPOT 305: This is a talk I am very excited to see where Khawaja (who is an awesome speaker) and Marvin will be talking through building event driven applications on AWS. They have a cool demo as well!

There are still too many talks I couldn’t list here – otherwise this post will run for multiple pages :). Take a look at detailed agenda here.

If you are in reinvent and want to chat about AWS, Databases or Big data, or just want to meet my team that builds DynamoDB or ElastiCache to just say “Hi”  – stop by the AWS Databases booth or please email me:

Amazon’s First Annual Ph.D. Symposium: Program

Posted in Cloud Computing on November 21, 2013 by swaminathans

As I had called out earlier, Amazon is conducting its first annual Ph.D. symposium where we invite Ph.D. students around the world to come and present their research topics in the form of talks and posters on three major topics:  Cloud computing, Operations Research and Data Mining.

We had a great list of submissions from Ph.D. candidates  based out of top schools in USA, Canada and Europe. This made our selection process hard and competitive.Our acceptance rate was in the single digit percentage which is extremely low for a research symposium carried out for the first team.

This enabled us to generate a great program that stretched over a single day with three parallel tracks: Cloud Computing or AWS, Machine Learning/Computer Vision and Operations Research. Credit goes to Jay Little from Amazon (and his team) for putting together such a great program for its symposium.

I wanted to highlight a few talks that were selected in these tracks.

Cloud Computing 

Justin Debrabant, Brown University.

Beyond Main Memory: Anti-Caching in Main Memory Database Systems.

The traditional wisdom for building disk-based relational database management systems (DBMS) is to organize data in heavily-encoded blocks stored on disk, with a main memory block cache. In order to improve performance given high disk latency, these systems use a multi-threaded architecture with dynamic record-level locking that allows multiple transactions to access the database at the same time. Previous research has shown that this results in substantial over- head for on-line transaction processing (OLTP) applications.

The next generation DBMSs seek to overcome these limitations with architecture based on main memory resident data. To overcome the restriction that all data fit in main memory, we propose a new technique, called anti-caching, where cold data is moved to disk in a transactionally-safe manner as the database grows in size. Because data initially resides in memory, an anti-caching architecture reverses the traditional storage hierarchy of disk-based systems. Main memory is now the primary storage device.

We implemented a prototype of our anti-caching proposal in a high-performance, main memory OLTP DBMS and performed a series of experiments across a range of database sizes, workload skews, and read/write mixes. We compared its performance with an open-source, disk-based DBMS optionally fronted by a distributed main memory cache. Our results show that for higher skewed workloads the anti-caching architecture has a performance advantage over either of the other architectures tested of up to 9× for a data size 8× larger than memory.

In addition, this talk will encompass ongoing work on adapting OLTP systems to use non-volatile RAM (NVRAM). With latency profiles inline with current DRAM speeds and persistent writes, NVRAM technologies have potential to significantly impact current OLTP architectures. We will discuss several possible architectures for NVRAM-based systems including an extension to the anti-caching architecture originally designed with disk as a backend. Finally, we will present early benchmark results attained using a NVM emulator constructed by our collaborators at Intel.

Rahul Potharaju, Purdue University

Network failures are a significant contributor to system downtime and service unavailability. To track troubleshooting at various levels (e.g., network, server, software), operators typically deploy trouble ticket systems which log all the steps from opening a ticket (e.g., customer complaint) till its resolution. Unfortunately, while tickets carry valuable information for network management, analyzing them to do problem inference is extremely difficult — fixed fields are often inaccurate or incomplete, and the free-form text is mostly written in natural language. To analyze this free-form text, we have built NetSieve [NSDI’13], a system that aims to automatically analyze ticket text written in natural language to infer the problem symptoms, troubleshooting activities and resolution actions.

In this talk, I will present the design philosophy behind NetSieve and the lessons we learnt by applying our system on several large datasets. The goal of my talk is to motivate the research community to focus on these unexplored yet important datasets as they contain actionable data to understand and improve system design. For instance, the lessons learnt by applying these techniques to vulnerability databases can be applied towards improving the security of deployed systems, understanding the common pitfalls of writing secure programs etc. Similarly, lessons learnt from customer review repositories (e.g., Amazon) can be applied towards understanding problems faced by a majority of customers and improving future product designs.

Shrutarshi Basu, Cornell University

Merlin : A Unified Framework for Network Management

Merlin is a new network, high-level management framework. With Merlin, administrators express network policy using programs in a declarative language based on logical predicates and regular expressions. The Merlin compiler automatically partitions these programs into components that can be placed on a variety of devices including switches, middleboxes, and end hosts. It uses a constraint solver based on parameterizable heuristics to determine allocations of network-wide resources such as paths and bandwidth. To facilitate management of federated networks, Merlin provides mechanisms for delegating management of sub-policies to tenants, along with tools for verifying that delegated sub-policies do not violate global constraints. Overall, Merlin simplifies the task of network administration by providing high-level abstractions for specifying network policy and scalable infrastructure for enforcing them.

Machine Learning/Computer Vision/Natural Language Processing:

Bin Zhao, CMU

With the widespread availability of low-cost devices capable of photo shooting and high-volume video record- ing, we are facing explosion of both image and video data. The sheer volume of such visual data poses both challenges and opportunities in machine learning and computer vision research.

In image classification, most of previous research has focused on small to medium-scale data sets, containing objects from dozens of categories. However, we could easily access images spreading thousands of categories. Unfortunately, despite the well-known advantages and recent advancements of multi-class classification tech- niques in machine learning, complexity concerns have driven most research on such super large-scale data set back to simple methods such as nearest neighbor search, one-vs-one or one-vs-rest approach. However, facing image classification problem with such huge task space, it is no surprise that these classical algorithms, often favored for their simplicity, will be brought to their knees not only because of the training time and storage cost they incur, but also because of the conceptual awkwardness of such algorithms in massive multi-class paradigms. Therefore, it is our goal to directly address the bigness of image data, not only the large number of training images and high-dimensional image features, but also the large task space. Specifically, we present algorithms capable of efficiently and effectively training classifiers that could differentiate tens of thousands of image classes.

Similar to images, one of the major difficulties in video analysis is also the huge amount of data, in the sense that videos could be hours long or even endless. However, it is often true that only a small portion of video contains important information. Consequently, algorithms that could automatically detect unusual events within streaming or archival video would significantly improve the efficiency of video analysis and save valuable human attention for only the most salient contents. Moreover, given lengthy recorded videos, such as those captured by digital cameras on mobile phones, or surveillance cameras, most users do not have the time or energy to edit the video such that only the most salient and interesting part of the original video is kept. To this end, we also develop algorithm for automatic video summarization.

Therefore, the goal of our research can be summarized as follows: We aim to design machine learning algorithms to automatically analyze and understand large-scale image and video data. Specifically, we design algorithms to address the bigness in image categorization – not only in the form of large number of data points and/or high-dimensional features, but also the large task space. We aim to scale our algorithm to image collections with tens of thousands of classes. We also propose algorithm to address the bigness of video stream, with hours in temporal length, or even endless, and automatically distill such videos to identify interesting events and summarize its contents.

Ying Liu, MIT:

Gaussian Graphical Models (GGMs) or Gauss Markov random fields are widely used in many applications, and the trade-off between the modeling capacity and the efficiency of learning and inference has been an important research problem. In this paper, we study the family of GGMs with small feedback vertex sets (FVSs), where an FVS is a set of nodes whose removal breaks all the cycles. Exact inference such as computing the marginal distributions and the partition function has complexity $O(k^{2}n)$  using message-passing algorithms, where k  is the size of the FVS, and n  is the total number of nodes. We propose efficient structure learning algorithms for two cases: 1) All nodes are observed, which is useful in modeling social or flight networks where the FVS nodes often correspond to a small number of high-degree nodes, or hubs, while the rest of the networks is modeled by a tree. Regardless of the maximum degree, without knowing the full graph structure, we can exactly compute the maximum likelihood estimate in $O(kn^2+n^2\log n)$  if the FVS is known or in polynomial time if the FVS is unknown but has bounded size. 2) The FVS nodes are latent variables, where structure learning is equivalent to decomposing a inverse covariance matrix (exactly or approximately) into the sum of a tree-structured matrix and a low-rank matrix. By incorporating efficient inference into the learning steps, we can obtain a learning algorithm using alternating low-rank correction with complexity $O(kn^{2}+n^{2}\log n)$  per iteration. We also perform experiments using both synthetic data as well as real data of flight delays to demonstrate the modeling capacity with FVSs of various sizes.

Hossein Azari Soufiani, Harvard:

Rank data can be modeled using Random Utility Theory by assigning a real-valued score to each alternative from a parameterized distribution, and then ranking the alternatives according to scores. Our studies have enabled the estimation of parameters for random utility models (RUMs) with exponential families as noise. We also extend RUM models by defining an RUM mixture model with multiple preference types. We identify conditions that enable fast inference through MC-EM, providing identifiability of the models and approximate log-concavity of the likelihood function.

Operations Research

Aly Megahed, Georgia Tech

Topics in Supply Chain Planning in Deterministic and Stochastic Environments.

The problem of supply chain planning became vital to the business practices of most manufacturing and service organizations in today’s global marketplace. In this talk, we present three unified supply chain planning problems.

For the first problem, motivated by the planning challenges faced by of one of the world’s largest manufacturers of wind turbines, we present a comprehensive tactical supply chain planning model for wind turbine manufacturing. The model is multi-period, multi-echelon, and multi-commodity. It explicitly incorporates delay dependent backorder penalties with any general cost structure for the first time. We present numerical results that show the significant impact of this capability. We also show the effect of backorder penalties on different supply chain costs.

We extend the first problem to include uncertainty, and develop a two-stage stochastic programming model for this second problem. The considered supply uncertainty combines random yield and stochastic lead times, and is thus the most general form of such uncertainty to date. Theoretical and computational results regarding the impact of supplier uncertainty/unreliability on procurement decisions are derived.

In our third problem, we extend the first problem by including procurement with time-aggregated quantity discounts, where the discounts are given on aggregated order quantities while the orders are placed on periodic basis (e.g., total annual orders that were placed on a monthly basis). We develop a Mixed Integer Programming (MIP) model for the problem. The model is computationally very challenging to solve. Leading commercial solvers such as CPLEX take hours to generate good feasible solutions for many tested instances. We develop a pre-solve algorithm that reduces the number of binary variables and a MIP based loval search heuristic that quickly generates good feasible solutions. Preliminary results show that our solution methodology is promising.

Rattachut Tangsucheeva, Penn State

Why Firms Go Bankrupt? A Dynamics Perspective

Managing modern supply chains involves dealing with complex dynamics of materials, information, and cash flows around the globe, and is a key determinant of business success today. One of the well-recognized challenges in supply chain management is the inventory bullwhip effect in which demand forecast errors are amplified as it propagates upstream from a retailer, in part because of lags and errors in information flows. Adverse effects of such bullwhip effect include excessive inventory and wasteful swings in manufacturing production. In this research we postulate that there can be a corresponding bullwhip in the cash flow across the supply chain and it can be one of many reasons why firms run out of cash and become bankrupt. Furthermore, we explore ways to predict its impact by using a cash flow forecasting model and identify a potential strategy to engineer a solution for controlling working capital by using portfolio optimization theory.

Sarah Ahmadian, Waterloo

Mobile Facility Location

I will talk about the Mobile Facility Location. This problem is motivated by a model where a manufacturer wants to set up distribution centers so that it can send products to these centers (in one shipment) and the products needed at each retail center can be shipped from the closest distribution center.

Amazon’s 1st Annual Ph.D. Symposium

Posted in Uncategorized on October 2, 2013 by swaminathans

The Amazon’s University Programs is pleased to announce the first annual PhD Research Symposium! Within Amazon there are many applications for research science including Operations Research, Distributed Systems, Storage, Machine Learning and Computer Vision.

The Symposium will be held at the Amazon headquarters in Seattle, WA on Friday, November 22nd. The conference will feature talks by Amazon researchers and PhD students and a poster session for students to exhibit their research. The day will conclude with an evening mixer and dinner. The audience will include Amazon Researchers and members of Amazon’s Principal Engineering Community.

Amazon groups presenting will include:

·         Machine Learning

·         Supply Chain

·         Transaction Risk Management

·         Digital Products

·         Amazon Web Services

The symposium is currently accepting applications from students wishing to attend. Travel and hotel expenses will be covered. The deadline for submission is October 9th.

You are invited to apply to attend the conference and also provide a 60 minute talk or a poster for the afternoon poster session.

Application Requirements:

Applications should be no longer than 300 words

·         A brief description of why you would like to attend the symposium (no longer than 300 words)

·         If you would like to participate in delivering a lecture (60 minutes in length),  include a brief abstract of the lecture topic

·         If you would like to display a poster, include an abstract of the lecture topic

·         A current CV no longer than 6 pages including publication history and links to papers if available

Applications should be sent to Please submit your application in the form of a pdf or MS Word document. Please also add the following fields to the subject line of your email: (1) “Application for PhD Symposium” (2) Your current field of study, (3) Your current School.

Accepted attendees will receive notice by October 18th.

Distributed Locking: When to use it? How?

Posted in Distributed Systems on May 7, 2013 by swaminathans

One of the common problems faced by anyone building large scale distributed systems:  how to ensure that only one process (replace this with: worker, server or agent) across a fleet of servers acts on a resource? Before jumping in to how to solve this, let us take a look at the problem more in detail and see where it is applicable:

What are the examples of where we need a single entity to act on a resource?

  • Ensure only one server can write to a database or write to a file.
  • Ensure that only one server can perform a particular action.
  • Ensure that there is a single master that processes all writes (e.g., Google’s use of Chubby for leader election in GFS, map reduce etc).

Coordination across distributed servers

All these scenarios need a simple way to coordinate execution of process and workflows to ensure that only one entity is performing an action. This is one of the classic problems in distributed system as you have to not only elect a leader across a set of distributed processes (leader election) but also detect its failure (failure detection) and then change the appropriate group membership and elect a new leader.

A classic way to solve this problem is to use some form of system that can bring consensus. This can be a single server which can act as the source of truth (not a good idea for obvious reasons of availability or durability), a system that implements a consensus algorithm like Paxos or build a form of leader election on top of a strongly consistent datastore (will get to this later).

What is Paxos?

Paxos is the gold standard in consensus algorithms. It is a distributed consensus protocol (or a family of protocols if you include all its derivatives)  designed to reach an agreement across a family of unreliable distributed processes. The reason this is hard is because the participating processes can be unreliable due to failures (e.g., server failure, datacenter failure or network partitions) or malicious intent (byzantine failures).

You can read more about Paxos here:

How can I use Paxos to solve the problem of coordination?

In Amazon, we use consensus algorithms heavily in various parts of the company. A few ways we have solved the problem of solving this:

  • Build a consensus library: This is a library that implements Paxos (or one of its derivatives) that the application can consume. Any decision that requires coordination such as election a leader to execute a workflow is passed as a transaction in this library and upon successful completion of the transaction, they proceed. While this approach worked well in the beginning, we quickly realized that this required many of our application developers to be experts in distributed systems. They had to understand the operational characteristics of failure modes and how to debug them. If you think this is easy, suggest reading this paper.
  • Lock service: Lock service is a form of consensus service that converts the problem of reaching consensus to handing out locks. Basically, a distributed set of processes compete to acquire a lock and whoever gets the lock has the “baton” to execute a workflow, write to a database etc. This is a very popular model. The advantage with this approach compared to library is twofold:
  1. It is a service versus product. So, application developers do not have to worry about dealing with operations, various failure modes debugging etc.
  2. Abstraction of locks is a lot more simpler than trying to model it is a consensus or transaction.

Lock service has been a widely popular means to solve this problem. Amazon has built its own lock service for internal use, Google building Chubby, and open source systems like Zookeeper.

Using a strongly consistent datastore to build lock service

One of the common tricks we have seen being used in the past is to emulate a lock service behavior (or APIs) using a strongly consistent datastore. Basically, a strongly consistent datastore that is replicated, highly available and durable (e.g., DynamoDB) can be used to persist the information on who is holding it, for how long etc.

Now the issue that this solution does not solve is failure detection. Traditionally lock service also provides a simple library that the clients can use to heartbeat with the service. This ensures that when a lock holder is partitioned away, the lock service revokes the lock and the library notifies the lock holder to abort its critical section.

In using a datastore to build lock service APIs, you have to solve this solution. In this front, I came across an interesting post on how Ryan has built a lock service API on DynamoDB. He has outlined the problem, solution and limitations quite well and very interesting read.

Do you need lock service at all?

An obvious question to ask yourself before venturing into any of the approaches above is: Do you need a lock service after all? Sometimes the answer is you don’t need to as executing the same workflow twice may not be an issue. Whereas there are many other cases where you need such an abstraction. So, please make this decision carefully.

Interested in building large scale systems like a distributed lock service, if so email me: swami at amazon dot com.

A quick recap..

Posted in Uncategorized on April 19, 2013 by swaminathans

I got a few emails from friends asking why I am not blogging anymore. So, I owe folks an explanation:).

Yes, it has been 10 months since I blogged. Time flies given how busy things are in Amazon. Moreover, given how bad my back got injured, I had to take a month off to recover hence completely had to take my hands off the keyboard for a while. 

I wanted to recap a few significant features my team launched in the past few months:

DynamoDB Local secondary indexes: Starting today, DynamoDB provides our customers the ability to query on non-primary key attributes (within a hash bucket). We call this feature, local secondary indices. Folks who have built partitioned datastore on relational databases using hash based partitioning will be familiar with the pattern of running scoped complex queries within a hash partition. Local secondary indexes let you do just that. We have seen a huge amount of developers migrating their database from relational databases to DynamoDB and this feature will help this migration lot sooner.

DynamoDB Price reduction: We dropped a 3 datacenter replicated SSD-backed storage device by 75% to $0.25 per GB-month and the throughput pricing by 85%. I am excited to be part of the team that has been working hard over the past year to improve storage density and bring down the costs of our underlying hardware platform. We have also made significant improvements to our software by optimizing our storage engine, replication system and various other internal components. 


Auto-discovery in Elasticache:  A common challenge in distributed systems is discovery of new members. For instance, in the Dynamo paper, I talked about using gossip based protocols for discovery. However, one of the things we found when we talked to our customers using memcache (in our Elasticache service) is that the problem of auto discovery is not solved for caching layers. For instance, a cache cluster membership is hardcoded in the form of config files in client boxes. When a cache node is added or removed, the onus is on customers to update the membership configuration files. To get around this manual step, a few folks resorted to running some memcache proxy like moxi which introduced possibly additional latency and bottlenecks. Instead, systems like Dynamo (see my earlier paper) use client libraries that “learn” the latest membership and appropriately reroute. We employed a similar technique in Elasticache where we built a smart membership solution that periodically exchanges this information with its cache clients. This enables customers to automatically “discover” change in cache membership. 

Long Polling with SQS: The Amazon Simple Queue Service (SQS) is a highly scalable, reliable and elastic queuing service that ‘just works’. It has been widely used by our customers and has been a critical building block for building asynchronous distributed systems. With long polling, SQS instead waits for a message to become available and sends it to the client if the message arrives within a customer-defined time period.

I am really proud of what my team has done in the past few months and I am really excited given what we have coming in the next few months. These are exciting times!




Achieving Rapid Response Times in Large Scale Systems

Posted in Distributed Systems on June 19, 2012 by swaminathans

[Editor’s note: I know I am writing a blog after ages. Hoping to be more regular going forward and maintain a regular schedule]

Recently, I came across this slide deck from Google’s Jeff Dean about achieving rapid response times in large scale systems. The overall summary of the talk is to describe various load management and balancing techniques Google uses to decrease their response times both in average and high percentile.

Some salient points of this talk:

– Increased fan out (aka larger scale) makes it more challenging to keep the latencies low under high percentiles as more things can fail or be slow. Example: You are trying to serve a request by querying across multiple partitions and you are as fast as the slowest partition. This becomes a serious problem in the high percentile (99th or 99.9th).

What intrigued me about this talk and its conclusion which I will get to below is the emphasis on optimizing for TP99 latency. As we called out in the original Amazon dynamo paper, engineering systems to optimize for TP99 latency is hard and require careful optimization at various level such as good engineering practices (prioritized queues, careful management of background queues). Here Jeff’s talks about similar practices and a few more techniques they used to address large scale variability:

  • General good engineering practice to build individual components right (priority queues, better management of background tasks)
  • Better load management at the server/service side: by spreading the load across servers, overload protection and migration of hot spots
  • Better load balancing at the client side by: sending backup requests to another replica dynamically when you find the original replica is slow (interesting question is when to trigger the backup request but one can imagine picking a reasonable value like TP50), better cancellation schemes of backup requests, and finally giving degraded results.
  • Mentioned that backup requests beyond 2 provide diminished returns. I think this totally makes sense and probably due to the power of two choices.

Overall, excited to see the great talk.

Amazon DynamoDB

Posted in Uncategorized on February 16, 2012 by swaminathans

I’m four weeks late to blog about this as I was busy with launching this service but wanted to cover this in my blog:

I’m super excited to tell folks that we launched Amazon DynamoDB, a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability. We have lots of interesting materials and getting started guides. A few things I encourage to take a look at are:

Detail page:

Introduction video:

Werner’s blog:

Hadoop/EMR integration HOWTOs:

I don’t want to rehash things stated in these links but wanted to make sure I provide a simple link.

We are also talking about Amazon DynamoDB in a few events in the next couple of weeks:

1. NoSQL meetup in Seattle on Feb 22

2. Strata 2012.

Hope to see many of you in these venues!



Synchronous vs. Asynchronous replication strategy: Which one is better?

Posted in Uncategorized on September 22, 2011 by swaminathans

If you’re a developer who is looking to pick a highly available database or datastore for your application, you’ve to think of various parameters to make the right choice for your application. Some of the questions you need to hash out include “What query model do I want and what does the datastore offer?”, “What is its consistency guarantees?”, “What is its reliability and availability SLA?” and “What are its scaling dimensions?”. While the process of picking the right datastore warrants a post on its own, I wanted to focus on one specific parameter of this question for this post.

The focus of this post is: “How does a datastore replicate data? Is replication done lazily or synchronously?

One might ask: Why should an application developer (or any datastore consumer) care how a datastore replicates data? The reality is that how a system replicates data has great impact on the datastore’s durability and availability.

Traditionally there are two kinds of replication techniques: synchronous replication and asynchronous replication. A datastore using synchronous replication does not acknowledge a write call until it is acknowledged by its replicas (usually a majority of the replicas).  Examples of pratical datastore that does synchronous replication include Amazon Dynamo (and many of its variants like Cassandra, Voldermort and Riak), Amazon SimpleDB, Amazon RDS Multi-AZ MySQL and Google AppEngine’s High Replication Datastore. On the other hand, a datastore that replicates data asynchronously propagates data to its replica, i.e., when one of the replica gets a write call it will acknowledge to the client right away and in the background it will propagate the writes to its replicas. Examples of systems that use this model include MySQL replication, MongoDB and Google AppEngine’s Master-slave datastore.

Which one is better? Well, it really depends on what you’re planning to use it for. Datastores that synchronously replicates data ensures provides higher durability and availability. Why? Consider the failure scenario where the replica that acknowledged that even if the replica that took a write dies as soon as it acknowledged to the clients, the data is not lost. Clients can read their last acknowledged writes by accessing from the other available replicas. However, in an asynchronously replicated systems, it is not possible for clients to know if their writes were propagated to the other replicas before the failure of the replica that coordinated the writes. If the replica has failed permanently (due to a permanent node failure or so), then you can experience a data loss where you lost the latest updates (durability issue). If the replica has failed temporarily, then clients can still access an old version of the data from the other available replicas. However, they will not be able to perform any writes unless they are willing to deal with merging conflicting updates when the replica comes back online.

So, why would anyone run the risk of picking a lower durable datastore and pick an asynchronously replicated datastore? The reason is because asynchronously replicated datastore provides reduced write latency (since a write is acknowledged locally by a single before being written to other replicas, the write latency is lower) and better read scaling (as adding more replicas does not impact your latency like in the case of synchronously replicated systems).

So, in a nutshell, the rule of thumb I tend to use in picking between these two techniques is: If your application requires high durability and availability, pick a synchronously replicated datastore solution. If your application is OK with losing a small % of writes (e.g., a soft-state or caching system) due to a failed replica (usually the master replica), then you are OK with asynchronous replication system. A great example is how MySQL users use read replicas to do read scaling. Another great example for using asynchronous replication is to use your asynchronous replica as the backup copy (used for disaster recovery).

Reading List for Distributed Systems

Posted in Distributed Systems on September 7, 2011 by swaminathans

I quite often get asked by friends, colleagues who are interested in learning about distributed systems saying “Please tell me what are the top papers and books we need to read to learn more about distributed systems”. I used to write one off emails giving a few pointers. Now that, I’ve asked enough I thought it is a worthwhile exercise to put these together in a single post.

Please feel free to comment if you think there are more posts that needs to be added.


Distributed systems Theory:

I believe these are some of the foundational theory papers you must read before you go on to build large scale systems.

Distributed Consensus:

Paxos is the gold standard of distributed consensus protocols. Amazingly, this is the simplest of all protocols. There is a huge story behind how Paxos paper delayed getting published as the original paper was written in a non-obvious fashion by Lamport J. Paxos was more approachable for general masses after he wrote an abridged version of the paper.

Original part-time parliament:

If that paper is too abstract for you, I recommend reading the lite version: Paxos made Simple

I always believe a theory is well understood only if you understand how others put in practice. Google has documented their experience with Paxos here:

Paxos made live talks about Google’s experience with using Paxos and Chubby, Google’s Paxos based lock service.

Consistency and Ordering in Distributed Systems:

Lamport, L. Time, clocks, and the ordering of events in a distributed system. ACM Communications, 21(7), pp. 558-565, 1978.

L. Lamport, R. Shostak, and M. Pease, The Byzantine Generals Problem, ACM Transactions on Programming Languages and Systems, July 1982, pages 382-401.

Replication in Distributed Databases:

Demers et al., Epidemic algorithms for replicated database maintenance, PODC 1987.

Jerome H. Saltzer and M. Frans Kaashoek, Principles of Computer System Design, Chapter 10: Consistency.

Lindsay, B.G., et. al., “Notes on Distributed Databases”, Research Report RJ2571(33471), IBM Research, July 1979

Distributed Hash Tables:

Distributed Hash Tables are distributed systems that use hashing techniques for routing and membership.  Most of them use consistent hashing as the foundation for routing.

Seminal work on consistent hashing techniques: Consistent Hashing and Random Trees. Application of consistent hashing to caching: Web caching with consistent hashing.

What followed last decade was researchers using consistent hashing techniques to build P2P systems and routing techniques using them. Examples of such systems include Chord, Pastry and Tapestry.

Real life Distributed systems and data Stores:

The following distributed databases papers are seminal and great examples of distributed systems. A must read for people interested in building distributed systems:

Amazon Dynamo: Amazon’s own key-value store

Google File System: Google’s very own distributed file system.

Google BigTable: Google’s distributed datastore.

Map-Reduce: A seminal piece of work that has powered the Hadoop ecosystem

Autopilot: Automatic Data Center Management Michael Isard April 2007.

Update [ 9/13/2011]: Alex Feinberg pointed out his reading list, which is equally impressive as well and not so surprisingly have a great number of papers in common. Thanks Alex!

Filtering effects in cache hierarchies

Posted in Uncategorized on August 31, 2011 by swaminathans

A friend of mine was building a multi-tiered application with a cache fronting a database and was investigating why his database memory hit ratio was not as high as he expected it to be and how to optimize it. This matters only when you are paranoid about latency percentiles at TP99 or TP99.9s. One of the things, he found after a deeper investigation is what he called as “cherry picking” done by the caching tier that the next level did not see requests with high cache locality. A similar observation was made in Web cache research and was referred to filtering effects.

One of the important works done in this area is by Carey Williamson on “On Filter Effects in Web cache hierarchies”. This paper uses trace driven simulation to look into effects of what happens to workload locality when it travels through various cache hierarchies. Important results to call out:
– Proxy caches filter out the most popular documents from a workload. The filtering effect typically changes the Zipf-like document popularity profile of the input stream into a two-part piecewise-linear document popularity profile for the output stream. The upper leftmost part of the plot is significantly flattened.
– The percentage of unique documents referenced and the percentage of onetimer documents tend to increase as the request stream passes through a cache.
– Among the caches in a caching hierarchy, the first-level cache has the most pronounced filtering effect on the workload.

Even though this is a 10 year old work, highly recommended read for folks trying to understand the effects on caching on their systems.