Reading List for Distributed Systems

Friends and colleagues who are interested in distributed systems quite often ask me, “Please tell me what are the top papers and books we need to read to learn more about distributed systems”. I used to write one-off emails giving a few pointers. Now that I have been asked enough times, I thought it would be a worthwhile exercise to put these together in a single post.

Please feel free to comment if you think there are more papers that need to be added.

Papers:

Distributed Systems Theory:

I believe these are some of the foundational theory papers you must read before you go on to build large scale systems.

Distributed Consensus:

Paxos is the gold standard of distributed consensus protocols. Amazingly, it is also one of the simplest. There is a long story behind why publication of the Paxos paper was delayed: Lamport wrote the original in a non-obvious style. Paxos became approachable to a wider audience after he wrote an abridged version of the paper.

The original paper, The Part-Time Parliament: http://research.microsoft.com/en-us/um/people/lamport/pubs/lamport-paxos.pdf

If that paper is too abstract for you, I recommend reading the lite version: Paxos Made Simple

I believe a theory is well understood only once you understand how others put it into practice. Google has documented its experience with Paxos:

Paxos Made Live describes Google’s experience implementing Paxos for Chubby, Google’s Paxos-based lock service.
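To give a feel for the protocol before you read the papers, here is a minimal sketch of single-decree Paxos. It is an illustration under heavy assumptions, not the algorithm as any of the papers present it: everything runs in one process, calls are synchronous, nothing fails, and the names Acceptor and propose are made up for this example.

```python
class Acceptor:
    """One acceptor's state (hypothetical in-memory model, no persistence)."""

    def __init__(self):
        self.promised = -1      # highest proposal number promised so far
        self.accepted_n = -1    # number of the proposal accepted so far, if any
        self.accepted_v = None  # value of that accepted proposal

    def prepare(self, n):
        # Phase 1: promise to ignore proposals numbered below n.
        if n > self.promised:
            self.promised = n
            return (True, self.accepted_n, self.accepted_v)
        return (False, None, None)

    def accept(self, n, value):
        # Phase 2: accept unless a higher-numbered promise was made.
        if n >= self.promised:
            self.promised = n
            self.accepted_n = n
            self.accepted_v = value
            return True
        return False


def propose(acceptors, n, value):
    """Run one proposal round; return the chosen value, or None if the round fails."""
    majority = len(acceptors) // 2 + 1

    replies = [a.prepare(n) for a in acceptors]
    granted = [(an, av) for ok, an, av in replies if ok]
    if len(granted) < majority:
        return None

    # If any acceptor already accepted a value, the proposer must adopt the
    # value of the highest-numbered accepted proposal instead of its own.
    prior_n, prior_v = max(granted, key=lambda p: p[0])
    if prior_n >= 0:
        value = prior_v

    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks >= majority else None


acceptors = [Acceptor() for _ in range(3)]
first = propose(acceptors, 1, "a")   # "a" is chosen
second = propose(acceptors, 2, "b")  # still "a": a later round cannot change it
stale = propose(acceptors, 1, "c")   # None: proposal number 1 is now too old
```

The second call is the interesting one: once a value has been chosen, a later proposer learns it in the prepare phase and re-proposes it, which is what keeps the decision stable.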

Consistency and Ordering in Distributed Systems:

Lamport, L. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7), pp. 558-565, 1978.

L. Lamport, R. Shostak, and M. Pease, The Byzantine Generals Problem, ACM Transactions on Programming Languages and Systems, July 1982, pages 382-401.
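The logical clock from the Time, Clocks paper is small enough to sketch in a few lines. This is a rough illustration, assuming a single counter per process and timestamps passed by hand instead of over a network; the class and method names are my own, not the paper's.

```python
class LamportClock:
    """Logical clock for one process (illustrative names, not from the paper)."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # Rule 1: increment the clock before each local event.
        self.time += 1
        return self.time

    def send(self):
        # Sending is an event; the returned timestamp travels with the message.
        return self.tick()

    def receive(self, msg_time):
        # Rule 2: on receipt, move past both the local clock and the
        # timestamp carried by the message.
        self.time = max(self.time, msg_time) + 1
        return self.time


a, b = LamportClock(), LamportClock()
stamp = a.send()               # a's clock becomes 1
received = b.receive(stamp)    # b's clock becomes 2, after the send
```

If one event happened before another, its timestamp is smaller. The converse does not hold: two timestamps alone do not tell you whether the events were causally related.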

Replication in Distributed Databases:

Demers et al., Epidemic algorithms for replicated database maintenance, PODC 1987.

Jerome H. Saltzer and M. Frans Kaashoek, Principles of Computer System Design, Chapter 10: Consistency.

Lindsay, B.G., et al., “Notes on Distributed Databases”, Research Report RJ2571(33471), IBM Research, July 1979.

Distributed Hash Tables:

Distributed hash tables (DHTs) are distributed systems that use hashing techniques for routing and membership. Most of them use consistent hashing as the foundation for routing.

Seminal work on consistent hashing techniques: Consistent Hashing and Random Trees. Application of consistent hashing to caching: Web caching with consistent hashing.
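A consistent hash ring can be sketched as follows. This is a minimal sketch, not the construction from either paper: it assumes SHA-256 as the hash function and 64 virtual points per node, both arbitrary choices, and the HashRing class is a name invented for this example.

```python
import bisect
import hashlib


class HashRing:
    """Consistent hash ring with virtual nodes (hypothetical helper class)."""

    def __init__(self, nodes=(), replicas=64):
        self.replicas = replicas  # virtual points per node (assumed value)
        self._keys = []           # sorted positions on the ring
        self._owner = {}          # position -> node name
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(text):
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, node):
        for i in range(self.replicas):
            pos = self._hash(f"{node}#{i}")
            if pos in self._owner:
                continue  # position already taken; keep the existing owner
            bisect.insort(self._keys, pos)
            self._owner[pos] = node

    def remove(self, node):
        for i in range(self.replicas):
            pos = self._hash(f"{node}#{i}")
            if self._owner.get(pos) == node:
                del self._owner[pos]
                del self._keys[bisect.bisect_left(self._keys, pos)]

    def lookup(self, key):
        if not self._keys:
            raise LookupError("ring is empty")
        # The first point clockwise from the key's hash owns the key;
        # the modulo wraps around the end of the ring.
        idx = bisect.bisect_right(self._keys, self._hash(key)) % len(self._keys)
        return self._owner[self._keys[idx]]


ring = HashRing(["A", "B", "C"])
keys = [f"key-{i}" for i in range(500)]
before = {k: ring.lookup(k) for k in keys}
ring.add("D")
moved = [k for k in keys if ring.lookup(k) != before[k]]
# Every key in `moved` now belongs to D; no key moves between A, B and C.
```

That last property is what makes the technique attractive for routing and membership: when a node joins or leaves, only the keys adjacent to its points on the ring change owner.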

In the decade that followed, researchers used consistent hashing techniques to build P2P systems and the routing schemes they rely on. Examples of such systems include Chord, Pastry and Tapestry.

Real-life Distributed Systems and Data Stores:

The following papers on distributed databases are seminal and are great examples of distributed systems. They are a must-read for anyone interested in building distributed systems:

Amazon Dynamo: Amazon’s own key-value store

Google File System: Google’s very own distributed file system.

Google BigTable: Google’s distributed datastore.

MapReduce: A seminal piece of work that has powered the Hadoop ecosystem.

Autopilot: Automatic Data Center Management Michael Isard April 2007.

Update [9/13/2011]: Alex Feinberg pointed out his reading list, which is equally impressive and, not surprisingly, has a great number of papers in common. Thanks Alex!

8 Responses to “Reading List for Distributed Systems”

  2. sridharvisu76 Says:

    Excellent list. Thanks a lot Swami

  3. What are the seminal papers in distributed systems? Why?…

    Note: this is slightly biased to the problems of scalable online processing systems (mostly data storage and messaging). As such I may be leaving out papers related to other (equally important) topics such as HPC, security in distributed systems and ma…

  4. I was wondering if there was any specific reason you didn’t include something on failure detectors? Maybe – Chandra and Toueg, “Unreliable failure detectors and reliable distributed systems”

    I don’t believe there can ever be a single definitive list and arguing over the contents is thus a waste of time. Understanding what someone else values or doesn’t seems more likely to lead to insight hence my question above.

    Thank you,

    Dan.

    • Agree with your comments on building a definitive list. This is by no means a complete list. However, as you point out, I think I missed a major body of work on failure detectors and gossip protocols. This is just an oversight, and the paper you’re pointing out is one of those seminal works. Will update the list soon.

      Thanks Dan!

  5. Good list.

  6. Hi Swami!
    I maintain a blog called Systems We Make (http://www.systemswemake.com/) where I try to curate research in distributed systems. Of late I’ve begun summarizing the papers in an effort to get a more intimate understanding of DS.
