Reading List for Distributed Systems
Friends and colleagues who want to learn about distributed systems often ask me, “What are the top papers and books we need to read to learn more about distributed systems?” I used to write one-off emails giving a few pointers. Now that I’ve been asked often enough, I thought it would be a worthwhile exercise to put these together in a single post.
Please feel free to comment if you think there are more papers or books that need to be added.
Distributed Systems Theory:
I believe these are some of the foundational theory papers you must read before you go on to build large-scale systems.
Paxos is the gold standard of distributed consensus protocols. Amazingly, it is also one of the simplest. There is a long story behind how publication of the Paxos paper was delayed, because Lamport wrote the original paper in a non-obvious fashion. Paxos became more approachable to a general audience after he wrote an abridged version of the paper.
Original part-time parliament: http://research.microsoft.com/en-us/um/people/lamport/pubs/lamport-paxos.pdf
If that paper is too abstract for you, I recommend reading the lite version: Paxos Made Simple
I always believe a theory is well understood only if you understand how others put it into practice. Google has documented their experience with Paxos here:
Consistency and Ordering in Distributed Systems:
Lamport, L. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7), pp. 558-565, 1978.
L. Lamport, R. Shostak, and M. Pease, The Byzantine Generals Problem, ACM Transactions on Programming Languages and Systems, July 1982, pages 382-401.
Replication in Distributed Databases:
Demers et al., Epidemic algorithms for replicated database maintenance, PODC 1987.
Jerome H. Saltzer and M. Frans Kaashoek, Principles of Computer System Design, Chapter 10: Consistency.
Lindsay, B.G., et al., “Notes on Distributed Databases”, Research Report RJ2571(33471), IBM Research, July 1979.
Distributed Hash Tables:
Distributed Hash Tables are distributed systems that use hashing techniques for routing and membership. Most of them use consistent hashing as the foundation for routing.
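To make the consistent hashing idea concrete, here is a minimal sketch of a hash ring in Python. It is not taken from any of the papers in this list; the class name `HashRing`, the `vnodes` parameter, and the choice of MD5 as the hash function are all assumptions made for illustration. Each node is placed at several points on a ring, and a key is routed to the node that owns the first point at or after the key's hash, wrapping around at the end.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a ring position (first 8 bytes of its MD5 digest)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Toy consistent-hash ring with virtual nodes (hypothetical, illustrative only)."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes   # number of ring positions per physical node
        self._points = []      # sorted list of ring positions
        self._owner = {}       # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        """Place `vnodes` positions for this node on the ring."""
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            if point in self._owner:
                continue  # position already taken; skip (extremely unlikely)
            bisect.insort(self._points, point)
            self._owner[point] = node

    def remove(self, node: str) -> None:
        """Drop every ring position owned by this node."""
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def lookup(self, key: str) -> str:
        """Return the node owning the first position clockwise from the key."""
        if not self._points:
            raise LookupError("ring is empty")
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]
```

The property that matters for routing and membership is that adding a node only reassigns the keys that now fall on the new node's positions; every other key keeps its owner. Virtual nodes are one common way to even out the share of keys per node, and MD5 is used here only to spread keys around the ring, not for security.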
Real-Life Distributed Systems and Data Stores:
The following papers are seminal and are great examples of distributed systems. They are a must-read for people interested in building distributed systems:
Amazon Dynamo: Amazon’s own key-value store
Google File System: Google’s very own distributed file system.
Google BigTable: Google’s distributed datastore.
MapReduce: A seminal piece of work that has powered the Hadoop ecosystem
Autopilot: Automatic Data Center Management, Michael Isard, April 2007.