I finally caught up with ACM Queue’s interview with Sean Quinlan on the evolution of GFS. Here is a short recap for folks who haven’t read it. GFS, the Google File System, has been used extensively at Google for more than 10 years (see the GFS paper).
In this interview, Sean talks about some of the shortcomings of the original GFS design and the challenges they faced when many more applications started using GFS. One of the biggest issues they ran into was having a single master in charge of a GFS cluster: the master keeps the file system metadata in memory, so the amount of metadata that fits in one machine’s memory put an inherent limit on the number of files a GFS cluster could hold.
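To get a feel for that limit, here is a rough back-of-the-envelope sketch. The figures of under 64 bytes of namespace data per file and under 64 bytes of metadata per 64 MB chunk are upper bounds taken from the original GFS paper; the master memory size and the function name `max_files` are my own assumptions for illustration, not numbers from the interview.

```python
# Rough estimate of how many files a single master could track in memory.
# Per-file and per-chunk costs are upper bounds from the original GFS paper;
# the RAM size used below is a hypothetical value, not a Google figure.

def max_files(master_ram_bytes, chunks_per_file=1,
              bytes_per_file=64, bytes_per_chunk=64):
    """Return the number of files whose metadata fits in master memory."""
    per_file_cost = bytes_per_file + bytes_per_chunk * chunks_per_file
    return master_ram_bytes // per_file_cost

GIB = 2 ** 30

# Hypothetical master with 64 GiB of RAM, all of it used for metadata.
small_files = max_files(64 * GIB, chunks_per_file=1)
larger_files = max_files(64 * GIB, chunks_per_file=3)
print(small_files, larger_files)
```

The point of the sketch is that the ceiling depends on the file count rather than on total bytes stored, which is why a workload with many small files hits the limit of a single master long before the disks fill up.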
Later in the interview, he talks about a new version of GFS they are building that uses a distributed master model, where more machines can be added to the GFS cluster and the load of replication and chunking is distributed across them automatically. This should handle more load and more files, and provide higher availability and better performance.
A few things that intrigued me about this interview:
(i) GFS, which was originally built for batch file processing, has evolved to support more online applications like Gmail. This introduced new performance, availability, and durability requirements.
(ii) How changing application patterns have driven the design of GFS toward a more distributed model that meets the demanding availability and performance needs of online applications.
(iii) The experience with the “loose” (eventual) consistency model in GFS and how they handled different failure modes. It looks like their biggest issue was dealing with client failures, since clients were in charge of data reconciliation (which is one of the biggest challenges with eventual consistency). To avoid these issues, they appear to be moving to a single-writer-per-file model, basically serializing all the writes. That seems like a reasonable approach to provide a tighter bound on consistency (at the expense of possibly reduced “write availability”).
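The single-writer idea in (iii) can be illustrated with a toy, in-memory sketch. This is only my illustration of the general technique, not how GFS or its successor implements it; the class `SingleWriterFile` and all of its methods are hypothetical names.

```python
# Toy single-writer-per-file model: at most one client may hold the write
# role at a time, so every append lands in one well-defined order.
# All names here are hypothetical and exist only for illustration.

class SingleWriterFile:
    def __init__(self):
        self._writer = None   # id of the client currently allowed to write
        self._records = []    # appended records, in their serialized order

    def open_for_write(self, client_id):
        """Grant the write role, or refuse if another client holds it."""
        if self._writer is not None and self._writer != client_id:
            raise PermissionError(f"file already has writer {self._writer!r}")
        self._writer = client_id

    def append(self, client_id, data):
        """Append a record and return its position in the file."""
        if self._writer != client_id:
            raise PermissionError(f"{client_id!r} is not the current writer")
        self._records.append(data)
        return len(self._records) - 1

    def close(self, client_id):
        """Release the write role so another client can take over."""
        if self._writer == client_id:
            self._writer = None

    def read_all(self):
        """Readers always see the records in the same order."""
        return list(self._records)
```

The refusal in `open_for_write` is where the reduced write availability shows up: a second client has to wait until the current writer closes the file, but in exchange there are no concurrent writes left to reconcile after a client failure.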
Overall, this was a very insightful interview for me, and it is interesting to see how similar some of these problems are to what Amazon has seen and solved in the past.
I am really looking forward to reading a new SOSP/OSDI paper on GFS v2.