Classical Big Data Reading – Google File System

This time I’ll discuss “The Google File System (GFS) by Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung from 2003. While GFS is propriety to Google, it was the direct inspiration for the Hadoop Distributed File System (HDFS), which is the fundamental layer of the popular Hadoop ecosystem.

In a nutshell, what GFS does is it takes a cluster of commodity servers with local disks and builds a fault-tolerant distributed file system on to top of it. The main innovation was picking up a new set of assumptions and optimizing the file system around a specific use case from Google:

  • The cluster is designed for low-end, low-cost nodes with local disks. Specifically, failure of disks and servers is handled automatically and transparently as they are expected to be a common, normal happening,not some rarely tested corner case.
  • The file system does not aim to be general-purpose. It is optimized for large, sequential I/O for both read and writes (high bandwidth, not low latency).
    In addition, GFS aims to hold relatively few (millions) files, mostly large ones (multi-GBs).

The architecture of GFS will look very familiar if you know HDFS. In GFS, there is a single master server (similar to HDFS Name Node) and  one chunkserver per server (similar to HDFS Data Node). The files are broken down to large, fixed-size chunks of 64MB (similar to HDFS blocks), which are stored as local linux files and are replicated for HA (three replicas by default). The master maintains all the metadata of the files and chunks in-memory. Clients get metadata from the master, but their read/write communications go directly to the chunkservers.The master logs metadata changes persistently to a local and remote operation log (similar to HDFS EditLog), but chunk location metadata is not persisted it is gathered from the chunkservers during master startup etc etc.

GFS Architecture

Cool features – surprisingly, GFS had in 2003 some features that are yet to appear in HDFS. For example, GFS supports random writes in the middle of files and atomic concurrent appends (‘record append” operation), with some documented side effects (under “consistency model”). In addition, GFS supports snapshots of files and directories, using COW at the chunk level. Also, the master is supposed to restart in seconds when it crashes (the examples later show only 10s of MB of metadata on it) and it automatically writes a new checkpoint of its metadata periodically without a need for a separate service for that (like HDFS Secondary Name Node). Finally, there exist “shadow” masters that delivers read-only access to GFS even when the master is down.

To quote the paper, as of 2003, GFS was “widely deployed” at Google. The paper proudly states that “The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, and it is concurrently accessed by hundreds of clients”, which is still pretty impressive node count. However, I find the performance numbers discussed pretty lame compared to today’s standards – not surprising given the 100Mb/s network used.

To sum it up, it is fascinating to see how influential the original GFS design was, and some functionality gaps that still still exists between its 2003 version and today’s HDFS implementation. If you want to know more on the current design of HDFS, its architecture guide should be your next read.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s