File Systems O'Plenty Part Two: NAS, AoE, iSCSI, and more!


GlusterFS is a parallel distributed file system has been designed to scale to several Petabytes and is licensed with the GNU open-source license. The fundamental design philosophies of GlusterFS is performance (speed) and flexibility. One of the choices the designers made was to make GlusterFS what is sometimes termed a virtual file system. In many ways it is like PVFS. It stores data in files within a conventional file system such as ext3, xfs, etc., on the storage devices. But the design of GlusterFS goes even further and does not store metadata. This allows the file system to gain even more speed because no central, or even distributed, metadata servers are in the middle of the data transaction (between the clients and the data servers). Instead it relies on the metadata of the local file system in the IO nodes. This is true for distributed file systems or for a parallel file system (more on that later).

Fundamentally GlusterFS is a distributed, network based parallel file system. It can use either TCP/IP or InfiniBand RDMA interconnects for transferring data. At the core of GlusterFS is what are called "translators." Translators are modules that define how the data is accessed and stored on the storage devices. The design of translators also allow you to "chain" or stack translators together to combine features to achieve the behavior that you want. The design of the file system, the use of translators, and the ability to stack them probably makes GlusterFS the most powerful and flexible file system available.

{mosgoogle right}

The fundamental storage unit for GlusterFS is called a "storage brick". Basically a storage brick is a storage node with some storage that makes a file system available for storing data. Fundamentally, you take a node within the cluster, and format a file system on a volume. Then GlusterFS can use this file system as directed by the configuration. This design allows you to either use every node in the cluster that has disks in it, including compute nodes, as storage bricks. Or you can use dedicated storage nodes that have a number of disks in them with RAID controllers.

As with most network file systems, GlusterFS has two parts, a client and a server. The GlusterFS Server, typically run on the storage bricks, allows you to export volumes over the network. The GlusterFS client mounts the exported volumes in a fashion dictated by the translators.

The concept of "translators" is very powerful because it allows you to take the exported volumes and mount them on a client in a very specific manner that you specify. You can have a specific set of clients mount specific volumes, creating something like a SAN. Or you can specify that the clients should mount the volumes to create a clustered, parallel, distributed file system. You can also control how the data is written to the volumes to include mirroring and striping. The translators can be broken into various families based on functionality:

  • Performance translators
  • Clustering translators including what are called scheduling translators that describe how the data is to be scheduled to the storage
  • Debugging translators
  • "Extra feature" translators
  • Storage translators (currently only POSIX)
  • Protocol translators (server or client)
  • Encryption translators
Given the range of capabilities that translators cover you can see how stacking or chaining them together can give you a great deal of capability. It's even possible to write translators for specific applications or specific classes of applications.

Here is the list of current translators available for GlusterFS.

  • Performance Translators
    • Read ahead Translator
    • Write behind Translator
    • Threaded IO Translator
    • IO-Cache translator
    • Stat Pre-fetch Translator
  • Clustering Translators
    • Automatic File Replication (AFR) Translator
    • Stripe Translator
    • Unify Translator
    • Schedulers:
      • ALU (Adaptive Least Usage) Scheduler
      • NUFA (Non-Uniform File System Access) Scheduler
      • Random Scheduler
      • Round-robin Scheduler
  • Debug Translators
    • trace
  • Extra Feature Translators
    • Filter
    • Posix-locks
    • Trash
    • Fixed-id
  • Storage Translators
    • POSIX
  • Protocol Translators
    • Server
    • Client
  • Encryption Translators
    • rot-13
As you can see there are a number of translators than can be used to build your GlusterFS file system. You should look at the GlusterFS Documentation on translators to learn what they can do and the options you can use with each translator.

One of the initial design goals in GlusterFS was to avoid striping. While this may sound funny - typically striping is used to increase IO performance - the designers of GlusterFS felt that striping was not a good idea. If you look at the GlusterFS FAQ you will see a discussion about why striping was not initially included. According to the GlusterFS developers, the 3 reasons they think that striping is bad are:

  • Increased Overhead (disk contention and performance decreases as load increases)
  • Increased Complexity (increased file system complexity)
  • Increased Risk (Lose of a single node means loss of an entire file system)
If you look at the list of translators, the GlusterFS developers have developed a stripe translator. The reasons that the developers did this are:
  • Implementing the strips translator was easy (couple of days work).
  • There are a few companies that want striping features in a file system as their applications use a single large file.
  • If the file is big, a single server can get overloaded when the file is accessed. Striping can help spread the load.
  • If you use the cluster/afr translator with cluster/stripe GlusterFS can provide high availability.
In addition, the developers feel that using InfiniBand/RDMA will provide more than enough IO bandwidth for very high IO rates that people may require.

One of the most interesting features of GlusterFS is the fact that it doesn't use any metadata for the file system. Instead it relies solely on the metadata of the underlying file system. So how does it keep track of where files are located if the files are striped are if the files are written to various storage bricks (i.e. using the afr module. I'm glad you asked that.

In the case of a clustered file system that also has the round-robin scheduler, when the client starts to write a file to the file system, GlusterFS creates a sub-directory with that file name. Then the current IO server is sent the first part of the file and the second server is sent the second part and so on. Each file part is written to the sub-directory. When you include AFR (automatic file replication), which is highly recommended, you have copies of the file parts on other servers. That way if an IO server goes you have a copy of the file on another server and AFR makes a copy of the missing piece and puts it in the proper location. Since the file system knows which was the first server for the file because of the sub-directory, it can recover any missing piece of the file. It seems somewhat complicated, but it definitely eliminates the need for a metadata manager.

GlusterFS is a new file system which means it is still maturing. But the developers and designers of it have thought out of the box to come up with a very interesting and very, very promising file system. It is very flexible and some ingenuous ideas about how to make a high performance file system while still keeping the resiliency in the file system.

EMC Highroad (MPFSi)

Not to be outdone by other companies, EMC has developed a parallel file system that is appropriate for clusters. Originally, the file system was called HighRoad (also called MPFS) and used Fibre Channel (FC) SAN protocols to deliver data to the clients. This aspect limited use because every client had to have an FC adapter to connect to HighRoad. EMC then developed Highroad into MPFSi (Multi-Path File System) by adding iSCSI as the protocol for delivering data to and from the storage devices. This feature increased the appeal of MPFSi since you only needed a TCP network to get data to and from the clients.

The data transfer process of MPFS/MPFSi is similar to both Lustre and Panasas (more on those two file systems later). There is an EMC Celerra server that is the sole metadata manager for the file system. The data is stored on either the Celerra itself (for up to 30 clients) or on a SAN using EMC Symmetrix or CLARiiON units. The clients communicate with the metadata server using NFS and the storage devices using either iSCSI (TCP) or an FC SAN network. Typically people will use iSCSI to communicate to an FC switch that translates the iSCSI protocol to the FC protocol and sends the request to the Symmetrix or CLARiiON.

Each client runs two pieces of software, an MPFS agent and a SAN initiator (iSCSI or Fibre Channel). When a client makes a data request, the MPFS agent contacts the metadata server using NFS as the protocol and gets back a map of where the data is located (or to be located in the case of a write function). Then the client uses iSCSI to directly contact the storage in parallel using the map. So the file system behaves in much the same way as Lustre, Panasas, and pNFS. But from what I can tell, MPFSi is not an object based file system, but a more traditional file system that uses SAN based (block based) storage and uses a lock manager.

When MPFSi writes the data it stripes the data across multiple LUNs where each LUN is part of a RAID group (RAID-3 or RAID-5). This increases the parallelism for better throughput. EMC claims that MPFSi is suitable for applications that access data in 32KB chunks or greater. However, with a single metadata server, even with the performance of a Celerra, there might be a bottleneck in the metadata response for very large number of small blocks. But in general the Celerra is a very high performing system, so there should not be an issue scaling to hundreds of clients performing simultaneous file access. In addition, the design of the file system greatly reduces the metadata traffic.

To maintain compatibility with NFS and CIFS, the Celerra acts as a NFS and CIFS gateway for the clients. This allows the clients to access the file system without using the EMC supplied software.


CXFS is a distributed file system from SGI. SGI originally developed a file system called XFS for their IRIX operating system. They then ported it to Linux and made it available as open source. It has proved to be a very fast and very stable file system (in my opinion the best local file system for Linux, but that's another story). SGI then extended XFS to work in Clustered environments (CXFS).

CXFS is designed to work in a SAN environment. This design allows the actual storage hardware to be decoupled from the server nodes. The actual storage can be centralized and then the servers access the data via a SAN network. The server nodes then share one or more volumes that are consolidated in one or more RAID arrays. CXFS allows files up to 8 million Terabytes (8 Exabytes), and file systems up to 18 Million TB (18 Exabytes). It is also a POSIX compliant file system and supports IRIX, Solaris, Windows, 32-bit and 64-bit Linux, AIX, Mac OS X, and 64-bit SGI Altix Linux. All of these machines can share the same CXFS file system. CXFS also supports POSIX, BSD, and SVR4 file locks, so just about any code should be able to use CXFS.

As with many parallel file systems, CXFS splits the metadata from the actual data to allow faster data processing. CXFS also manages the data and the metadata separately from one another. The data is accessed by the clients over a fiber connection to the SAN, allowing for a higher performance data access. However, the metadata is accessed over Ethernet using TCP/IP. For every CXFS file system, there is a server that is designated as a metadata broker that acts upon the metadata requests. The metadata broker manages file locks when accessing data. CXFS also has the ability to dynamically expand the space for metadata (handy feature for large file systems). For redundancy you can have multiple metadata servers for a file system.

CXFS also has something called GRIO, which stands for Guaranteed Rate I/O. This feature allows certain applications or users to be given a guaranteed level of IO rate for a specified period of time. This type of transfer can be very useful if the system has to copy data to the file system from an outside source in a certain period of time (think satellite data from weather satellites). It can also be useful if certain projects or groups need priority.

Data integrity is a very important topic for file systems. CXFS has several features that give a fair amount of HA (High Availability) capability. It has been designed to detect and recover from single system failures, failure of the active metadata server, failure of the SAN or TCP network components, and failure of the network or SAN partition. In particular, CXFS allows you to configure several backup metadata servers so that if the active metadata server fails, then a backup metadata server can take over without loss of access to the file system (I don't know if any metadata in flight is lost, but I assume so).

I know there are a number of CXFS deployments with SGI's large NUMA Altix systems. I've also heard that the performance is very good. But I haven't heard of any deployments with CXFS for clusters.

Red Hat GFS

A few years ago some researchers from the University of Minnesota formed a company called Sistina to develop an open-source file system and tools for Linux. They developed the original lvm (Logical Volume Manager) for Linux and GFS (Global File System). The goal of GFS is to provide a true global name space for clustered machines using various networking options (such as Fibre Channel or GigE) that is resilient to server or network loss.

After a period of time, Sistina took GFS and made it closed-source. A couple of years ago Red Hat purchased Sistina and re-open-sourced GFS. Initially, tests with GFS on clusters did not meet users expectations, but things have improved since then. GFS is a global distributed file system that can run on Linux systems connected to a Storage Area Network (SAN). These networks can be constructed with Fibre Channel or iSCSI networks. It has distributed locking as well as distributed metadata so if a server in the file system goes down no data or access to data is lost. It has a dynamic path capability so in the event that a switch or HBA (Host Bus Adapter) goes down, it can still get to the other servers in the SAN. It has quota capability as well. GFS is flexible enough that various configurations for clusters are possible. These configurations include both networking and storage options as well. Red Hat is claiming that GFS scales to about 100 servers and tens (hundreds?) of Terabytes.

You can also us GFS in conjunction with NFS and even CIFS (via Samba). You can add gateway nodes to a GFS file system that re-export the GFS file system using NFS or CIFS. In essence, this creates a clustered NAS solution where you can expand the back end storage as far as GFS will allow you to (in other words, quite large), and you can add almost as many gateways as you want or need. Currently, you can get Red Hat GFS and the Red Hat Cluster Suite included in subscriptions to Red Hat Enterprise Linux (RHEL) 5 or as layering options to RHEL 3 and 4.

To my knowledge, there have not been many GFS deployments for HPC clusters. GFS is usually intended for Enterprise systems. While there are many approaches one could use in deploying GFS in a cluster, I see three options that are somewhat typical. The first option is to create a SAN using GFS and several data servers. This SAN is then connected to a computational network that the cluster nodes can see. Each cluster node then runs GFS to connect to the SAN as a client.

The second option is to make each compute node a node within the SAN. This seems a bit overkill, but it does allow each compute node to do both computation as well as serve as part of the storage. I can see scaling problems in this case because each node has to compute and has to serve data as part of GFS as well. Consequently scaling could become problematic. But, I don't have any hard data to confirm this.

The third option is to use GFS as a back-end file system for NFS gateways to create a scalable NAS. You can scale the backend to add capacity and you can add more NFS gateways for more aggregate performance.

That is all for part two. Coming Next: Part Three: Object Based Storage with the obligatory summary table!

I want to thank Marc Ungangst, Brent Welch, and Garth Gibson at Panasas for their help in understanding the complex world of cluster file systems. While I haven't even come close to achieving the understanding that they have, I'm much better than I when I started. This article, as attempt to summarize the world of cluster file systems, is the result of many discussions between where they answered many, many questions from me. I want to thank them for their help and their patience.

I also hope this series of articles, despite their length, has given you some good general information about file systems and even storage hardware. And to borrow some parting comments, "Be well, Do Good Work, and Stay in Touch."

A much shorter version of this article was originally published in ClusterWorld Magazine. It has been greatly updated and formatted for the web. If you want to read more about HPC clusters and Linux you may wish to visit Linux Magazine.

Dr. Jeff Layton hopes to someday have a 20 TB file system in his home computer. He lives in the Atlanta area and can sometimes be found lounging at the nearby Fry's, dreaming of hardware and drinking coffee (but never during working hours).

© Copyright 2008, Jeffrey B. Layton. All rights reserved.
This article is copyrighted by Jeffrey B. Layton. Permission to use any part of the article or the entire article must be obtained in writing from Jeffrey B. Layton.



    Login Form

    Share The Bananas

    Creative Commons License
    ©2005-2012 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.