Survey of File Systems for Clusters (And Storage Options)


IBM has had a global parallel file system for many years. The Global Parallel File System (GPFS) started with the IBM SP systems running AIX, but has been ported to Linux for certain IBM eServer, blade server, and xServer products.

GPFS is a high-speed, parallel, distributed file system that achieves high-performance by striping data across multiple disks and multiple nodes. To further improve performance, it uses client-side caching as well as read-ahead and write-behind functions. It can use large block sizes from 16KB to 1024 KB to help improve IO throughput depending upon requirements.

To get HA capability, it can be configured using logging and replication. It can be configured for fail over at a disk-level and at the server level. This means that if you lose a node GPFS will not lose access to data nor degrade performance unacceptably. As part of this it can transparently fail-over lock servers and other core GPFS functions so that the system stays up and performance is at an acceptable level.


{mosgoogle right}

There is a relatively new open-source file system called Lustre that has been developed for Linux clusters. It follows a model that the newest version is only available from Cluster File Systems, Inc. while the previous version is available freely from It can potentially scale to ten of thousands of nodes and hundreds of Terabytes of storage. Lustre stores data as objects, called containers, that are very similar to files, but are not part of a directory tree. The advantage to an object based file system is that allocation management of data is distributed over many nodes, avoiding a central bottleneck. Only the file system needs to know where the objects are located, not the applications, so the data can be put anywhere on the network.

As with other file systems, Lustre has a metadata component, a data component, and a client part that accesses the data. However, Lustre allows these components to be put on various machines, or a single machine (usually for home clusters or for testing). The metadata can be distributed across machines called MetaData Servers (MDS), to ensure that the failure of one machine will not cause the file system to crash. The MetaData Servers support failover as well. In Lustre 1.x, you can use up to two MDS machines while in Lustre 2.x, you can have tens or even hundreds of MDS machines.

The file system data itself, is stored as objects on the Object Storage Servers (OSS) machines. The data can be spread across the OSS machines in a round-robin fashion (striped in a RAID sense) to allow parallel data access across many nodes resulting in higher throughput. The data is also distributed to ensure that a failure of an OSS machine will not cause the lose of data. The client machines mount the Lustre file system in the same way other file systems are mounted. They communicate with the OSS machines for direct data access.

The MDS machines, OSS machines, and clients can all reside on one machine (most likely for testing or learning) or can be split among multiple machines. For example, you could make all of the nodes within a cluster OSS machines and clients so that they can see the entire file system. You could choose one or more of the nodes in the cluster to be MDS machines for fail over redundancy.

Lustre uses open network protocols to allows the components to communicate. It uses an open protocol called Portals, originally developed at Sandia National Laboratories. This allows the networking portion of Lustre to be abstracted so new networks can easily be added. Currently it supports TCP networks (Fast Ethernet, Gigabit Ethernet), Quadrics Elan, Myrinet GM, Scali SDP, and Infiniband. Lustre also uses Remote Direct Memory Access (RDMA) and OS-bypass capabilities to improve I/O performance.

Recently some performance testing was done with Lustre on a cluster at Lawrence Livermore National Laboratories (LLNL) with 1,100 nodes. Lustre was able to achieve a throughput for each OST of 270 MB/sec and an average client I/O throughput of 260 MB/sec. In total, for the 1,000 clients within the cluster, it achieved an I/O rate of 11.1 GB/sec.


One of the first parallel file systems for Linux clusters was PVFS (Parallel Virtual File System). The original PVFS system, now called PVFS1, was originally developed at Clemson University. It is not designed to be a file system for users home directories but rather a high-speed scratch file system for the storage of input and output data during the run of a job. It is an open-source project that has involved several developers. There is a new version of PVFS called PVFS2 which is a complete rewrite the code to make PVFS more expandable for new systems and to add new features.

PVFS2 is a complete rewrite of PVFS1 focusing on improving the scalability of PVFS, the flexibility of PVFS, and the performance of PVFS. PVFS1 has been modified for various networking protocols beside TCP, but the modifications were very invasive and difficult to support. PVFS2 has abstracted the networking layer allowing new networking protocols to be used very quickly. For example, Infiniband support was added in about a week with only about 3,000 lines of C code. Currently PVFS2 supports TCP, Infiniband, and Myrinet. In addition, PVFS2 accomodates heterogeneous clusters allowing x86, PowerPC, Itanium machines to all be part of the same PVFS file system. PVFS2 also adds some management interfaces and tools to allow easier administration.

Both PVFS1 and 2 divide the functions of the file system into two pieces, a metadata server and data servers, and allow clients to mount PVFS as a normal file system. The metadata server holds information about the data in the file system, and the IO devices (IOD's) actually hold the data. In PVFS1 the metadata is put onto one machine. In PVFS2, the metadata can be distributed. The data is then spread across the IO devices (IOD's) which can be the nodes within a cluster. PVFS stores the data as files on top of an existing file systems such as ext3 or xfs. So you can mix the file systems under a PVFS system without worry. When data is written to PVFS it is broken into chunks of a certain size and then written to the IOD's in some fashion, usually in a round-robin fashion. The size of the chunks, what IOD nodes are used for storage, and how the chunks are written, are all configurable, allowing PVFS to be tuned for maximum performance for a given application or class of applications.

One of the features of PVFS is that one of the IOD's can go down and PVFS will still be available. PVFS will recognize that one of the IOD's is down and not write to it. Instead it will go on to the next IOD in the list. Consequently, applications that are writing will not hang or crash (as long as they are not writing to the IOD node when it crashes). However, any application that was reading data that was on the IOD will not be able to fully access all of the data since that IOD is down. If the IOD is brought back into production without the storage space being altered, the data will once again be available for use. On the other hand, if the storage space on the returned IOD has been altered, such as re-installing Linux, then the file(s) that used that IOD are corrupt and cannot be recovered. This is a design decision that the PVFS developers have made. PVFS is designed to be a temporary high-speed storage space, not a place for home directories. The intent is to use PVFS for storage of data during an application run and then move the data off of PVFS onto a permanent storage space that is backed by tape or some other type of archiving media.

There are several ways to use PVFS. The best way to take advantage of the parallel capabilities involves using an MPI-IO implementation such as ROMIO or ChaMPIon that uses PVFS directly (either PVFS1 or PVFS2). You can also use the normal Linux kernel interface but you won't get the parallel speed of PVFS. This interface allows PVFS to be mounted as a normal file system on the client nodes and for users to interact with PVFS using typical commands such as ls or mv or rm. Currently PVFS1 and PVFS2 cannot run binaries (i.e. you can't run an application stored in PVFS). In the case of PVFS1, there is also a third interface, the PVFS library. This library has semantics very similar to the standard C library file functions. It allows the parallel performance capabilities of PVFS to be used, but requires some changes to your application.

{mosgoogle left}

PVFS is in use at various sites around the world. PVFS1 is in production use at several locations including the University of Utah and the Ohio State Supercomputer Center. PVFS2 in production use as well. It is being used on Orion Multi-System desktop clusters.

The performance of PVFS is heavily dependent upon the network performance. The typical ethernet networks will work well, but will limit scalability and ultimate performance. High-speed networks such as Myrinet, Quadrics, and Infiniband (IB) greatly improve scalability and performance. In tests, the IO performance of PVFS scales linearly with the number of IOD's up to at least 128 IODs. At that point, the performance exceeds 1 Gigabyte/sec. PVFS2 has been tested with 350 IOD's, each with a simple IDE and connected via Myrinet 2000. With 100 clients, PVFS2 was able to achieve an aggregate performance of about 4.5 Gigabyte/sec in writes and almost 5 Gigabyte/sec in read performance running an MPI-IO test code.


A few years ago some researchers from the University of Minnesota formed a company called Sistina to develop open-source file system tools for Linux. They developed lvm (Logical Volume Manager) for Linux and GFS (Global File System). The goal of GFS is to provide a true global name space for clustered machines using various networking options (such as Fibre Channel or GigE) that is resilient to server or network loss.

After a period of time, Sistina took GFS and made it closed-source. Recently, Redhat has purchased Sistina and re-open-sourced GFS. Redhat is offering GFS with commercial support as part of their Red Hat Cluster Suite for $499 for up to 8 nodes (it requires a subscription to Red Hat Enterprise Linux (RHEL) on these servers).

GFS is a global distributed file system that can run on Linux systems that are connected to a Storage Area Network (SAN). These networks can be constructed with Fibre Channel or iSCSI networks. It has distributed locking as well as distributed meta data so if a server in the file system goes down no data or access to data is lost. It has a dynamic path capability so in the event that a switch or HBA (Host Bus Adapter) goes down, it can still get to the other servers in the SAN. It has quota capability as well. GFS is flexible enough that various configurations for clusters are possible. This includes both networking and storage options as well. Redhat is claiming that GFS scales to hundreds of machines and tens of Terabytes.



    Login Form

    Share The Bananas

    Creative Commons License
    ©2005-2012 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.