Thoughts on Choosing a Cluster File System

Published on Monday, 07 August 2006 06:20
Written by Dan Stanzione

Beyond NFS, but not too far

OK, OK, so there are plenty of File System articles on Cluster Monkey. But, from an administrator's perspective, no good discussion of cluster administration can continue without coming to that thorniest of issues: file systems and I/O. In the long history of parallel computing, I/O in most cases could have stood for Ignored/Oops! instead of Input/Output. In the more recent history of Beowulf clusters, I/O has finally received some quality attention by some quality people, but there is still no "silver bullet" file system, and providing the right kind of storage for the right job remains one of the biggest headaches for cluster administrators.

Once again, this topic could fill many columns. So, in this column, we're not going to dive into the complexities of choosing your parallel file system, but rather discuss, from the system administrator's perspective, when you should use a parallel file system at all, and when some other option is appropriate.

In truth, the file system itself isn't the whole issue. The design of the entire I/O subsystem in your cluster, hardware and software, is critical. Should you use disk-less nodes? Should you use multiple servers with SAN-attached storage? These are important questions, and perhaps we'll get to them in the future, but for this month, let's assume a garden-variety cluster, look at the trade-offs, and hopefully we'll get more sophisticated over time.

By a garden-variety cluster, I mean a set of single or dual processor nodes (which now translates into two or four cores per node) with a disk in each node, a head node capable of serving as an NFS server, or perhaps a separate NFS server, maybe with a RAID attached, but no SAN storage.

A cluster like this gives you essentially three options for user file systems on your compute nodes. You can use your head or server to provide an NFS (Network File System) to each node. You can use the disks in the compute nodes to provide a separate file system to each node. Or, you can use a parallel file system, such as PVFS, to distribute data across some or all of the disks in your compute nodes. Note: RAM-disk (e.g. Warewulf) based file systems are used on clusters, but these hold system files and are not generally employed by users as part of a parallel application.

So, as the system administrator of your handy cluster, which of these options do you want to implement? The answer, naturally, is all three. But while providing this range of options gives you a lot of flexibility, it can be confusing for your users, and being able to provide reasonable guidance on when to use which is an extremely important function of the cluster administrator.

Why All Three?

If you've used any of the standard cluster building packages described in past columns, such as Scyld, OSCAR, or ROCKS, or you've had some cluster integrator set up a cluster for you with their own custom suite, then you're going to have an NFS file system. If you installed your cluster yourself, the head node is probably the server. Why is this the universal default? Simple: ease of use. With NFS, every node sees the same file system, setting it up is easy, and everyone knows how to use it. But as your cluster gets larger, NFS will suffer severe performance and even correctness problems. This isn't really a criticism of NFS; clusters just aren't the problem NFS was designed to solve. When NFS was designed, the environment in mind was an office full of computers, where anyone could sit down at any machine and access the same files. In most cases, each user would be accessing different files. At most, a few machines might be contending for access to the server at a given time. In this environment, NFS works great.
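For reference, the NFS setup these toolkits generate boils down to something like the following. This is only a sketch: the exported path, subnet, and head node host name are hypothetical, and the details vary by distribution.

```shell
# On the head node, export a directory to the compute nodes by adding a
# line like this to /etc/exports (path, subnet, and options illustrative):
#
#   /home   192.168.1.0/24(rw,sync,no_subtree_check)
#
# then re-export and verify:
exportfs -ra
exportfs -v

# On each compute node, mount it (or add the equivalent line to /etc/fstab):
mount -t nfs head:/home /home
```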

But Linux clusters changed the model, particularly when running parallel jobs. A typical parallel job will consist of many tasks, but each task will be running the same program, and they will start out more or less in sync with each other. So, when your program goes to read or write a file, all parallel tasks will do that at the same time, and probably from/to the same file. Instead of a single user accessing a file from some workstation somewhere on the network at a given time, tens or hundreds of compute nodes are hitting the same file at the same time. NFS doesn't scale well at all in these conditions, and it's simply not meant to work in this way. In practice, you are likely to find that if your parallel job does even moderate I/O, NFS will work reasonably well for 4 or 8 processor jobs, but by 32 processors you will be seeing significant performance penalties, and by the time you break 100 it may fail altogether. Putting in a better fileserver with a faster network interface, or tuning your NFS server daemons to run with more threads, can mitigate this problem up to a point, but sooner or later, NFS will break down (my personal best is around a 225 processor job using NFS; it completed, but performance was abysmal).
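As an example of that tuning, the number of nfsd threads is the usual first knob; many Linux distributions default to only 8 server threads. The commands below are a sketch for a Linux NFS server, and the value 32 is a starting point for experimentation, not a recommendation.

```shell
# Check the current nfsd thread count (requires the kernel NFS server
# to be running):
cat /proc/fs/nfsd/threads

# Raise the thread count at runtime:
rpc.nfsd 32

# To make the change persistent, set it in your distribution's NFS
# service configuration, e.g. RPCNFSDCOUNT=32 in /etc/sysconfig/nfs
# on Red Hat-style systems.
```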

The fundamental problem here is you have a bottleneck at the server. The disk or disk arrays attached to your server have a finite capacity. Even if you could get an infinitely fast disk, the network interface on the server would still be a limit. As the number of clients simultaneously accessing the server increases, the bandwidth available to each one will go down (actually, it's worse than you might think; many network protocols degrade rapidly when the network becomes congested, so running 100 NFS clients might be much worse than half the speed of 50).

The solution here is to get rid of the bottleneck. You must not only use more disks, but the disks must be distributed across different nodes in your cluster. While there are a number of high-end ways to achieve this, if you have a disk in each node in your cluster, you probably want to use it. One approach is to just create a file system on each disk and use it only on that compute node. This works, and gives you much better and more scalable performance than NFS. The problem, obviously, is that you have a lot of different file systems, which means no two nodes see the same files. If you are writing your own programs, this is not necessarily a fatal flaw, though most available applications won't support it.


To use local file systems, you're going to have to do some extra manual work. If each parallel task of your program needs to read from a file, you need to copy the file out to every node before you begin (most cluster management systems have a single command that will do this). While this is easy to do, it takes time. If you must copy the file from a single node to each compute node, you have essentially the same bottleneck as NFS. However, if your program reads the file multiple times, or you will do multiple runs using the same file, this can be a huge win.
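If your cluster suite lacks such a command (pdcp from the pdsh package is one common choice), a plain shell loop over scp will do. Here's a minimal sketch, assuming hypothetical node names node01..node04 and a local /scratch file system on each node; the leading echo makes it a dry run that just records the copy plan.

```shell
#!/bin/sh
# Stage an input file onto the local disk of every compute node.
# NODES, FILE, and /scratch are assumptions -- substitute your own
# host list (or read one from a hostfile) and local path.
NODES="node01 node02 node03 node04"
FILE="input.dat"

# Dry run: write the planned copies to a file and show them.
# Remove the leading 'echo' to perform the real copies with scp.
for n in $NODES; do
    echo scp "$FILE" "$n:/scratch/$FILE"
done > copy_plan.txt

cat copy_plan.txt
```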

If your program needs to write data to a file instead of just read it, you have a bigger problem. If the program expects a single file system, each task is probably going to write to a different part of the file. Using local file systems, you may end up with lots of copies of the same file that are mostly empty space. Merging these back into a single coherent file when the program finishes typically requires some extra code, and may not be for the faint of heart. If you are programming your own application around a local file system model, you may be able to have all tasks write data at the beginning of their own files; then gathering data is simply a matter of copying the files to one place and concatenating them together. Again, this takes some time, so your program had better be doing enough I/O to amortize the cost of distributing and collecting files before and after the run.
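That gather step can be sketched as follows. Assume each of four tasks wrote its portion of the output to a local file named part.&lt;rank&gt; and those files have already been copied back to one directory (the file names and rank count here are hypothetical); concatenating them in rank order rebuilds the single file.

```shell
#!/bin/sh
# Simulate the four per-task fragments; in practice these are copied
# back from each compute node's local disk after the run.
for r in 0 1 2 3; do
    printf 'output from rank %d\n' "$r" > "part.$r"
done

# Concatenate in rank order so the result matches what the tasks would
# have produced writing to one shared file.
cat part.0 part.1 part.2 part.3 > combined.out

wc -l combined.out
```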

Another option for using all those local disks is to use a parallel file system such as PVFS. PVFS will make use of multiple disks in your cluster, divide your data among them, but still present the user or program with the same set of files on every node. PVFS removes the need for distributing and collecting your data, but it is a little harder to set up and more susceptible to hardware failures than NFS (after all, it's using more hardware). More importantly, PVFS can perform badly if you are accessing small files, or just performing small reads and writes on big files. PVFS provides tremendous performance, but only on applications with tremendous I/O requirements.

How To Advise Users

In the end, you'll end up with a combination of these approaches, perhaps all three. You'll find there are situations where each one works well. The trick is in helping your users recognize these situations, and take advantage of the proper option.

For programs doing only a little I/O, for instance reading a few parameters out of a configuration file, or writing a single line of output every few minutes to mark progress, NFS is the clear choice. You'll find many programs that fit this mold, and you should look no further. At the opposite end of the spectrum, if you have applications using large numbers of nodes doing large amounts of I/O, read/write transactions of a megabyte or more and total I/O of hundreds of megabytes or gigabytes, a parallel file system is absolutely essential. In most cases, NFS will not only get out-performed, it will flat out fail.

Local file systems tend to work best when programs need to make small to medium sized reads and writes repeatedly to the same file or set of files. An important extension of this concept is not a single application that does this but perhaps a large set of runs of an application that may share the same input data. The cost of the initial replication of data across nodes or collection at the end can be high; it's usually not worth it if the data will only be read once and discarded. This approach also works best with users willing to do a little work to squeeze out maximum performance. If it's a program that will run once and be discarded, it's probably not worth the effort to deal with multiple local file systems.

Onward

Once again, we've reached the end of a column about the time I feel like I've finished the introduction. Hopefully, you now have a general feel for what your file system options are (short of calling your local Storage Area Network vendor), and when it's appropriate to use each. Like many things, the key to using file systems effectively in clusters is to not become too addicted to any one approach, and to spend a little time experimenting with what works best for a particular problem. You'll inevitably find the need for multiple file system options in your cluster anyway; don't hesitate to try the same problem on each. You'll be surprised at the difference it can make.

Finally, an astute reader pointed out that we missed a resource manager in a previous column. SLURM is a production resource manager used and developed at Lawrence Livermore National Laboratory. It is now more widely available under the GNU General Public License. Like PBS and LSF, it allows for integration with MAUI and other schedulers. One of the strengths of SLURM is its ability to tolerate node failures and continue functioning. SLURM is in use on clusters of 1,000 nodes already.

This article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux you may wish to visit Linux Magazine.

Dan Stanzione is currently the Director of High Performance Computing for the Ira A. Fulton School of Engineering at Arizona State University. He previously held appointments in the Parallel Architecture Research Lab at Clemson University and at the National Science Foundation.
