Thoughts on Choosing a Cluster File System

Beyond NFS, but not too far

OK, OK, so there are plenty of file system articles on Cluster Monkey. But, from an administrator's perspective, no good discussion of cluster administration can continue without coming to that thorniest of issues: file systems and I/O. In the long history of parallel computing, I/O in most cases could have stood for Ignored/Oops! instead of Input/Output. In the more recent history of Beowulf clusters, I/O has finally received some quality attention from some quality people, but there is still no "silver bullet" file system, and providing the right kind of storage for the right job remains one of the biggest headaches for cluster administrators.

Once again, this topic could fill many columns. So, in this column, we're not going to dive into the complexities of choosing your parallel file system, but rather discuss, from the system administrator's perspective, when you should use a parallel file system at all, and when some other option is appropriate.

In truth, the file system itself isn't the whole issue. The design of the entire I/O subsystem in your cluster, hardware and software, is a critical issue. Should you use diskless nodes? Should you use multiple servers with SAN-attached storage? These are important questions, and perhaps we'll get to them in the future, but for this month, let's assume a garden-variety cluster, look at the trade-offs, and hopefully we'll get more sophisticated over time.

By a garden-variety cluster, I mean a set of single- or dual-processor nodes (which now translates into two or four cores per node) with a disk in each node, a head node capable of serving as an NFS server, or perhaps a separate NFS server, maybe with a RAID attached, but no SAN storage.

A cluster like this gives you essentially three options for user file systems on your compute nodes. You can use your head node or server to export an NFS (Network File System) file system to each node. You can use the disks in the compute nodes to provide a separate file system to each node. Or, you can use a parallel file system, such as PVFS, to distribute data across some or all of the disks in your compute nodes. Note: RAM-disk-based file systems (e.g., Warewulf) are used on clusters, but these hold system files and are not generally employed by users as part of a parallel application.

So, as the system administrator of your handy cluster, which of these options do you want to implement? The answer, naturally, is all three. But while providing this range of options gives you a lot of flexibility, it can be confusing for your users, and being able to provide reasonable guidance on when to use which is an extremely important function of the cluster administrator.

Why All Three?

If you've used any of the standard cluster building packages described in past columns, such as Scyld, OSCAR, or ROCKS, or you've had some cluster integrator set up a cluster for you with their own custom suite, then you're going to have an NFS file system. If you installed your cluster yourself, the head node is probably the server. Why is this the universal default? Simple: ease of use. With NFS, every node sees the same file system, setting it up is easy, and everyone knows how to use it. But as your cluster gets larger, NFS will suffer severe performance and even correctness problems. This situation isn't really a criticism of NFS; clusters just aren't the problem NFS was designed to solve. When NFS was designed, the environment in mind was an office full of computers, where anyone could sit down at any machine and access the same files. In most cases, each user would be accessing different files. At most, a few machines might be contending for access to the server at a given time. In this environment, NFS works great.

But Linux clusters changed the model, particularly when running parallel jobs. A typical parallel job will be composed of many tasks, but each task will be running the same program, and they will start out more or less in sync with each other. So, when your program goes to read from or write to a file, all parallel tasks will do that at the same time, and probably from or to the same file. Instead of a single user accessing a file from some workstation somewhere on the network at a given time, tens or hundreds of compute nodes are hitting the same file at the same time. NFS doesn't scale well at all in these conditions, and it's simply not meant to work in this way. In practice, you are likely to find that if your parallel job does even moderate I/O, NFS will work reasonably well for 4- or 8-processor jobs, but by 32 processors you will be seeing significant performance penalties, and by the time you break 100 it may fail altogether. Putting in a better file server with a faster network interface, or tuning your NFS server daemons to run with more threads, can mitigate this problem up to a point, but sooner or later, NFS will break down (my personal best is around a 225-processor job using NFS; it completed, but performance was abysmal).

The fundamental problem here is that you have a bottleneck at the server. The disk or disk arrays attached to your server have finite throughput. Even if you could get an infinitely fast disk, the network interface on the server would still be a limit. As the number of clients simultaneously accessing the server increases, the bandwidth available to each one will go down (actually, it's worse than you might think; many network protocols degrade rapidly when the network becomes congested, so each of 100 NFS clients might see much less than half the bandwidth that each of 50 would).
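To put rough numbers on the bottleneck, here is a minimal Python sketch of the best case, in which the server's bandwidth is split evenly among the clients. The function name and the figures (a 125 MB/s network link, roughly Gigabit Ethernet, and a 200 MB/s disk array) are assumptions chosen for illustration, not measurements from any real cluster.

```python
def per_client_bandwidth(n_clients, disk_mb_s, nic_mb_s):
    """Idealized upper bound on the MB/s each client gets from one server.

    Assumes the server's slowest component (disk or network interface)
    is shared evenly and that there is no protocol overhead or congestion.
    """
    if n_clients < 1:
        raise ValueError("need at least one client")
    # The server can deliver no more than its slowest component allows.
    server_limit = min(disk_mb_s, nic_mb_s)
    return server_limit / n_clients


if __name__ == "__main__":
    # Hypothetical server: 200 MB/s disk array behind a 125 MB/s link.
    for n in (4, 8, 32, 100):
        share = per_client_bandwidth(n, disk_mb_s=200.0, nic_mb_s=125.0)
        print(f"{n:4d} clients: at most {share:.2f} MB/s each")
```

Because the model ignores congestion, it is only an upper bound; as noted above, real per-client bandwidth under load can fall off faster than this even split suggests.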

The solution here is to get rid of the bottleneck. You must not only use more disks, but the disks must be distributed across different nodes in your cluster. While there are a number of high-end ways to achieve this, if you have a disk in each node in your cluster, you probably want to use it. One approach to doing this is to just create a file system on each disk and use it only on that compute node. This works, and gives you much better and more scalable performance than NFS. The problem, obviously, is that you have a lot of different file systems, which means no two nodes see the same files. If you are writing your own programs, this is not necessarily a fatal flaw, though most available applications won't support this.
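If you are writing your own programs, living with per-node file systems can be as simple as having each task write to its own file on the local disk. The following is a minimal Python sketch of that idea, not a recipe from any particular cluster package. The scratch directory, the job name, the file naming scheme, and the way the task's rank is passed in are all hypothetical; in a real job the rank would come from your parallel runtime.

```python
import os
import socket
import tempfile  # used only by the demo at the bottom


def local_output_path(scratch_dir, job_name, rank):
    """Build a file name unique to this task on this node's local disk."""
    host = socket.gethostname()
    return os.path.join(scratch_dir, f"{job_name}.{host}.rank{rank:04d}.dat")


def write_local(scratch_dir, job_name, rank, data):
    """Write this task's bytes to node-local scratch and return the path."""
    os.makedirs(scratch_dir, exist_ok=True)
    path = local_output_path(scratch_dir, job_name, rank)
    with open(path, "wb") as f:
        f.write(data)
    return path


if __name__ == "__main__":
    # A temporary directory stands in for a node-local scratch file system.
    demo_dir = tempfile.mkdtemp()
    print(write_local(demo_dir, "myjob", rank=3, data=b"partial results\n"))
```

Putting the host name and rank in the file name keeps the pieces distinguishable when they are later gathered from the nodes, which is exactly the extra bookkeeping a shared file system would otherwise spare you.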



    Creative Commons License
    ©2005-2012 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.