So Why Use Disks on Clusters?

Spinning platters is what matters

A convergence of several key technologies has begun to change some of the old assumptions about how to build clusters. These technologies are high-speed affordable interconnects, high-speed file systems and storage network protocols, MPI implementations with MPI-IO, and high-performance processors. This nexus of technologies is allowing the next step in cluster evolution to become mainstream - diskless nodes.

Diskless nodes don't have local disks for storage, so they need access to some storage system to function as computational nodes. In some cases the data needs to be seen by all nodes in the cluster. This is sometimes referred to as a "global file system" or a "shared" file system. In other cases the data can just be replicated on the nodes so that each node has its own copy. I will refer to this as "local" storage.

Which type of storage do you need? Good question, grasshopper. The answer is, "it depends upon your application(s)." I hate to give such a vague answer, but it's the truth. Let me explain. If every process in your application (I'm assuming MPI here) needs to read the same input file, but does not need to read or write any data during the run that must be shared by all processes, then you can use just local storage. If your code doesn't meet this requirement, then you need a shared file system. Of course, there is always that third case where the code doesn't necessarily need a shared file system, but one makes things easier, and the users run using local storage anyway. We'll ignore this third possibility for now.


In the past, diskless nodes have used NFS as a way to provide global, cluster-wide storage. While NFS is used on many clusters and can handle the basics, when pushed it is not adequate for many I/O-heavy applications. Fortunately, there are a variety of storage system options, such as AFS, high-speed parallel file systems such as Lustre, IBRIX, GPFS, Terrascale, or Panasas, and file systems that use network storage protocols such as iSCSI, HyperSCSI, or ATA-over-Ethernet (AoE). Also, the growing availability of MPI implementations that include MPI-IO allows codes that utilize its capabilities to use alternative file systems. This capability broadens the available file systems to include PVFS2.

In this article I'd like to discuss these file systems and how they enable diskless cluster nodes, focusing on small to medium clusters of up to perhaps 64 or 128 nodes. In the next article I'll focus on larger clusters, over 128 nodes, including clusters with over 1,000 nodes.

Assumptions

To discuss diskless systems, let's assume that each compute node does not have a hard drive. The necessary OS files are either stored in memory locally (RAM disk), mounted on the node via NFS, or available through a network-based file system. While this assumption seems straightforward and reasonable, you would be surprised how many people have a fundamental problem with nodes without disks.
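To make the NFS-root case concrete, here is a sketch of the kernel command line a diskless node might boot with (the server address and export path are assumptions, not values from any particular cluster; the parameters themselves come from the kernel's nfsroot documentation):

```shell
# Hypothetical boot parameters for a diskless node whose root
# file system lives on an NFS server:
#   192.168.1.1       - assumed address of the NFS server
#   /export/node-root - assumed exported root image
root=/dev/nfs nfsroot=192.168.1.1:/export/node-root,ro ip=dhcp
```

The node gets its IP address via DHCP, mounts its root file system read-only over NFS, and never touches a local disk.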

As I mentioned previously in this article I will focus on what I call small to medium systems. These systems are interesting because there are so many options available. Some of the options are open-source and some are commercial products. However, they all provide the storage capability that diskless nodes need.

Let's look at the file systems that make diskless nodes possible, starting with our old friend - NFS.

NFS

The first storage capability that clusters of this size should consider is good old NFS. NFS has been around for a few years and while slow compared to other file systems it is very well known and understood. Moreover, for small to medium systems it has been found to be very stable. It uses a basic client/server model with the server exporting a file system to various clients who mount the file system.

One of the goals of a networked file system such as NFS is to allow all systems that mount the file system to access the exact same files. In the case of clusters this allows all of the nodes within the cluster to access and share the same data. However, be careful when writing data to the same file from more than one node: NFS is stateless, and you are likely not to get the results you expect. In other words - don't do it.

The history of NFS is very interesting. With the explosion of network computing came the need for a user to access a file on a remote machine in the same or similar way that they access a local file. Sun developed NFS (Network File System) to fill this need. NFS is now a standard and is in version 3, with version 4 coming out soon. Since it is a standard, the client and server can run different operating systems yet share the same data.

The storage on the server can take many forms. Probably the simplest form is for the NFS server to have internal disks. For smaller clusters, the NFS server and the master node can be one and the same. The internal disks can be configured in any way you like. To make life easier down the road, it is a good idea to configure the disks using LVM. This step allows the storage space to be easily grown or shrunk as needed. If the file system can be grown or shrunk while it's on-line, then you can avoid having to reformat the file system and restore the data from a backup.
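As a rough sketch, setting up LVM on two internal disks and later growing the space might look like the following (device names, sizes, and the ext3 choice are assumptions; adjust for your hardware):

```shell
# Pool two disks into one LVM volume group and carve out space.
pvcreate /dev/sdb /dev/sdc            # mark the disks as physical volumes
vgcreate vg_export /dev/sdb /dev/sdc  # pool them into a volume group
lvcreate -L 200G -n lv_home vg_export # create a logical volume
mkfs.ext3 /dev/vg_export/lv_home      # format it

# Later, growing the space is just two commands - no reformat,
# no restore from backup (ext3 supports on-line growth):
lvextend -L +100G /dev/vg_export/lv_home
resize2fs /dev/vg_export/lv_home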

Once the drives are configured, and the file system formatted on the drives, then the file /etc/exports is edited to "export" the desired file systems. Details of this process, if you are having trouble, are at the Linux NFS website. There is also a Howto on configuring NFS and for performance tuning.
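A minimal /etc/exports might look like this (the subnet, paths, and options are illustrative assumptions; see "man exports" for the full option list):

```shell
# /etc/exports on the NFS server - export /home read-write and an
# application tree read-only to the cluster's private subnet.
/home     192.168.1.0/24(rw,sync,no_subtree_check)
/opt/apps 192.168.1.0/24(ro,sync,no_subtree_check)
```

After editing the file, running "exportfs -ra" tells the server to re-read it.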

There has also been some good discussion on the NFS mailing list about which options to choose, both for exporting the file system and for mounting it. In addition, there have been some comments about NFS for diskless systems. The sidebar contains links to good discussions from the NFS mailing list.
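On the client side, the mount options matter for performance. A typical /etc/fstab entry on a compute node might look like this (the server name and transfer sizes are assumptions; larger rsize/wsize values and "hard" mounts are commonly recommended for cluster nodes):

```shell
# /etc/fstab entry on a compute node - mount /home from the master.
# "hard,intr" retries on server hiccups but lets you interrupt;
# rsize/wsize set the NFS read/write transfer sizes.
master:/home  /home  nfs  rw,hard,intr,rsize=32768,wsize=32768  0 0
```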

Another option for using NFS is to buy a preconfigured NAS (Network Attached Storage) box. Many of these boxes are basically servers stuffed with a bunch of disks. Other NAS boxes have special hardware and software to improve performance. A NAS box has storage with a file system that it exports to the cluster using NFS.

There are many, many vendors who sell basic NAS boxes. Quite a number of them are of the basic white-box variety. If you want to purchase boxes from these vendors be sure to check whether they have delivered tuned NAS boxes for Linux. Be sure to ask for as many details as possible. Also, ask for performance guarantees from them or ask for a test box that you can use for testing.

There are other NAS box companies such as Netapp, EMC, Agami, and BlueArc that provide higher performance NAS boxes. Whether these companies can deliver cost effective equipment for small to medium clusters is up to you to decide. Again, be sure to do your homework and ask for a test box or a performance guarantee.

PVFS and PVFS2

I've written several columns about PVFS and PVFS2. PVFS, like NFS, provides a global file system for the cluster. However, it breaks the mold of a typical file system: it is a specialized file system intended as a high-speed parallel scratch file system. For codes to take true advantage of PVFS, they ideally need to be ported to use MPI-IO.

PVFS usually works well if you use it in combination with either another global file system, such as NFS, or local file systems. The primary reason is that PVFS does not currently allow binaries to be run from the file system. You cannot copy your executable to a PVFS file system and run it from there; you need to run the executable from another location, such as an NFS file system or a local file system.

Like iSCSI and HyperSCSI, PVFS has a very flexible configuration. PVFS is a virtual file system. That is to say, it is built on top of an existing file system such as ext2, ext3, jfs, reiserfs, or xfs. You select which machines will run PVFS, either as an I/O storage node or a metadata node. Currently PVFS2 supports TCP, InfiniBand, and Myrinet interconnects. It also supports multiple network connections (multihomed servers).

Configuring PVFS2 is very easy. You select some machines with disks as servers (although you can use a single machine for this purpose). A machine can act as a server and/or a metadata server. Then PVFS2 is installed and configured on these machines. The client software is installed on the diskless nodes to mount the PVFS file system.
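The steps above might look roughly like this, based on the PVFS2 documentation (the host name, port, and paths are assumptions for illustration):

```shell
# On each machine chosen as a PVFS2 server:
pvfs2-genconfig /etc/pvfs2-fs.conf   # interactive: pick servers, ports, storage dirs
pvfs2-server /etc/pvfs2-fs.conf -f   # first run only: create the storage space
pvfs2-server /etc/pvfs2-fs.conf      # start the server daemon

# On each diskless client:
insmod pvfs2.ko                      # load the PVFS2 kernel module
pvfs2-client                         # start the client daemon
mount -t pvfs2 tcp://ionode1:3334/pvfs2-fs /mnt/pvfs2
```

Note that the clients need nothing but the kernel module, the client daemon, and network access to the servers - a good fit for diskless nodes.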

The PVFS2 FAQ gives some good guidance on selecting the number of servers. However, for small clusters that only run one job and have only one PVFS server, NFS may give better results. For larger clusters, having more than one PVFS server will definitely help.

PVFS2 is very flexible, allowing many possible configurations. For example, one can take a few servers that have several disks and use them as PVFS2 servers. To make things easier, I recommend using md to combine the block devices on the disks into a single block device. If you think you might need additional storage space at some point, then it would be a good idea to use LVM with the disks so that more storage space can be easily added. Of course, you could also use the disks in a RAID configuration, either hardware or software RAID, to gain some resilience, and use LVM on top of the RAID-ed block devices.
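One way to layer these pieces - software RAID on the bottom, LVM in the middle, and a local file system on top for PVFS2 to store its data in - is sketched below (device names and the RAID level are assumptions):

```shell
# Combine four disks into a software RAID-5 device for resilience.
mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sd[b-e]

# Put LVM on top of the RAID device so space can be grown later.
pvcreate /dev/md0
vgcreate vg_pvfs /dev/md0
lvcreate -l 100%FREE -n lv_pvfs vg_pvfs

# PVFS2 stores its data in a local file system on each server.
mkfs.xfs /dev/vg_pvfs/lv_pvfs
```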

One other option is to use multiple networks from the servers to the clients. Many server boards come with multiple GigE ports, so it's very easy to just add another set of GigE switches to the network. This extra network also gives you some resilience in case of network failure.

AFS

The Andrew File System, or AFS, grew out of the Andrew Project, a joint project between Carnegie Mellon University and IBM. It was very much a project focused on the needs of academia, but IBM took much of the work and commercialized it. One of the results was AFS.

AFS is somewhat similar to NFS in that you can access the same data on any AFS client, but is designed to be more scalable and more secure than NFS. AFS is distributed so you can have multiple AFS servers to help spread the load throughout the cluster. To add more space, you can easily add more servers. This is more difficult to do with NFS.

One of the features of AFS is that the clients cache data (either on a local hard disk or in memory). Consequently, file access is much quicker than NFS. However, this does require a mechanism for ensuring file consistency across the file system, and it can cause a slowdown with shared file writes. Also, if a server goes down, other clients can access the data by contacting a client that has cached it. However, while the server is down, the cached data cannot be updated (though it remains readable). The caching feature of AFS allows it to scale better than NFS as the number of clients increases.

iSCSI

An option for providing local storage is iSCSI. I've talked about iSCSI before, but just to recap, it stands for Internet SCSI (Small Computer System Interface). The iSCSI protocol defines how to encapsulate SCSI commands and data requests into IP packets, send them over an Ethernet connection, unpack them, and recreate a local stream of SCSI commands or data requests.

In an iSCSI configuration, a server, called a target, provides storage space in the form of block devices that are "exposed" over an Ethernet network. A client, called an initiator, mounts the block device(s), builds a file system on them, and then uses it as though it were local storage. One important point: iSCSI only provides a file system local to each node (initiator) - as far as the OS is concerned, it is a local disk. NFS, on the other hand, provides a global file system so that all of the nodes see the same files.
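From the initiator's side, attaching an iSCSI block device might look like the following, using the open-iscsi tools as one example implementation (the target address, IQN, and device name are assumptions):

```shell
# Discover what the target at 192.168.1.10 exposes, then log in.
iscsiadm -m discovery -t sendtargets -p 192.168.1.10:3260
iscsiadm -m node -T iqn.2005-01.com.example:storage.node17 \
         -p 192.168.1.10:3260 --login

# The exposed block device (e.g. /dev/sdb) now appears as a local
# disk - build a file system on it and mount it like any other.
mkfs.ext3 /dev/sdb
mount /dev/sdb /scratch
```

Each diskless node gets its own block device this way, which is exactly the "local" storage case described earlier.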

    ©2005-2012 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.