
How to sleep well when running PVFS

In the last article we looked at performance improvements for PVFS1 and PVFS2. In this installment, we'll examine how to improve the resilience, or redundancy, of PVFS, as well as how to add some flexibility to the configuration.

Redundancy, or resiliency, is the ability to tolerate errors or failures without the entire system failing. For PVFS (the Parallel Virtual File System), this means the ability to tolerate individual failures without the whole file system becoming unavailable.

As with performance, there are three areas on which we can focus to improve the resilience of PVFS: the PVFS configuration, the IO server configuration, and the storage space. Improving these three things makes PVFS better able to tolerate errors and faults and keep functioning. At the same time, you'll find that many of these steps also improve the flexibility of PVFS.

Since it's easy to think about PVFS in terms of storage space, let's examine how we can configure storage space to improve the resiliency and flexibility of PVFS.

Using RAID on IO Servers

In the previous column we discussed using RAID to improve the performance of the IO servers. We can also use RAID to improve the ability of PVFS to tolerate the loss of one or more drives without losing any data. The IO servers and the metadata server(s) (recall that PVFS2 allows you to have more than one metadata server) can both take advantage of RAID levels for redundancy. Using at least RAID-1 (mirroring) allows an IO server to tolerate the loss of a drive without losing any PVFS data. For example, you could use two drives, mirror them (RAID-1), build a file system on them, and run PVFS from this file system. If a drive fails, you replace it and it rebuilds from the other drive in the background. If you have hot-swap drives, PVFS never goes down.
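
As a rough sketch, here is what building such a mirror with Linux software RAID might look like, scripted from Python with subprocess. The device names (/dev/sdb, /dev/sdc, /dev/md0), the mount point, and the choice of ext3 are assumptions for the example; adjust them for your hardware.

# Minimal sketch: build a RAID-1 mirror for an IO server's PVFS storage.
# Device names (/dev/sdb, /dev/sdc, /dev/md0), ext3, and the mount point
# are assumptions for the example.
import subprocess

def run(cmd):
    """Run a command and raise if it fails."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Create the mirror from two whole drives.
run(["mdadm", "--create", "/dev/md0", "--level=1",
     "--raid-devices=2", "/dev/sdb", "/dev/sdc"])

# Put a local file system on the mirror; PVFS stores its data on top of this.
run(["mkfs.ext3", "/dev/md0"])

# Mount it where the PVFS IO server keeps its storage space.
run(["mkdir", "-p", "/pvfs-storage"])
run(["mount", "/dev/md0", "/pvfs-storage"])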

You can also use RAID-5 to combine a larger number of drives with redundancy. Combinations of RAID, such as RAID-10 (combining RAID-1 and RAID-0) or RAID-50 (combining RAID-5 and RAID-0), could also be used. These combinations get their redundancy from RAID-1 or RAID-5, respectively, but also gain some speed because of the striping (RAID-0).

As mentioned in the previous column, you can use either dedicated hardware RAID controllers or software RAID (included in most Linux distributions). From a redundancy standpoint either solution will be fine. You can also use combinations of them. For example, you could use two hardware RAID controllers each with a RAID-1 set, and then use software RAID to stripe across the two sets (RAID-0).
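
As a sketch of that hybrid setup, assuming the two hardware RAID-1 sets appear to Linux as /dev/sdb and /dev/sdc (the device names and the 64 KB chunk size are assumptions):

# Sketch: stripe (RAID-0) across two hardware RAID-1 sets with Linux software RAID.
# /dev/sdb and /dev/sdc are assumed to be the volumes exported by the two
# hardware controllers; adjust the names and chunk size for your system.
import subprocess

subprocess.run(["mdadm", "--create", "/dev/md0", "--level=0", "--chunk=64",
                "--raid-devices=2", "/dev/sdb", "/dev/sdc"], check=True)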

Using LVM with PVFS

It's a fundamental law of nature and physics that users always want more space. If you prepare beforehand, you can deal with the never-ending whining (oops, I mean "requests") of users for more space. Fortunately, given the design of PVFS, it is fairly easy to accommodate requests for more space.

Recall that PVFS is a virtual file system that is built on an existing file system such as ext2, ext3, jfs, xfs, reiserfs, or reiser4. The easiest way to add space to PVFS is to extend whichever file system you have chosen to put on your hard drives.

LVM (Logical Volume Manager) allows storage space to be adjusted as needs change. There are a number of things you can do with LVM to help you with PVFS. First, by creating the underlying file system on LVM, you can add more space to the file system as needed. For example, you can add a physical storage device (that is, a hard drive), extend the volume group, extend the logical volume, and then resize the file system to use the new space.
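
A minimal sketch of that sequence, assuming a volume group named pvfs_vg, a logical volume named pvfs_lv holding the IO server's storage, a new drive at /dev/sdd, and an ext3 file system (all of these names are assumptions for the example):

# Sketch: grow the local file system under a PVFS IO server with LVM.
# Volume group/LV names (pvfs_vg, pvfs_lv), the new drive (/dev/sdd),
# and ext3/resize2fs are assumptions for the example.
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

run(["pvcreate", "/dev/sdd"])                # label the new drive for LVM
run(["vgextend", "pvfs_vg", "/dev/sdd"])     # add it to the volume group
run(["lvextend", "-l", "+100%FREE",          # grow the logical volume
     "/dev/pvfs_vg/pvfs_lv"])
run(["resize2fs", "/dev/pvfs_vg/pvfs_lv"])   # grow the file system to match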

For a performance boost, you can configure LVM to use striping. Combining this with RAID-0 should be done carefully to get the best performance and retain the flexibility of LVM.
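
For instance, a striped logical volume can be created with something like the following; the two stripes, 64 KB stripe size, 200 GB size, and the names pvfs_vg/pvfs_lv are assumptions to illustrate the syntax:

# Sketch: create a striped logical volume across two physical volumes.
# Two stripes with a 64 KB stripe size are assumptions; tune for your hardware.
import subprocess

subprocess.run(["lvcreate", "-i", "2", "-I", "64",
                "-L", "200G", "-n", "pvfs_lv", "pvfs_vg"], check=True)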

Hardening Metadata Servers

PVFS1 supports one metadata server while PVFS2 supports multiple metadata servers for performance reasons, but in neither case is there a failover design. So, in general, the metadata server is a single point of failure. Even though the metadata server is very lightly loaded, it is worth considering options to improve its reliability.

One of the first things you can do is improve the resilience of the components in the metadata server itself. For example, you can put redundant power supplies in the metadata servers. Another good idea is to make the storage on the metadata server more redundant. For example, using RAID-1 or RAID-5 with hot-swap drives gives you a more resilient metadata storage system.

One could even go a bit further and use a High-Availability (HA) configuration for the metadata server. In an active-passive HA configuration, the passive machine can take over in the event that the active system fails.

HA for the IO/Metadata Servers

One other thing you can do is to make the IO servers into high availability systems. In this case, you could make one IO server an "active" machine and another IO server a "passive" machine. The passive machine won't participate in PVFS, but could take over the function of the active machine if the original active machine fails for some reason. This configuration means that you will use twice as many machines for the same level of performance. You get a much more redundant PVFS but the configuration is twice as costly and you could be using those extra machines for improved throughput. However, because PVFS is so flexible such a configuration is definitely possible if you decide it is needed.

There is a document in the doc/ directory of the PVFS2 source that discusses a high-availability experiment the PVFS2 developers performed. They began with two Dell machines with PowerEdge RAID Controllers in them. They also shared a Dell PowerVault with seven 160 GB disks in a RAID-5 configuration. The two nodes were connected with a cross-over GigE cable. For an active-passive configuration you only need one partition on the shared storage; for an active-active configuration you create two partitions, one for each node.

In the first experiment the team configured the two nodes as an active-passive pair. In other words, one of the nodes is the "active" node providing storage and the other node just watches the active node via a heartbeat cable between the two machines. The active machine has an IP address known to the rest of the machines in the cluster. If it fails, the passive machine changes its IP address to match the known address. The rest of the machines won't even know the first one is down, except that they may have to retry some operations (note that PVFS2 has the ability to retry operations).
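
To illustrate the takeover logic, here is a purely conceptual sketch; it is not the actual Linux-HA heartbeat implementation, and the addresses, interface, shared device, and server start-up path are all assumptions.

# Conceptual sketch of active-passive failover, NOT the real heartbeat
# software -- it only illustrates the takeover logic described above.
# The service IP, interface alias, shared device, and server start command
# are assumptions for the example.
import subprocess
import time

SERVICE_IP = "192.168.1.100"   # address the PVFS clients know about (assumed)
INTERFACE  = "eth0:0"          # alias interface to claim the address on (assumed)
HEARTBEAT_PEER = "10.0.0.1"    # active node's address on the cross-over link (assumed)
CHECK_INTERVAL = 5             # seconds between heartbeat checks

def active_node_alive():
    """Ping the active node over the cross-over (heartbeat) link."""
    result = subprocess.run(["ping", "-c", "1", "-W", "1", HEARTBEAT_PEER],
                            stdout=subprocess.DEVNULL)
    return result.returncode == 0

def take_over():
    """Claim the service IP, mount the shared storage, start the PVFS server."""
    subprocess.run(["ifconfig", INTERFACE, SERVICE_IP, "up"], check=True)
    subprocess.run(["mount", "/dev/sdb1", "/pvfs-storage"], check=True)
    subprocess.run(["/etc/init.d/pvfs2-server", "start"], check=True)  # assumed path

while active_node_alive():
    time.sleep(CHECK_INTERVAL)
take_over()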

During the experiment, with the two nodes operating in failover mode, they brought down the first node (the active one) by simply turning off its heartbeat software. The second node concluded that the first one was down, so it took over the IP address, the file system, and the server programs. When the first node is brought back into production, operations will migrate back to it, if you configured the heartbeat software to do so.

They also did an experiment with an active-active configuration. In this configuration both nodes serve PVFS2 storage, and each one has its own storage area on the shared storage device. The idea behind an active-active configuration is that both nodes serve storage space during normal operation, but if one dies, the other serves out the failed node's storage space as well. You have to configure the servers carefully, but it's not difficult thanks to the efforts of the PVFS2 development team.

Both configurations, active-passive and active-active, were shown to work by the PVFS2 team. If you want to implement high availability on your metadata server(s) and/or your IO servers, you can pick either one. The active-active configuration is appealing because you can use all of the servers while they are up and functioning. However, the active-active configuration is a bit more difficult to set up.

Multiple PVFS Partitions

Another benefit of PVFS is that it is easy to configure and group the IO servers in any fashion you wish to meet your requirements. One of those requirements might be to keep PVFS functioning as much as possible. We've already discussed some things you can do to help with this, but one thing people often overlook is that you can group the various IO servers however you want. One way to do this is to take all of the IO servers and split them into distinct PVFS systems.

For example, if you have 18 IO servers, you could break them into two groups of 9 IO servers, three groups of 6 IO servers, and so on. You can then mount each group on its own set of clients. Each set of clients would mount its specific PVFS group as /mnt/pvfs, but the file system behind that mount point comes from a different set of IO servers for each client grouping. The applications can then use /mnt/pvfs regardless of the client they are on.
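
As an illustration, a small client-side helper could pick the right group based on the client's hostname so that every client still mounts /mnt/pvfs. The group layout, server names, and the pvfs2tab-style mount line below are assumptions; check the PVFS2 documentation for the exact mount specification on your version.

# Sketch: choose which PVFS group this client mounts based on its hostname.
# Group layout, metadata server names, port, and file system name are assumed.
import socket

GROUPS = {
    "group1": "tcp://meta1:3334/pvfs2-fs",   # IO servers 1-9   (assumed names)
    "group2": "tcp://meta2:3334/pvfs2-fs",   # IO servers 10-18 (assumed names)
}

def group_for(hostname):
    """Example policy: clients node001-node032 use group1, the rest use group2."""
    node_num = int(hostname.split(".")[0].replace("node", ""))
    return "group1" if node_num <= 32 else "group2"

spec = GROUPS[group_for(socket.gethostname())]
# Emit a mount line for this client's group; every client still sees /mnt/pvfs.
print("%s /mnt/pvfs pvfs2 defaults,noauto 0 0" % spec)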

You could then run a code on each group of clients, each of which has its own PVFS subsystem. This configuration gives you the benefit of being able to take down one of the PVFS groups for maintenance while the other PVFS groups stay in production. Or, if a failure occurs in one of the groups, the other groups are still functional.

However, there may be problems with this approach, since you might have to adjust the scheduler configuration so that a particular parallel job only gets clients associated with one group.

An alternative configuration is to create several different PVFS groups and mount them with different mount point names. In the previous example you could have /mnt/pvfs1, /mnt/pvfs2, and /mnt/pvfs3. The user codes could use whichever group they wanted. Or you could adapt the codes to look at each group. If one group has more space or perhaps is faster, then you could have the application write to that group.
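
For example, a code could pick the group with the most free space before writing. A minimal sketch, assuming the three mount point names above:

# Sketch: write output to whichever PVFS group currently has the most free space.
import os

GROUPS = ["/mnt/pvfs1", "/mnt/pvfs2", "/mnt/pvfs3"]

def free_bytes(path):
    """Free space available to the application on this mount point."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize

target = max(GROUPS, key=free_bytes)
print("writing output to", target)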

Moreover, the codes could write a copy of their data to each PVFS group (a quasi-RAID-1). This method requires more time since the code is basically writing its IO multiple times. However, since PVFS is so fast, the extra time may not be noticeable. The upside is that you are unlikely to lose a node in all three groups at once, so the data should always be available in at least one group.
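
A minimal sketch of this quasi-RAID-1 approach, again assuming the three mount points above and a hypothetical checkpoint file name:

# Sketch: write the same checkpoint data to every PVFS group (a quasi-RAID-1).
# Losing an IO server in one group still leaves a readable copy elsewhere.
import shutil

GROUPS = ["/mnt/pvfs1", "/mnt/pvfs2", "/mnt/pvfs3"]

def write_everywhere(local_file, name):
    """Copy one file into each PVFS group."""
    for group in GROUPS:
        shutil.copyfile(local_file, "%s/%s" % (group, name))

write_everywhere("checkpoint.dat", "checkpoint.dat")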

Of course, if you have a fixed number of IO servers, creating separate PVFS groups will limit the maximum PVFS performance available to your cluster. You can always add more IO nodes to your PVFS groups, however.

For slower networks connecting the clients and the IO servers, breaking the IO servers into more than one group could improve overall throughput. With too many IO servers communicating over a relatively slow network, the network will be saturated and performance will either plateau or get worse. By separating into multiple groups, the traffic is likely to be better balanced, since the user applications will be at different points in their computations, so the overall throughput should be better. Alternatively, you could put each group of IO servers on its own, possibly slower, network so that PVFS does not saturate any single network; overall throughput will then be better than pushing all of the traffic over one network.


Checkpointing HOWTO

Many codes do checkpointing. That is, they write their current state of computation to a file. The concept is that if the code dies for whatever reason, the code could be restarted from the last good checkpoint. This method saves time, particularly for long computations.

Remember that PVFS is intended as a high-speed scratch file system. Writing your checkpoint files to PVFS is a very good thing to do since the file system is so fast. However, there is a chance one of the IO servers could go down, in which case you will not have access to the files striped across that server. This danger is true for any file system, not just PVFS. Let's take a few moments to examine how we might modify our codes to do checkpointing better.

A simple approach to checkpointing is to write the state of the computation to a file at some interval during the run. The checkpoint file name is usually kept the same since this saves file space. This method is also convenient from a coding point of view since the code uses the same file name for writing and for reading the checkpoint. However, if a problem occurs while the checkpoint file is being written, the checkpoint file will be corrupt and you will have lost the benefits of checkpointing; that is, you must restart your entire program. Moreover, if the file system becomes corrupt or goes off-line, then you will have to wait until it has been repaired or restored to get the checkpoint file back.

There are several ways to avoid some of these problems, and they are the same for PVFS as for any other file system. The first thing you should do is write to multiple files and partitions. I would recommend rotating through at least two, and preferably three, files, spread across different partitions if available.

The first checkpoint should be written to the first file name. If possible, you should read the data back in to make sure the file is the correct size (this is optional, of course); you can also estimate the expected size of the file and compare against that. After the file has been written and you have determined the file size is correct, compute an md5sum of the file and save it as well. Also, if possible, the file should be copied from PVFS to a file system that is backed up.

The next checkpoint should be written to the next file name. After writing, follow the same process of checking the size, computing the md5sum of the checkpoint file, and copying the file to a file system that is backed up.

This process continues for as many checkpoint files as you want. After you have written the last file in the series, you wrap around and reuse the first file name, then the second, and so on.

The key to this process is using multiple files for writing checkpoint data. Also, be sure to compute the md5sum and if possible copy the checkpoint files from PVFS to another file system that is backed up.
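
Putting the whole rotation scheme together, here is a minimal sketch in Python; the paths, the number of checkpoint files, and the helper name are assumptions for the example.

# Sketch of the rotation scheme described above: write each checkpoint to the
# next file name in a small cycle, verify its size, record an md5 checksum,
# and copy it to a backed-up file system. Paths and file count are assumptions.
import hashlib
import os
import shutil

CHECKPOINT_DIR = "/mnt/pvfs/myrun"   # fast PVFS scratch space (assumed path)
BACKUP_DIR     = "/home/user/ckpt"   # backed-up file system (assumed path)
NUM_FILES      = 3                   # rotate through three checkpoint files

def write_checkpoint(step, data, expected_size):
    name = "checkpoint.%d.dat" % (step % NUM_FILES)
    path = os.path.join(CHECKPOINT_DIR, name)

    with open(path, "wb") as f:
        f.write(data)

    # Check that the file came out the expected size before trusting it.
    if os.path.getsize(path) != expected_size:
        raise IOError("checkpoint %s has the wrong size" % path)

    # Save an md5sum alongside the checkpoint so it can be verified later.
    with open(path, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()
    with open(path + ".md5", "w") as f:
        f.write("%s  %s\n" % (digest, name))

    # Keep copies on a file system that is backed up.
    shutil.copyfile(path, os.path.join(BACKUP_DIR, name))
    shutil.copyfile(path + ".md5", os.path.join(BACKUP_DIR, name + ".md5"))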

RAID-1 Within PVFS Itself

Every so often the idea of using RAID-1 (mirroring) within PVFS itself comes up on the PVFS and PVFS2 mailing lists. The concept would be to split the IO servers in half, create a PVFS file system from one half, and then mirror it on the other half of the IO servers. Then, if an IO server goes down, the mirrored PVFS can take over until the faulty IO server is brought back on line.

There are a couple of downsides to this idea. First, you are only using half of your IO servers, which means you will only get half the speed. Second, the RAID-1 operation slows throughput further because of the need to copy the data to the mirrored IO servers. You can look at this in one of two ways: you get less than half the speed you could be getting, or you are paying twice the money for the same speed.

Moreover, remember the intention of PVFS. It is designed to be a high-speed scratch file system; the key word is scratch. Redesigning or adding internal components to make PVFS more resilient goes against this basic tenet of the PVFS design. Even though the developers of PVFS do their best to add things that help its resiliency, they will normally not sacrifice its performance potential to do so.

Parting Comments

This column illustrates many ways you can improve the resilience and the flexibility of PVFS. Some of these options are trade-offs, and some improve both the throughput and the flexibility of PVFS. As always, your application should dictate how you deploy PVFS.

Sidebar One: Links Mentioned in Column

PVFS1

PVFS2

Software RAID Article


LVM2

LVM


Dr. Jeff Layton hopes to someday have a 20 TB file system in his home computer. He lives in the Atlanta area and can sometimes be found lounging at the nearby Fry's, dreaming of hardware and drinking coffee (but never during working hours).