RAID and File Systems Discussions

Some aid for those that use RAID

The Beowulf mailing list provides detailed discussions about issues concerning Linux HPC clusters. In this article we turn our attention to other mailinsg lists that also can provide useful information. In this article I review some postings in the Rocks-Discuss and LVM mailing lists where we report on RAID and file system preferences.


Most of the time the mailing lists for specific cluster applications or cluster distributions are devoted to specific questions about the application or distribution. However, some times you will see general questions and very good responses from knowledgeable people on these lists. Rocks is a popular cluster distribution. On January 6, 2004, a simple question to the Rocks mailing list gave rise to some good recommendations. Purushotham Komaravolu asked for recommendations for a RAID configuration for about 200 GB of data (recall that RAID stands for Redundant Array of Inexpensive Disks).

{mosgoogle right}

Greg Bruno provided the first answer. He said that for pure capacity (not necessarily throughput) you should use a 3ware 8006-2LP serial ATA controller with two 200 GB (Gigabyte) serial ATA drives that are configured for mirroring (RAID-1). He said that this should give about 80 Megabytes/sec (MB/s) in read performance and about 40 MB/s in write performance. For more performance, Greg recommended using a 3ware 8506-4LP serial ATA controller and four 100 GB ATA drives configured as RAID-10 (two sets of mirrored drives which are then striped over the two sets). Greg was estimating performance as 160 MB/s for read IO and 80 MB/s for write IO, if you use decent disks.

Jon Forrest joined in the discussion saying that he had a difficult time getting the Promise and Iwill RAID cards (RAID-0 or RAID-1) working with Linux. Greg Bruno responded that they had good luck with 3ware controllers and bad luck with the controllers that Jon mentioned. However, Tim Carlson joined in that he was not impressed with the RAID-5 performance of the 3ware controllers even using serial ATA (SATA) drives. Tim said that he had never gotten more than 50 MB/s using RAID-5 and SATA. He recommended going with SCSI drives and a SCSI RAID controller along with software RAID. Tim finally suggested using a box of IDE (ATA) disks with a back end controller that converts things over to SCSI or FC (Fibre Channel). He said that in his experience this solution scales nicely to tens of TB (terabytes).

Joe Landman jumped in to say that using RAID-5 for high performance is not a good idea. Rather one should be using something like RAID-0 (striping) for increased performance. Joe also took issue with the idea of using SCSI disks. Joe said that in his experience ATA drives were very good but suffered from an interrupt problem that leads to increased CPU load to the point that you could swamp a CPU by writing many, many small blocks at the same time (think of a cluster head node or NFS file server). SCSI controllers hide this behind a controller interface. Joe went on to discuss that current CPUs have much more power than the controller in a RAID card. However, combining software RAID over a cheap hardware controller is asking for trouble, particularly for large loads. Joe ended that he agreed with Tim's recommendation of using IDE disks with a back end controller that converts to SCSI or FC.

A little later Joe said that the important question was what file system people were running on their RAID disks. Joe said that XFS was the best and should be incorporated into ROCKS (note that XFS is now part of the standard 2.4 and 2.6 kernels from Joe Kaiser chimed in that he thought XFS was great and that they have had very good luck with it. Tim Carlson jumped back in to say that he has good luck with ext3. Joe Kaiser responded that they had some data corruption with ext3 for large arrays when the disk has been filled all of the way. Joe and Tim then discussed several aspects of design including the importance of understanding your data needs and your data layout.

This discussion points out that there are several important considerations when designing a file server for a cluster. Considerations such as your data layout, the host machine (CPU power), disk types, RAID controllers, monitoring capabilities, and file system choice, can all have a great effect on the resultant IO performance.

ROCKS: Using Other File Systems

A couple of months after the previous discussion about RAID, a discussion about alternative file systems was begun on the ROCKS-Discuss mailing list. On 16 April, 2004, Yaron Minsky asked about using something other than ext2 on the master node of his ROCKS cluster, particularly ReiserFS or XFS. Phillip Papadopoulos replied that this was a bug in ROCKS 3.1.0 forcing you to use ext2 and would be fixed in the next release. However he did say that you could convert the ext2 filesystem to ext3 using C to add a journal.

Laurence Liew responded that he thought ext2, ext3, ReiserFS, and XFS all had their strengths and weaknesses. He suggested using ext2 for a while to understand the application usage pattern. He also said that in some cases, modifying the layout of the cluster would have a bigger impact than changing file systems. Yaron replied back that he thought ext3 faired worse than XFS or JFS in benchmarks. Laurence replied that he remembered some SNAP benchmark results that showed ext3 winning in certain cases.

There was some discussion about whether Red Hat included ReiserFS and/or XFS in the version of RHEL (Red Hat Enterprise Linux) that ROCKS uses. It was finally determined that XFS was not included but ReiserFS was included but as an unsupported RPM. Later on, Josh Brandt mentioned that he thought ReiserFS would do better on lots of small files compared to other file systems. However, for large files ReiserFS performed worse than other file systems. Yaron, the original poster, posted his basic usage pattern (size of files, number of files, number of directories, etc.). Josh thought he should give ReiserFS a try.

While this discussion is brief it does show that there is a difference in file system performance among various people and groups.



    Login Form

    Share The Bananas

    Creative Commons License
    ©2005-2012 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.