File Systems O'Plenty Part Three: Object Based Storage


One of the first parallel file systems for Linux clusters was PVFS (Parallel Virtual File System). PVFS is an open-source project with a number of contributors. PVFS is not designed to be a persistent file system for something like users' home directories, but rather a high-speed scratch file system for storing data during the run of an HPC job.

The original PVFS system, now called PVFS1, was developed at Clemson University, but with the advent of PVFS2 it is not used as much. PVFS2 is a complete rewrite of PVFS1 that focuses on improving scalability, flexibility, and performance. Currently PVFS2 natively supports TCP, InfiniBand, and Myrinet networks. In addition, PVFS2 accommodates heterogeneous clusters, allowing x86, x86_64, PowerPC, and Itanium machines to all be part of the same PVFS file system. PVFS2 also adds some management interfaces and tools to allow easier administration.

PVFS2 divides the functions of the file system into two pieces: metadata servers and data servers. PVFS2 has only one type of server program, pvfs2-server, and the function a particular server fulfills is defined in a configuration file. Any given pvfs2-server can act as a metadata server, a data server, or both. PVFS2 can also accommodate multiple metadata servers (PVFS1 can only accommodate one). After the metadata and data servers are started, the clients mount PVFS as if it were a normal file system.


The "V" in PVFS stands for Virtual. This means that it doesn't write directly to the storage devices. Instead, the data resides in another file system that does the actual IO operations to the storage devices. When data is written to PVFS, it is sent to the underlying file system in chunks. The underlying file system then writes the data to the storage. For example, you could use ext3, ReiserFS, JFS, XFS, ext2, etc. as the underlying file system for PVFS.

The virtual part of the file system sounds more complicated than it actually is. If you look at a PVFS directory on a particular server's underlying file system, you will see lots of smaller files with strange names. Each of these files is part of a PVFS file. If you combine them in the appropriate way, you will get the real file that is stored in PVFS. So you can think of PVFS as a file system on top of a file system. Consequently PVFS looks a lot like an object based file system and is, in fact, an object based file system.

This approach gives PVFS a great deal of flexibility. For example, you could have several storage nodes running ext3 and another set of nodes running XFS and PVFS will work perfectly fine with all of them. When data is written to PVFS it is broken into chunks of a certain size and then sent to the data servers in some fashion, usually in a round-robin fashion. The size of the chunks, which storage nodes are used, and how the chunks are written, are all configurable, allowing PVFS to be tuned for maximum performance for a given application or class of applications.
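To make the round-robin idea concrete, here is a minimal sketch in Python. It is not PVFS code and does not use any PVFS API; the function names (`stripe`, `reassemble`, `locate`) and the tiny chunk size are hypothetical and chosen only for illustration. It deals fixed-size chunks out to a set of data servers in turn, puts them back together, and shows how a byte offset in the logical file maps to a server and an offset within that server's portion.

```python
from typing import List, Tuple


def stripe(data: bytes, num_servers: int, chunk_size: int) -> List[List[bytes]]:
    """Split data into fixed-size chunks and deal them out round-robin.

    Returns one list of chunks per (hypothetical) data server.
    """
    if num_servers < 1 or chunk_size < 1:
        raise ValueError("num_servers and chunk_size must be positive")
    servers: List[List[bytes]] = [[] for _ in range(num_servers)]
    for start in range(0, len(data), chunk_size):
        chunk_index = start // chunk_size
        servers[chunk_index % num_servers].append(data[start:start + chunk_size])
    return servers


def reassemble(servers: List[List[bytes]]) -> bytes:
    """Interleave the per-server chunk lists back into the original bytes."""
    out = bytearray()
    rounds = max((len(chunks) for chunks in servers), default=0)
    for round_no in range(rounds):
        for chunks in servers:
            # Servers late in the rotation may hold one chunk fewer.
            if round_no < len(chunks):
                out += chunks[round_no]
    return bytes(out)


def locate(offset: int, num_servers: int, chunk_size: int) -> Tuple[int, int]:
    """Map a logical file offset to (server index, offset on that server)."""
    chunk_index = offset // chunk_size
    server = chunk_index % num_servers
    local_offset = (chunk_index // num_servers) * chunk_size + offset % chunk_size
    return server, local_offset


if __name__ == "__main__":
    payload = bytes(range(10))
    layout = stripe(payload, num_servers=3, chunk_size=4)
    for idx, chunks in enumerate(layout):
        print("server", idx, "holds", [c.hex() for c in chunks])
    print("round trip ok:", reassemble(layout) == payload)
    print("byte 9 lives at (server, local offset):", locate(9, 3, 4))
```

Changing `chunk_size` or `num_servers` in the sketch changes how much of a file lands on each server, which is the kind of knob the paragraph above refers to when it says the layout is configurable.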

There are a number of features that PVFS2 uses for improved performance. For example, it can utilize multiple network interfaces on the client node. This design means that if a node has both a TCP NIC and a Myrinet NIC, you can use both network interfaces for PVFS data. You can also change the underlying file system for improved performance. The behavior of the local file system can have a big impact on PVFS2 performance, so there is room for tuning there. You can also use extended attributes to set directory hints to improve performance.

Recall that PVFS is designed to be a high-speed scratch file system. Consequently it does not have much fault tolerance built into it. You can achieve some level of fault tolerance, but you have to configure it. For example, PVFS2 does not naturally tolerate disk failures, but you can use disks that are mirrored or behind a RAID controller to give some fault tolerance. You can do the same for network interfaces by using more than one interface. You can even configure the servers to be fault tolerant by using HA between two servers. The PVFS website has a paper on how you can go about using HA.

PVFS is fault tolerant in one respect. If you lose a server, the file system will not go down. For example, let's assume you have several data servers and one of them goes down while it is being used by clients. If any client was actively writing to that server, then the data on the network or in the process of being written will be lost. But any client that was not using the server will continue to run just fine. If a client was using the server but was not actively writing data to it, then it will continue to run but will not use the down server during the next IO phase. If the down server is returned to service without any changes to the file system underneath PVFS, then the data that was on the server will be available once again. But if the disks in the down server are wiped, for example, then the data is lost. This is a design decision that the PVFS developers have made. PVFS is designed to be a high-speed storage space, not a place for home directories. The intent is to use PVFS for storage of data during an application run and then move the data off of PVFS onto a permanent storage space that is backed by tape or some other type of archiving media.

The decision to focus on PVFS as a high-speed scratch system and not a file system with lots of redundancy has actually freed the developers so that they can focus on improving performance without being shackled by keeping compatibility with a file system interface. The developers have created several ways to access PVFS from the clients. First, there is a system interface that is used as a low-level interface for all of the other layers. The second interface is the management interface. It is intended for administrative tasks for PVFS such as fsck or for low-level file information. Then there is a Linux kernel driver, which is really a kernel module that can be loaded into an unmodified Linux kernel so the Linux VFS (Virtual File System) can access PVFS2. This allows standard UNIX commands such as ls and cp to work correctly. And finally there is a ROMIO PVFS2 interface that is part of the standard ROMIO distribution. ROMIO supports MPI-IO for various MPI libraries. Using this device, ROMIO can better utilize PVFS2 for IO functions.

The performance of PVFS is heavily dependent upon the network performance. Typical Ethernet networks will work well, but will limit scalability and ultimate performance. High-speed networks such as Myrinet, Quadrics, and InfiniBand (IB) greatly improve scalability and performance. In tests, the IO performance of PVFS scales linearly with the number of data servers up to at least 128. At that point, the performance exceeds 1 Gigabyte/sec. PVFS2 has been tested with 350 data servers, each with a simple IDE disk and connected via Myrinet 2000. With 100 clients, PVFS2 was able to achieve an aggregate performance of about 4.5 Gigabyte/sec in writes and almost 5 Gigabyte/sec in reads running an MPI-IO test code.
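To put those aggregate numbers in perspective, here is a small back-of-the-envelope calculation in Python. It only divides the figures quoted above and assumes (my assumptions, not part of the test report) that 1 Gigabyte is 1,000 Megabytes and that the load is spread evenly across servers and clients. The helper name `average_rate_mb_per_s` is hypothetical.

```python
def average_rate_mb_per_s(aggregate_gb_per_s: float, units: int) -> float:
    """Average per-unit rate in MB/s, assuming 1 GB = 1000 MB and an even split."""
    if units < 1:
        raise ValueError("units must be positive")
    return aggregate_gb_per_s * 1000.0 / units


if __name__ == "__main__":
    servers, clients = 350, 100          # figures quoted in the text
    write_gb_s, read_gb_s = 4.5, 5.0     # "about 4.5" and "almost 5"

    print("write per server: %.1f MB/s" % average_rate_mb_per_s(write_gb_s, servers))
    print("read per server:  %.1f MB/s" % average_rate_mb_per_s(read_gb_s, servers))
    print("write per client: %.1f MB/s" % average_rate_mb_per_s(write_gb_s, clients))
    print("read per client:  %.1f MB/s" % average_rate_mb_per_s(read_gb_s, clients))
```

Under these assumptions each server averages roughly 13 MB/s for writes, while each client sees about 45 MB/s. In other words, the aggregate figure comes from many servers each doing a modest amount of work.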

PVFS2 has also been ported to IBM's BlueGene/L. In a fairly recent test, PVFS2 was run on a BlueGene/L at Argonne Labs. In a test using MPI-IO-TEST on 1,024 CPUs, 32 IO nodes (part of BlueGene/L), and 16 P4 storage nodes, PVFS2 was able to achieve an aggregate of about 1.2 GB/s for reads and about 350 MB/s for writes.

So Long and Thanks for All of the File Systems

The file systems that I have covered in this article are just some of what's available. I've tried to cover the more popular file systems for clusters, but undoubtedly I have missed some. Moreover, there may be file systems, such as OCFS2, that have not been tested or even considered for clusters. I invite you to investigate other file systems. After reading this series of articles, you should be able to look at them with a more critical eye.

The big summary table is next!




    Creative Commons License
    ©2005-2012 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.