Coding for PVFS1 and PVFS2
Arguably, one of the most popular parallel file systems is PVFS (Parallel Virtual File System). Work on PVFS began around 1993 at Clemson University. It has since grown through partnerships with the academic, government, and industrial communities. PVFS is licensed under the LGPL (Lesser General Public License). It has been deployed in production at a number of sites for several years. There are currently two versions of PVFS: PVFS1, the original version, and PVFS2, the new version under development and very close to production release. While PVFS is a great concept, the most efficient use comes from recoding your application to take true advantage of either PVFS1 or PVFS2 (unless your code already uses MPI-IO). However, rewriting the I/O portion of your code to use MPI-IO is really not a difficult task. In this column I'll examine the various ways one can code to use PVFS1 or PVFS2.
Before any coding or recoding begins, you should remember that PVFS was designed to be a high-speed scratch file system. It was not designed to be a permanent file system with backup capability. Consequently, PVFS development focuses on speed. This single-mindedness makes developing codes for PVFS fairly easy.
PVFS1 was the original file system, started at Clemson University in the Parallel Architecture Research Laboratory (PARL). As clusters became popular, it became evident that commercial parallel machines enjoyed an advantage over clusters in the area of parallel file systems. PVFS1 was developed with two objectives in mind: to provide a basic software platform for pursuing the development of parallel I/O and parallel file systems, and to provide a stable, full-featured parallel file system for Linux clusters.
PVFS2 uses some of the basic concepts from PVFS1 but redesigns the architecture to make it more scalable, more flexible, and more portable. It abstracts the storage layer, allowing different types of storage techniques, and it also abstracts the networking layer, allowing networking protocols other than TCP to be easily integrated into PVFS2. In addition, scalability has been re-examined in all phases of the redesign to ensure that PVFS2 scales to thousands if not tens of thousands of nodes.
PVFS1 and PVFS2 are similar in their basic architecture. PVFS splits the tasks of the file system into pieces, keeping the metadata servers separate from the data servers. PVFS1 uses one metadata server, while PVFS2 allows multiple metadata servers. Data are stored on other servers or I/O nodes, called "iod's" (short for I/O daemons). Both PVFS1 and PVFS2 are "virtual" file systems in that they use existing local file systems such as Ext2, Ext3, ReiserFS, and XFS to store data local to each server, rather than relying on their own file system design. PVFS data are striped in files across some or all of the iod's within the PVFS system. Typically, these data are written in round-robin fashion, although PVFS2 allows different access patterns.
I won't be discussing how to set up and configure either PVFS1 or PVFS2. Please see the sidebar that lists the links for PVFS installation support. Let's start by looking at PVFS1 and how you can access the file system from your codes.
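To make the round-robin striping idea concrete, here is a minimal sketch in C of how a byte offset in a striped file could map to an I/O server. It is a simplified model rather than PVFS source code: the type stripe_loc and the function locate_byte() are hypothetical names invented for this illustration, and the assumption that each server packs its stripes back to back in one local file is mine.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical illustration type; not part of any PVFS API. */
typedef struct {
    int      server;        /* index of the I/O server holding the byte    */
    uint64_t local_offset;  /* offset inside that server's local file      */
} stripe_loc;

/*
 * Map a logical file offset to (server, local offset) under plain
 * round-robin striping: stripe 0 goes to server 0, stripe 1 to server 1,
 * and so on, wrapping around after the last server.
 */
stripe_loc locate_byte(uint64_t offset, int nservers, uint64_t stripe_size)
{
    assert(nservers > 0);
    assert(stripe_size > 0);

    uint64_t stripe = offset / stripe_size;   /* which stripe holds the byte */
    uint64_t within = offset % stripe_size;   /* position inside that stripe */

    stripe_loc loc;
    loc.server = (int)(stripe % (uint64_t)nservers);
    /* Assumed layout: each server stores its stripes contiguously. */
    loc.local_offset = (stripe / (uint64_t)nservers) * stripe_size + within;
    return loc;
}
```

With four servers and a 64K stripe, byte 65536 falls on server 1 at local offset 0, while byte 262154 wraps back around to server 0 at local offset 65546.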
PVFS1 Coding
PVFS1 was designed so that there would be several API's (Application Programming Interfaces) to PVFS and so that applications developed using the standard UNIX I/O API would still work with PVFS. Three classes of API's are made available to the user: a native PVFS1 API; the "UNIX/POSIX-like" I/O API; and other API's such as MPI-IO.
The UNIX/POSIX I/O API allows codes that have been built with the standard UNIX I/O functions such as fopen(), open(), fprintf(), fread(), fwrite(), read(), write(), fclose(), etc., to work with PVFS1 without any changes to the code and without any recompiling or relinking. The PVFS kernel module provides this functionality. A small daemon (pvfsd) runs on each PVFS1 client. All function calls are processed by the VFS (the Linux kernel Virtual File System interface), which decides whether the operation is a PVFS operation. If it is, pvfsd performs the operation on behalf of the application. Files written through these functions use the default stripe size set when PVFS1 was compiled, usually 64K. Also, all of the machines defined in PVFS1 will be used, beginning with the first machine in the .iodtab file.
The functions read() and write() perform the data transfer with the I/O nodes each time a call is made. For accessing small amounts of data, this can be terribly time consuming since the data may lie on another node that must be accessed over the network. On the other hand, fread() and fwrite() buffer small file accesses locally and exchange data with the I/O nodes in chunks of some minimum size. Consequently, the data transfer rate is much better using these functions. However, there are some consistency issues with fread() and fwrite() due to the buffering. Also, using PVFS1 for formatted I/O with functions such as fprintf() and fscanf() is not a good use of the file system because the data streams are usually small and will not take advantage of the speed capability of PVFS1.
While codes that use the "UNIX/POSIX-like" API will work with PVFS1, to get better performance one should look at porting the code to use the native PVFS API. This port may seem daunting, but in fact it is fairly painless. The PVFS1 I/O functions are intended to be very close to the UNIX/POSIX calls, so porting requires minimal effort.
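The buffering difference described above can be sketched with nothing but standard C. The function below, write_records_buffered() (a hypothetical name for this example), writes many small records through fwrite() after using setvbuf() to request a stdio buffer of a chosen size, so the small writes are gathered locally and handed to the file system in larger chunks. Matching the buffer to the 64K default stripe size is only a reasonable starting guess on my part, not a tuned recommendation.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Write nrec small records (one double each) through a stdio buffer of
 * bufsize bytes. Returns the number of records written, or -1 on error.
 */
long write_records_buffered(const char *path, long nrec, size_t bufsize)
{
    assert(nrec >= 0);
    assert(bufsize > 0);

    FILE *fp = fopen(path, "wb");
    if (fp == NULL)
        return -1;

    char *buf = malloc(bufsize);
    if (buf == NULL) {
        fclose(fp);
        return -1;
    }

    /* setvbuf() must be called before any other operation on the stream. */
    if (setvbuf(fp, buf, _IOFBF, bufsize) != 0) {
        fclose(fp);
        free(buf);
        return -1;
    }

    long written = 0;
    for (long i = 0; i < nrec; i++) {
        double rec = (double)i;
        /* Each call only copies into the buffer until the buffer fills. */
        if (fwrite(&rec, sizeof rec, 1, fp) != 1)
            break;
        written++;
    }

    /* fclose() flushes the remaining data; buf must stay valid until then. */
    int rc = fclose(fp);
    free(buf);
    return (rc == 0 && written == nrec) ? written : -1;
}
```

Because data sit in the local buffer until it fills or the stream is flushed, another process reading the same file may not see recent writes. Calling fflush() at the points where other processes need to see the data is one way to manage the consistency issue mentioned above.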
For example, the normal open() function can be replaced with the pvfs_open() function:
int open(const char *pathname, int flags, mode_t mode);
int pvfs_open(const char *pathname, int flags, mode_t mode, struct pvfs_filestat *dist);
The structure pvfs_filestat is an optional argument that defines the data distribution the user wants. If the structure is not passed, the default distribution is used. The data structure looks like the following:
struct pvfs_filestat {
    int base;    /* first iod node */
    int pcount;  /* number of iod nodes */
    int ssize;   /* stripe size */
    int soff;    /* NOT USED */
    int bsize;   /* NOT USED */
};
where base is the starting I/O node number (numbering starts at 0) and pcount is the number of I/O nodes you want to use, up to the total number of nodes in PVFS1 (set pcount = -1 to use all of the nodes). The element ssize is the stripe size. The elements soff and bsize are not currently used. You can see that the semantics of the PVFS1 API are very similar to those of the standard UNIX/POSIX I/O API. Many of the other UNIX/POSIX I/O functions have PVFS1 counterparts: pvfs_write(), pvfs_open(), pvfs_lseek(), pvfs_lseek64(), pvfs_close() and so on. Most of the time you can port your code to use the native PVFS1 library with a simple global find/replace that just changes the I/O function called.
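As a rough illustration of the pvfs_filestat fields described above, the sketch below fills in a distribution request. The structure is repeated from the listing only so that the example compiles without PVFS1 installed; in real code it would come from the PVFS1 header. The helper make_dist() is a hypothetical name, and the final call to pvfs_open() is shown only in a comment, using the prototype given earlier. Check the PVFS1 documentation for any extra flags the library needs before it will honor a user-supplied distribution.

```c
#include <assert.h>

/* Stand-in copy of the structure from the listing above; real code
 * would get this definition from the PVFS1 header instead. */
struct pvfs_filestat {
    int base;    /* first iod node */
    int pcount;  /* number of iod nodes */
    int ssize;   /* stripe size */
    int soff;    /* NOT USED */
    int bsize;   /* NOT USED */
};

/*
 * Hypothetical helper: build a distribution request starting at I/O node
 * `base`, spanning `pcount` nodes (-1 means all nodes), with a stripe
 * size of `ssize` bytes. The unused fields are zeroed.
 */
struct pvfs_filestat make_dist(int base, int pcount, int ssize)
{
    assert(base >= 0);                   /* node numbering starts at 0 */
    assert(pcount == -1 || pcount > 0);
    assert(ssize > 0);

    struct pvfs_filestat dist;
    dist.base   = base;
    dist.pcount = pcount;
    dist.ssize  = ssize;
    dist.soff   = 0;
    dist.bsize  = 0;
    return dist;
}

/*
 * Intended use with the native API (not compiled here):
 *
 *     struct pvfs_filestat dist = make_dist(0, 4, 65536);
 *     int fd = pvfs_open(pathname, flags, mode, &dist);
 */
```

Keeping the distribution in a small helper like this means the stripe size and node count can be changed in one place while you experiment with different layouts.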