Autonegotiation, Diskless, PVFS, Multicast Discussions | Beowulf List

Past Postings from the Beowulf, Bioclusters, PVFS Users Lists

There are a huge number of mailing lists available on the Internet. The Beowulf mailing list is a resource for the cluster community. This column will summarize and expand on issues that have been discussed on the Beowulf list and also include issues from other mailing lists that are useful and pertinent to clusters. In this column we visit the Bioclusters, Beowulf, and PVFS lists. We begin our survey in summer of 2003.

Bioclusters: Discussion of autonegotiation between switches and NICs

There was a very interesting discussion on the Bioclusters mailing list that started on Aug. 8th 2003, when Victor Ruotti asked why his Apple Xserve systems would not connect at full duplex. The initial suggestion was to force the systems to full duplex using ifconfig. Donald Becker came to the rescue and explained that forcing full duplex on both ends, the NIC (Network Interface Card) end and the switch port end, is a bad idea that can lead to administration headaches and nightmares. He said that autonegotiation is reliable and that any failure is likely due to flawed hardware or poorly configured network switches, and he gave a couple of links for people to read. He also pointed out that the transceivers on almost all NICs fall back to autosensing the link speed if autonegotiation fails, and explained how this affects the resulting connection. He finished with a brief historical perspective on how forced manual configuration got started and why it has stayed so prevalent in network configurations.

What we can take from this discussion is that you should always allow the NIC and switch to autonegotiate to avoid problems down the road. You can then check your network connection using some diagnostic tools or test the connection using some benchmarking tools. If the connection is not what you think it should be, then the problem may be with bad hardware or with a faulty switch configuration.
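As one way to check what a link actually negotiated, the sketch below reads the `speed` and `duplex` attributes that Linux exposes under `/sys/class/net/<interface>/`. It is only an illustration, not a tool mentioned in the discussion: the function names, the `expected_mbps` threshold, and the interface name in the usage note are assumptions.

```python
from pathlib import Path


def read_link_state(iface, sysfs_root="/sys/class/net"):
    """Return (speed_mbps, duplex) for iface; None marks a value that is not reported."""
    base = Path(sysfs_root) / iface

    def read_attr(name):
        # Reading these attributes can fail (missing interface, link down),
        # so treat any OS-level error as "not reported".
        try:
            return (base / name).read_text().strip()
        except OSError:
            return None

    raw_speed = read_attr("speed")
    try:
        speed = int(raw_speed) if raw_speed is not None else None
    except ValueError:
        speed = None
    if speed is not None and speed < 0:
        speed = None  # the kernel reports -1 when the speed is unknown

    duplex = read_attr("duplex")
    if duplex == "unknown":
        duplex = None
    return speed, duplex


def looks_suspicious(speed, duplex, expected_mbps):
    """Flag a link that is not full duplex or is slower than expected (hypothetical rule)."""
    return duplex != "full" or speed is None or speed < expected_mbps
```

For example, `looks_suspicious(*read_link_state("eth0"), expected_mbps=1000)` returning `True` would be a hint to inspect the cable, the NIC, or the switch port configuration rather than to force the duplex setting.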

There are a number of tools to help you diagnose your network connection. Donald Becker's excellent website can help you get started, and of course you can always post to a mailing list, such as the Beowulf mailing list, to get help. But before you post, please read Donald's website, go over the Beowulf mailing list archives, and use a search engine, such as Google, to look for an answer. Then post your questions to the mailing list.

Beowulf: Where are diskless compute nodes appropriate?

On July 15th 2003, Tod Hagan asked the Beowulf list, "where are diskless compute nodes inappropriate?" The idea of a diskless node is to do away with the hard disk in the node and use a root filesystem mounted over the network (usually via NFS, the Network FileSystem). Diskless nodes are used because they require less power, are more reliable, have fewer moving parts, and are cheaper than nodes with disks. Nicholas Henke responded that diskless nodes are not appropriate when accessing data locally is faster than accessing it via NFS or some other network filesystem. He also mentioned that diskfull nodes (nodes with hard disks) are appropriate when the application uses swap for memory. As an example, he described his site, where they run BLAST (a bioinformatics application) on large data sets that generate a large amount of disk I/O (Input/Output). Putting disks in the nodes and copying the data to the nodes gives faster run times than running over NFS. Joe Landman echoed this, gave some specific numbers and examples, and pointed out that a central fileserver serving out a filesystem can be swamped if even a relatively small number of nodes perform I/O operations at the same time.
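The point about a swamped fileserver comes down to simple arithmetic, sketched below. This is a deliberately crude model, not something posted to the list: it assumes the server's bandwidth is split evenly among the nodes doing I/O, and the bandwidth figures in the usage note are made-up illustrations, not measurements from the discussion.

```python
def per_node_share(server_mb_per_sec, active_nodes):
    """Bandwidth each node sees if the server's bandwidth is split evenly (assumption)."""
    if active_nodes < 1:
        raise ValueError("active_nodes must be at least 1")
    return server_mb_per_sec / active_nodes


def nodes_until_local_disk_wins(server_mb_per_sec, local_disk_mb_per_sec):
    """Smallest number of simultaneously active nodes at which each node's
    share of the server falls below what a local disk would deliver."""
    return int(server_mb_per_sec // local_disk_mb_per_sec) + 1
```

With a hypothetical server that delivers 100 MB/sec and hypothetical local disks that deliver 30 MB/sec, `nodes_until_local_disk_wins(100, 30)` gives 4: once four nodes read at the same time, each gets 25 MB/sec from the server and would have been better off with its own disk.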

Bill Broadley continued with a cluster architecture example: build nodes with a local hard drive that is used just for swap and scratch space, and mount the operating system (OS) via NFS. If a hard drive dies, that node can be rebooted without the swap or scratch space, so the node is still useful until it is repaired or replaced. Bill also pointed out that this can reduce the administrative costs of managing, patching, backing up, and troubleshooting the nodes because the operating system and configuration are centrally stored on the main server for the cluster. You could extend this idea to a large number of nodes by using a small number of dedicated servers whose only purpose is to be an NFS server providing the OS to neighboring nodes. Some of the larger clusters are known to use this type of approach.

Beowulf: Multicast or snowball copy?

On the 18th of August 2003, Rene Storm asked the Beowulf mailing list how to efficiently distribute large files throughout a cluster. Several people responded with recommendations. Mikael Fredriksson recommended BitTorrent. Felix Rauch recommended Dolly, a tool he wrote to clone hard drives. Felix also mentioned that an appropriate TCP chaining approach is often faster than a multicast approach, and he gave references to two papers about copying files efficiently across clusters. Donald Becker echoed that using multicast can cause problems, particularly over larger clusters. He also mentioned that a geometrically cascading copy can work very well. Thomas Lange suggested rgang, a Python tool that uses a tree structure for copying files to nodes or executing commands on many nodes.
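To see why a geometrically cascading copy scales well, the sketch below plans the transfers round by round: in each round, every node that already holds the file sends it to one node that does not, so the number of holders doubles. It only models the schedule and copies nothing, it is not how Dolly, rgang, or BitTorrent work internally, and the function and node names are hypothetical.

```python
def cascade_schedule(source, targets):
    """Plan a cascading copy as a list of rounds of (sender, receiver) pairs."""
    holders = [source]       # nodes that already have the file
    waiting = list(targets)  # nodes that still need it
    rounds = []
    while waiting:
        this_round = []
        # Only nodes that held the file at the start of the round may send,
        # so iterate over a snapshot of the holder list.
        for sender in list(holders):
            if not waiting:
                break
            receiver = waiting.pop(0)
            this_round.append((sender, receiver))
            holders.append(receiver)
        rounds.append(this_round)
    return rounds
```

Under this model, copying a file from a head node to 7 compute nodes takes 3 rounds (1, 2, and then 4 transfers), where sending from the head node to each compute node in turn would take 7.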

The tools mentioned here, and others, can be used for copying files to nodes as part of a user's job or as part of administration where files need to be copied to the nodes for installation or updating. They also can be used as part of an administrator's toolkit to execute commands on all of the nodes in the cluster.

PVFS-Users: Performance issue with PVFS

PVFS is a Parallel Virtual FileSystem that combines the hard drive space on compute nodes or dedicated I/O nodes to form a single filesystem. PVFS is built on top of any standard Linux filesystem. There are various ways to access the filesystem, including the traditional Linux filesystem commands and ROMIO (I/O for MPI). More importantly, PVFS can greatly increase I/O performance compared to traditional filesystems.

While a bit old, there was a discussion on the PVFS-Users mailing list, started on the 16th of July 2003 by Craig Tierney, about performance issues he was having on two PVFS systems where read performance was significantly slower than write performance. There was some initial discussion about tuning some of the parameters in PVFS, such as the stripe size, and tuning some parameters on the nodes to improve performance. This had only a small effect on the read performance. Then Craig and Rob Ross discovered that switching the client code from using mmap() calls to read data to using sendfile() calls greatly increased performance. These two functions, mmap() and sendfile(), are low-level Linux functions that PVFS uses and that can be used for reading or writing data. The performance boost was very good, resulting in about a 50% improvement for one cluster and an almost 300% improvement for another. In fact, the PVFS team made this an option in the current version of PVFS so people can tune their PVFS setup for maximum performance. The team is also looking at adding some tuning suggestions to their FAQ (Frequently Asked Questions) and perhaps writing a program that can be used to measure PVFS performance as a sort of "baseline" to help people tune PVFS. Of course, the ultimate code for tuning is the user's application.
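For readers who have not used these two calls, the sketch below sends a file over a socket both ways using Python's standard `mmap` module and `os.sendfile`. It is not PVFS code and says nothing about where PVFS makes these calls; the function names are hypothetical, and any speed difference depends on the kernel, hardware, and workload.

```python
import mmap
import os
import socket


def send_with_mmap(path, sock):
    """Map the file into memory, then write the mapped bytes to the socket."""
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        if size == 0:
            return 0  # an empty file cannot be mapped
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            sock.sendall(view)
    return size


def send_with_sendfile(path, sock):
    """Ask the kernel to move the file's bytes to the socket directly."""
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        offset = 0
        while offset < size:
            sent = os.sendfile(sock.fileno(), f.fileno(), offset, size - offset)
            if sent == 0:
                break  # nothing more could be sent
            offset += sent
    return offset
```

Both functions deliver the same bytes. The difference is that sendfile() lets the kernel move the data from the file to the socket without the application touching it, which is one plausible reason it can be faster for this kind of transfer.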

An important point about open cluster software like PVFS is the fact that it can be tuned for maximum performance based on the user's application.

PVFS-Users: Size of PVFS Clusters

On August 11th 2003, Nathan Poznick asked the PVFS-Users mailing list what the largest PVFS (Parallel Virtual FileSystem) cluster was, how it was configured, and how it performed. Rob Ross responded that he knew of systems with 10's of servers, 100's of clients, and 10's of TeraBytes (TB) of storage that have performed quite well. He indicated that over Myrinet, they were able to achieve about 3.5 GigaBytes (GB) per second aggregate bandwidth using IDE (Integrated Drive Electronics) disks. Troy Baer responded that their current configuration has 16 I/O nodes, with just a little under 10 TB of raw disk. They have achieved sustained performance of about 1.4 to 1.6 GB/sec for simple tests (e.g. ROMIO perf) and about 100-400 MegaBytes (MB) per second for real applications such as the ASCI Flash I/O code or the NAS BTIO code. He included a link to a paper discussing the results. Craig Tierney also discussed a system he was sizing for a bid on a project that used multiple FC (Fibre Channel) arrays to keep the number of IOD (Input/Output Daemon) nodes to a minimum. He said that he has seen performance of 150 MB/sec with nodes that have a reasonably fast disk. In his opinion, there was no reason that a PVFS filesystem could not be designed for 10 GB/sec performance.
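The figures quoted above lend themselves to a quick back-of-the-envelope check, sketched below. It assumes that aggregate bandwidth scales linearly with the number of I/O nodes, that 1 GB/sec equals 1000 MB/sec, and that the 150 MB/sec figure can be read as a per-node rate. These are simplifications for illustration, not claims made on the list, and the function names are hypothetical.

```python
import math


def per_node_rate(aggregate_mb_per_sec, io_nodes):
    """Average bandwidth contributed by each I/O node."""
    return aggregate_mb_per_sec / io_nodes


def io_nodes_needed(target_mb_per_sec, per_node_mb_per_sec):
    """I/O nodes required to reach a target, assuming linear scaling."""
    return math.ceil(target_mb_per_sec / per_node_mb_per_sec)


# 1.4 to 1.6 GB/sec spread across 16 I/O nodes
print(per_node_rate(1400, 16), per_node_rate(1600, 16))  # 87.5 100.0

# I/O nodes needed for 10 GB/sec at 150 MB/sec per node
print(io_nodes_needed(10_000, 150))  # 67
```

Under these assumptions, a 10 GB/sec filesystem would need on the order of 67 I/O nodes at 150 MB/sec each, provided the interconnect and the clients can keep up.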

Keep in mind that the numbers mentioned are how fast parallel programs read or write data to a PVFS File System and are a function of I/O server, interconnect, and cluster nodes. Searching for articles on the Internet will show how PVFS was used to achieve over 1 GB/sec performance on simple clusters. If I/O performance is a crucial part of your application or if you just want to test PVFS, then visit the website. The code is Open-Source and has been used in production at several sites.

This article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux you may wish to visit Linux Magazine.

Jeff Layton has been a cluster enthusiast since 1997 and spends far too much time reading mailing lists.