Clos networks, Numactl and Multi-Core chips, and Parallel Storage. | Beowulf List

Clos networks, Numactl and Multi-Core chips, and Parallel Storage. These are the days of our lives...
The Beowulf mailing list provides detailed discussions about issues concerning Linux HPC clusters. In this article I review some postings to the Beowulf list on Clos networks, The summer can be a bit slow, so I had to work to find some useful topics to discuss. :) But I found some threads that I thought were very useful to highlight. I think they present some very useful information that can be used by everyone.

Clos networks

There are many, many types of networks - almost as many as there are opinions abotu clusters! One that has been used for a while and perhaps people did not know about it, is the Clos network. A Clos network was originally designed for telephone switching and are used in computing networks for growing switch infrastructures to large port counts. On June 7, 2005, Joe Mack, a long time cluster expert, asked a question about Clos networks. In particular, he wanted to know if smaller Clos networks (3 layers and below) are blocking and if they have 5 or more layers, if they are non-blocking.

Greg Lindahl was the first to reply and explain that blocking and non-blocking Clos networks are all a myth. He explained that there is always interference on the network because of traffic from other nodes (the blocking and non-blocking comments were for the telecom industry).

There was some discussion about what is meant by "blocking" and "non-blocking" but then Patrick Geoffray from Myricom responded with a nice, clear, but long response. Patrick explained that blocking for Clos networks, particularly for Myrcom's implementation, you can get a "classical" wormhole blocking where the FIFO (First In, First Out) on the input port is full and the requested output port is used by another packet at that precise time. But he pointed out that this is independent of the dimension of the Clos. Patrick also explained how blocking can happen and how networks are designed to perhaps alleviate some of the possibility of blocking. Very interestingly, he also pointed ways to reduce the overall contention in a Clos network:

Load balance the routes over redundant paths
Fragment messages (i.e. pass smaller packets if possible)
Use route dispersion (Use multiple routes per destination)

Patrick went to say, "There is just a maximum number of nodes you can connect on a Clos of diameter n with N-port crossbars. In 2002, at the time of the referenced email, the Myrinet crossbar we shipped had 16 ports. So the scaling table was the following:"

Clos diameter:    1  3   5    7
Max nodes count: 16 128 1024 8192

Then Patrick said, "We now use a 32-port crossbar in the 256 ports box, so the crossover points have shifted a bit:"

Dimension (Clos): 3    5
Max nodes count: 512 (a lot, ouch my head hurts)

Then Patrick finished up with some theoretical discussion about Clos networks. It's a good email to read if you are interested in networking.

This was a nice, albeit short, discussion about networking. The interesting part is that there is more than one way to skin a cat, so to speak, for cluster network design. If you are at all interested in networking for clusters, be sure to read this thread

Numactl and Multi-Core chips

The entire world is multi-core now. I fully expect the next Christmas toy craze, whatever it is, will have toys with a couple of cell chips in there performing image recognition, speech synthesis, and robotic movement. (Could I make a cluster out of Teddy Ruxpin 2? Never mind).

A discussion started on the beowulf list about performance on a dual-core Opteron 275 node (the discussion even dared to talk about Gaussian performance, but that is verboten). But the discussion started talking about processor afinity, locking a process to a specific core, and the tools that surround this. While not the beginning of the discussion, a post by Igor Kozin, a very knowledgable poster to the beowulf list, talked about using 1 core per socket for some performance testing. He mentioned using 'taskset' for that purpose. Then Mikhail Kuzminsky mentioned the command, numactl. Both tools can be used to produce the desired outcome - pinning a process to a specific core and/or to make sure computationally intense tasks don't end up sharing the same socket (or core) if another socket (or core) is free.

This can be an issue with NUMA (Non-Uniform Memory Access) architectures such as the Opteron. With the Opteron you have a bank of memory tied to each socket. Plus the CPU has the memory controller on-board. But each socket (core) can access the memory on the other socket. You may have a numerically intensive process on one socket, but you may be accessing memory on another socket (not good for latency).

Mark Hahn, posted that the sched_setaffinity function actually does most of the work But he also pointed out that using affinity for all tasks on a node might not be a good idea. For example, he wondered if having one of the cores do the interrupt handling would be worthwhile (but he did point that many Opteron boards at the time tied the IO bridge to a single socket).

Then the discussion branched off to talk more about numactl and what it does for you. Vincent Diepeveen posted an email about some testing he did with numactl and measuring latency in his own program. He saw some differences in latency when using the various cores on the node (there were 4 cores on the board). Stuart Midgley then pointed out that numactl and indeed all affinity tools are not designed to help latency. He said that the 2.6 kernel does a pretty good job of putting the pages on the memory controller attached to the core that the process is using. But things are not perfect, so occasionally the memory pages can get spread around. He also pointed out that, "... the system buffer cache will get spread around effecting everyone." Then he went on to say, "With numactl tools you will force the pages to be allocated on the right memory/cpu. The processes buffer cache will also be locked down (which is another VERY important issue)... ." Stuart mentioned that he has used numactl tools to double the performance of his codes.

I have also seen the Linux kernel scheduler move processes around the cores. In particular, when a daemon wakes up and needs some CPU time, the kernel can move a numerically intensive task from one core and put it on another so the daemon can use the first core. This can be very costly in terms of performance since you now have two numerically intensive tasks sharing the same core. But, once the deamon does what it needs to and goes back to sleep, the kernel is pretty good about moving one of the numerically intensive tasks back to the free core. If you have a number of daemons on the node, then this is more likelyto happen. For MPI jobs this can impact overall performance. It can also affect the repeatability of performance. For example, if you run the same code a number of times, the spread in the performance is much larger when you don't use numactl than if you use it. So using numactl can help your performance and help get more repeatable timings. You can also help yourself by turing off as many daemons as possible on the compute nodes.

Vincent thanked Stuart but also wanted to know why the memory latency for dual-core CPUs was worse than for single-core CPUs. Stuart responded that, "The first thing to note is that as you add cpu's the cost of the cache snooping goes up dramatically. The latency of a 4 cpu (single core) opteron system is (if my memory serves me correctly) around 120ns. Which is significantly higher than the latency of a dual processor system (I think it scales roughly as O(n^2) where n is the number of cpu's)." People generally don't know about "cache snooping" that Stuart mentions. This is a term to describe the cores all telling each other what they have in cache (accessing data in cache is faster than accessing in main memory even if the data is on another chip so they share data in their cache). Cache snooping is one reason you don't see large socket Opteron systems much anymore (I've been told that SGI uses hardware to help their cache snooping).

Stuart went to say, "Now, with a dual core system, you are effectively halving the bandwidth/cpu over the hyper transport AND increasing the cpu count, thus increasing the amount of cache snooping required. The end result is drastically blown-out latencies." A very good explanation of what is going on in the system. Mikhail Kuzminsky went on to discuss some details about the cache snooping (or cache coherency) on the Opteron. In his estimation,the 30% increase in memory latency that Vincent was seeing on the dal-core compared to a single-core is due to the cache coherency effects.

Unfortunately Vincent has an "abrupt" email style and disagreed with Stuart's explanation and acused him of mixing latency and bandwidth concepts. But Stuart went on to give some explanation of cache snooping. In particular he also discussed how latency and bandwidth can be linked together. A number of other people chimed in to say that Stuart was correct.

This was a very good discussion, albeit brief, about the importance of thinking of placement of numerically intensive tasks on multi-core systems, particular NUMA architectures. And guess what, this problem is only going to get more important as the number of cores per socket increases. In the case of MPI codes, some libraries have the ability to schedule jobs on realtively low loaded CPUs (Scali MPI Conenct is one that comes to mind, but I know there are more). Also, Doug Eadline has talked about this problem in a number of his columns in Linux Magazine.

Parallel Storage

On Aug 11, 2005, Brian R. Smith, asked about using a parallel file system as a centralized file system for their clusters. He went on to mention that they wanted a common /home for all of their clusters to limit the copying of data. Brian also mentioned that they had been using PVFS2 with great success. In particular, he mentioned that it has been very stable (no unscheduled outages in 8 months as of the posting of the email).

Joe Mack responded tha he thought AoE (ATA over Ethernet) was cheap (although he had no experience with it.

Then one of the PVFS2 developers, Roc Latham, jumped on the thread. He pointed out the typical access pattern of /home is much different and probably not appropriate for PVFS2. But, then again, PVFS2 wasn't designed for that. He suggested an NFS mounted /home on the clusters and then a /scratch using PVFS2.

Marh Hahn, also responded that he forced users on his clusters to put their "real" data in appropriate places. They gave each user a small quotes for /home but lots of space in other file systems: lots of space in /work which is shared across clusters, /scratch which is local to the nodes, etc. Mark also pointed out that you should pay attention to RAID schemes based upon the usage pattern intended for the file system.

Then David S. and Guy Coates started talking about creating a "storage cluster" that is a set of disparate hardware with some sort of cluster file system on top. David initially thought about using NBD, iSCSI, or something else to tie everything into a file system and then using GFS. Guy responded that you could indeed do this and they have done it using GPFS instead of GFS. He said that they have good reliability with this system.

This thread is very short, but it does point out some interesting things. First, it points out that people, even 2 years ago, were starting to think about a centralized file system for their clusters. This makes a lot of sense because as data files grow, it doesn't make sense to move large amounts of data to and from various clusters, file systems, and even to the desktop of the users. It makes much more sense to have a centralized pool of storage and use that for the clusters. Since you have one pool of storage it also makes sense for it to be a high performing file system that can scale in terms of capacity and performance.

I think one of the more interesting ideas in this thread was how to organize your data. Mark Hahn pointed out that there is more than one way to skin a cat and his way seems to work well for his users. There have been other discussions n the bewoulf list about organizing your data. There is really something to be said for sitting down and determing what most of the data looks like (size, type, etc.) and then match the file system layour to this data pattern.

Dr. Jeff Layton hopes to someday have a 20 TB file system in his home computer (donations gladly accepted) so he can store all of the positings to all of the mailing lists he monitors. He can sometimes be found lounging at a nearby Fry's, dreaming of hardware and drinking coffee (but never during working hours).