Gigabit Switches, Channel Bonding, Opterons, and Large Memory Allocations

Published on Sunday, 18 December 2005 19:00
Written by Jeff Layton
In this installment of the Best of the Beowulf Mailing List, we look at Gigabit switches, channel bonding, Opterons, and large memory allocations. You can consult the list archives for the actual conversations.

Choosing a Gigabit Switch

On November 11, 2003, Keyan Mehravaran said that he was connecting eight dual-Xeon PCs with on-board Gigabit NICs (Network Interface Cards). He asked about the relative advantages of a managed switch versus an unmanaged switch, and about channel bonding Gigabit NICs. An unmanaged switch is one whose management is entirely internal and cannot be configured. A managed switch is one that is configurable via an interface such as a web browser or serial terminal. Donald Becker replied that in his opinion managed switches are frequently not a good choice because the switch can be set to work in "flawed modes" that can cause problems later. He also explained in another posting that he thought auto-negotiation, which unmanaged switches perform by default, was a good thing because it is automatic, transparent, and extensible. Moreover, he pointed out that most switches now use Ethernet flow control, and users don't know this because it is configured during auto-negotiation. The bottom line is that things just seem to work better.

The discussion about managed versus unmanaged switches turned to an interesting topic when Mark Hahn mentioned that he spoke to a large switch manufacturer and asked about some HPC features, such as adding a special Quality of Service (QOS) tag for small packets to give them preferential treatment. He also mentioned that he would like to get per-port performance statistics from the switch. Donald Becker mentioned that QOS tags already existed but cautioned against using them on LANs (Local Area Networks). In his opinion they are very good for multi-traffic, multi-path WANs (Wide Area Networks), where high-volume bulk traffic might block low-volume traffic such as telnet sessions. He then explained why QOS tags on a LAN would not be a good idea.


There was also a discussion of channel bonding, in which two NICs are combined to get twice the bandwidth. Rafael Orinoco pointed out that to do this within a single switch, the switch needs to be capable of handling multiple VLANs (Virtual LANs). Donald Becker pointed out that you could avoid this requirement by using multiple switches, with one NIC going to one switch and the other NIC to a second switch. Donald also cautioned that channel bonding Gigabit Ethernet (GigE) NICs would only marginally increase bandwidth, because the receiving side is likely to see out-of-order packets, and reassembling them consumes extra CPU time. There is another option, however. Scali has an MPI implementation that bypasses the Linux kernel. Scali claims that by doing this they can channel bond GigE NICs and get 50-60% of the bandwidth of the second NIC.
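As a concrete illustration, channel bonding on Linux kernels of that era was typically set up along these lines. This is a minimal sketch: the interface names, addresses, and the round-robin mode are assumptions, and details varied by distribution.

```shell
# Load the bonding driver in round-robin mode, which stripes packets
# across both NICs -- this is the mode that can deliver them out of
# order on the receiving side, as Donald Becker noted.
modprobe bonding mode=balance-rr miimon=100

# Give the bond device an address, then enslave both GigE NICs.
ifconfig bond0 192.168.1.10 netmask 255.255.255.0 up
ifenslave bond0 eth0 eth1
```

Both physical NICs then appear as the single logical device bond0, and the switch (or pair of switches) must be able to cope with the same MAC address arriving on two ports.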

Opteron Thoughts

There was a very good discussion on the Beowulf mailing list about Opteron systems and what to look for in a system. The discussion began on the 24th of November, 2003, with Derek Richardson asking about sources of information for tuning the Linux kernel for Opteron processors. The ever present Donald Becker asked for some clarification about tuning, but mentioned that the easiest performance improvement could come from a proper memory DIMM (Dual In-line Memory Module) configuration to match the application layout. You may recall that the Opteron has its own local memory controller but that for multi-CPU systems, each CPU can see all of the memory. Don pointed out that understanding how the memory slots are filled and what BIOS options are used can make a huge difference. He mentioned that simply by choosing to interleave memory in the BIOS can produce a 30% difference on a dual CPU system.

Greg Lindahl supported this observation and said that Opteron systems can run counter to your intuition from other systems. In particular, filling all of the memory slots on a single-channel memory system can actually improve performance. He also went on to mention that the 2.6 kernel is better (faster) than the 2.4 kernel on Opteron processors.

Bill Broadley also mentioned that he has seen significant speedups by adjusting the node interleaving and the memory interleaving. He said that he has a benchmark that shows about 2 Gigabytes/sec (GB/sec) of memory bandwidth for a single Opteron and about 3 GB/sec for a dual Opteron, if properly configured. Another good benchmark for testing memory bandwidth is the STREAM benchmark. It is widely accepted, tests several aspects of memory access, and can also be used on dual-CPU systems. During this discussion, several people, including Greg Lindahl, Mike Snitzer, and Egan Ford, pointed out that there are utilities for forcing a process to run on a specific CPU. These utilities let you test the memory access speed of each processor separately.

Allocating Memory in a 2.4 kernel

On December 3, 2003, Roland Krause asked a fairly common question on the Beowulf mailing list: why couldn't he allocate more than 2 GB of memory in a contiguous chunk using the malloc function? He was running a code on a dual Athlon system with 4 GB of memory, running Red Hat 9 with a 2.4.20 kernel. He built the kernel with CONFIG_HIGHMEM and CONFIG_HIGHMEM4G turned on so the kernel could address all 4 GB of memory. He could allocate a total amount of memory close to the 3 GB limit. However, he could only allocate about 2 GB in a contiguous chunk, and he wanted to get closer to 3 GB in a single chunk. The unpatched Linux kernel, compiled with the typical 4 GB options mentioned previously, splits the address space into two parts: the first 3 GB belongs to user applications and the top 1 GB belongs to the kernel. This is typically called the 3/1 VM (Virtual Memory) split.

Mark Hahn quickly replied and gave us all a lesson in how memory works on 32-bit Linux. He replied that the address space should look something like the following:

  0-128MB        zero page
  128MB + small  program text
                 sbrk heap (grows up)
  1GB            mmap arena (grows up)
  3GB - small    stack base (grows down)
  3GB-4GB        kernel direct-mapped area

Mark explained that the 2 GB contiguous chunk was allocated in the mmap arena, which has an associated 2 GB limit; the other 1 GB was allocated in the sbrk heap. He also explained that by statically linking your code, you could bypass the mmap arena entirely and see almost 3 GB for the heap or stack. He also posted a very useful C code to explain and demonstrate all of these concepts.

Mark also mentioned that it is possible to move these memory limits around by modifying your kernel. You can move the default base of the mmap arena by adjusting the value of TASK_UNMAPPED_BASE in the kernel source and rebuilding and installing the kernel. You can also push the 3 GB barrier for user applications by adjusting the value of the kernel variable TASK_SIZE. There is a popular patch available that moves TASK_SIZE so that you have 3.5 GB available for user-space applications and 0.5 GB for the kernel. Mark also mentioned a patch that makes TASK_UNMAPPED_BASE a variable in the /proc filesystem that you can adjust on the fly, so you do not have to recompile the kernel.

Finally, Mark commented that there is a patch available that eliminates the kernel's 1 GB area entirely. This is the 4G:4G patch from the well-known kernel coder Ingo Molnar. The patch gives the kernel a full, separate 4 GB VM and each user process its own full 4 GB VM. However, the patch comes at a price: the extra address-space switching on every system call can hurt kernel performance.

Roland Krause replied back to the list that he modified his kernel, moving TASK_UNMAPPED_BASE down, so that he could allocate enough memory for his application.

This short discussion provides some valuable insight into how memory is allocated. Even though this isn't necessarily cluster specific, it does help you tune the nodes in your cluster for your application. It also points out how convenient it is to have the source code for the kernel: you can adjust certain kernel parameters to match your needs. If anyone has any comments on kernel 2.6 memory management, please add a comment below. A good summary of Linux memory management is listed in the sidebar.


Sidebar One: Links Mentioned in Column

Beowulf Mailing List

Scali MPI Connect

Channel Bonding

LAN switches

Linux Memory Management

STREAM benchmark

This article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux you may wish to visit Linux Magazine.

Jeff Layton has been a cluster enthusiast since 1997 and spends far too much time reading mailing lists. He occasionally finds time to perform experiments on clusters in his basement. He also has a Ph.D. in Aeronautical and Astronautical Engineering, and he's not afraid to use it.