Gigabit Switches, Channel Bonding, Opterons, and Large Memory Allocations

Published on Sunday, 18 December 2005 19:00
Written by Jeff Layton
In this installment of the Best of the Beowulf Mailing List, we look at Gigabit switches, channel bonding, Opterons, and large memory allocations. You can consult the list archives for the actual conversations.

Choosing a Gigabit Switch

On November 11, 2003, Keyan Mehravaran said that he was connecting eight dual-Xeon PCs with on-board Gigabit NICs (Network Interface Cards). He asked about the relative advantages of a managed switch versus an unmanaged switch, and about channel bonding Gigabit NICs. An unmanaged switch is one whose management is entirely internal and cannot be configured. A managed switch is one that is configurable via an interface such as a web browser or serial terminal. Donald Becker replied that in his opinion managed switches are frequently not a good choice because the switch can be set to work in "flawed modes" that can cause problems later. He also explained in another posting that he thought auto-negotiation, which unmanaged switches perform by default, was a good thing because it is automatic, transparent, and extensible. Moreover, he pointed out that most switches now use Ethernet flow control, and users don't know this because it is configured during auto-negotiation. The bottom line is that things just seem to work better.

The discussion about managed versus unmanaged switches turned to an interesting topic when Mark Hahn mentioned that he spoke to a large switch manufacturer and asked about some HPC features, such as adding a special Quality of Service (QOS) tag for small packets to give them preferential treatment. He also mentioned that he would like to get per-port performance statistics from the switch. Donald Becker mentioned that QOS tags already existed but cautioned against using them on LANs (Local Area Networks). In his opinion they are very good for multi-traffic, multi-path WANs (Wide Area Networks), where high-volume bulk traffic might block low-volume traffic such as telnet sessions. He then explained why QOS tags on a LAN would not be a good idea.


There was also a discussion of channel bonding, in which two NICs are combined to get twice the bandwidth. Rafael Orinoco pointed out that to do this within a single switch, the switch needs to be capable of handling multiple VLANs (Virtual LANs). Donald Becker pointed out that you could avoid this requirement by using multiple switches, with one NIC going to one switch and the other NIC to a second switch. Donald also cautioned that channel bonding Gigabit Ethernet (GigE) NICs would only marginally increase bandwidth, because the receiving side is likely to see out-of-order packets, and reassembling them consumes extra CPU time. There is another option, however. Scali has an MPI implementation that bypasses the Linux kernel. Scali claims that by doing this they can channel bond GigE NICs and get 50-60% of the bandwidth of the second NIC.
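As a concrete illustration, channel bonding on Linux kernels of that era was typically set up along these lines. This is a minimal sketch: the interface names, addresses, and the round-robin mode are assumptions, and details varied by distribution.

```shell
# Load the bonding driver in round-robin mode, which stripes packets
# across both NICs -- this is the mode that can deliver them out of
# order on the receiving side, as Donald Becker noted.
modprobe bonding mode=balance-rr miimon=100

# Give the bond device an address, then enslave both GigE NICs.
ifconfig bond0 192.168.1.10 netmask 255.255.255.0 up
ifenslave bond0 eth0 eth1
```

Both physical NICs then appear as the single logical device bond0, and the switch (or pair of switches) must be able to cope with the same MAC address arriving on two ports.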

Opteron Thoughts

There was a very good discussion on the Beowulf mailing list about Opteron systems and what to look for in a system. The discussion began on the 24th of November, 2003, with Derek Richardson asking about sources of information for tuning the Linux kernel for Opteron processors. The ever present Donald Becker asked for some clarification about tuning, but mentioned that the easiest performance improvement could come from a proper memory DIMM (Dual In-line Memory Module) configuration to match the application layout. You may recall that the Opteron has its own local memory controller but that for multi-CPU systems, each CPU can see all of the memory. Don pointed out that understanding how the memory slots are filled and what BIOS options are used can make a huge difference. He mentioned that simply by choosing to interleave memory in the BIOS can produce a 30% difference on a dual CPU system.

Greg Lindahl supported this observation and said that Opteron systems can run counter to your intuition from other systems. In particular, filling all of the memory slots on a single-channel memory system can actually improve performance. He also went on to mention that the 2.6 kernel is better (faster) than the 2.4 kernel on Opteron processors.

Bill Broadley also mentioned that he has seen significant speedups by adjusting the node interleaving and the memory interleaving. He said that he has a benchmark that shows about 2 Gigabytes/sec (GB/sec) of memory bandwidth for a single Opteron and about 3 GB/sec for a dual Opteron, if properly configured. Another good benchmark for testing memory bandwidth is the STREAM benchmark. It is widely accepted, tests several aspects of memory access, and can also be used on dual-CPU systems. During this discussion, several people, including Greg Lindahl, Mike Snitzer, and Egan Ford, pointed out that there are utilities for forcing a process to run on a specific CPU. These utilities let you test the memory access speed of each processor separately.

Allocating Memory in a 2.4 kernel

On December 3, 2003, Roland Krause asked a fairly common question on the Beowulf mailing list: why couldn't he allocate more than 2 GB of memory in a contiguous chunk using the malloc function? He was running a code on a dual Athlon system with 4 GB of memory, running Red Hat 9 with a 2.4.20 kernel. He built the kernel with CONFIG_HIGHMEM and CONFIG_HIGHMEM4G turned on so the kernel could address all 4 GB of memory. He could allocate a total amount of memory close to the 3 GB limit. However, he could only allocate about 2 GB in a contiguous chunk, and he wanted to get closer to 3 GB in a single chunk. The unpatched Linux kernel, compiled with the typical 4 GB options mentioned previously, splits the address space into two parts: the first 3 GB belongs to user applications and the top 1 GB belongs to the kernel. This is typically called the 3/1 VM (Virtual Memory) split.

Mark Hahn quickly replied and gave us all a lesson in how memory works on 32-bit Linux. He replied that the address space should look something like the following:

  0-128MB        zero page
  128MB + small  program text
                 sbrk heap (grows up)
  1GB            mmap arena (grows up)
  3GB - small    stack base (grows down)
  3GB-4GB        kernel direct-mapped area

Mark explained that the 2 GB contiguous chunk was allocated in the mmap arena, which has an associated 2 GB limit; the other 1 GB was allocated in the sbrk heap. He also explained that by statically linking your code, you could bypass the mmap arena entirely and see almost 3 GB for the heap or stack. He also posted a very useful C code to explain and demonstrate all of these concepts.

Mark also mentioned that it is possible to move these memory limits around by modifying your kernel. You can move the default base of the mmap arena by adjusting the value of TASK_UNMAPPED_BASE in the kernel source and rebuilding and installing the kernel. You can also push the 3 GB barrier for user applications by adjusting the value of the kernel variable TASK_SIZE. There is a popular patch available that moves TASK_SIZE so that you have 3.5 GB available for user-space applications and 0.5 GB for the kernel. Mark also mentioned a patch that makes TASK_UNMAPPED_BASE a variable in the /proc filesystem that you can adjust on the fly, so you do not have to recompile the kernel.

Finally, Mark commented that there is a patch available that eliminates the kernel's 1 GB area entirely. This is the 4G:4G patch from the well-known kernel coder Ingo Molnar. The patch gives the kernel a full, separate 4 GB VM and each user process its own full 4 GB VM. However, the patch comes at a price: the extra address-space switching on every system call can hurt kernel performance.

Roland Krause replied back to the list that he modified his kernel, moving TASK_UNMAPPED_BASE down, so that he could allocate enough memory for his application.

This short discussion provides some valuable insight into how memory is allocated. Even though this isn't necessarily cluster specific, it does help you tune the nodes in your cluster for your application. It also points out how convenient it is to have the source code for the kernel: you can adjust certain kernel parameters to match your needs. If anyone has any comments on kernel 2.6 memory management, please add a comment below. A good summary of Linux memory management is listed in the sidebar.


Sidebar One: Links Mentioned in Column

Beowulf Mailing List

Scali MPI Connect

Channel Bonding

LAN switches

Linux Memory Management

STREAM benchmark

This article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux you may wish to visit Linux Magazine.

Jeff Layton has been a cluster enthusiast since 1997 and spends far too much time reading mailing lists. He occasionally finds time to perform experiments on clusters in his basement. He also has a Ph.D. in Aeronautical and Astronautical Engineering, and he's not afraid to use it.