Cluster Networking: TCP/IP Over Ethernet

Packets 'n' Protocols: Dark Secrets of Datagrams

Networks are critical components of parallel cluster supercomputers. While really advanced clusters (and big iron parallel supercomputers) use high performance dedicated function networks to connect processors and their associated memory, Beowulf-style cluster computers got their start with TCP/IP over humble Ethernet, and even today there are likely well over ten clusters running inexpensive TCP/IP over Ethernet to one cluster that uses a dedicated and expensive high end network.

Even cluster designs that do have a high end network for interprocessor communications typically have TCP/IP and Ethernet to handle the mundane traffic associated with accessing network disk, distributing tasks, and connecting to nodes for installation and management purposes. Most server-class motherboards come with one or more Ethernet interfaces built right in. Ethernet is truly ubiquitous.

Basically, it is simply impossible to become an "expert" on cluster supercomputing without a working knowledge of TCP/IP over Ethernet, and such knowledge should begin from the ground up. This column is devoted to teaching you the essential structure of a TCP/IP packet, also known as a "datagram", as it sits inside an Ethernet packet.

We will begin our exploration by working our way up the first few layers of the ISO/OSI model we described in some detail in my last column. We won't pay a lot of attention to the physical medium per se, since Ethernet (more or less) runs over unshielded twisted pair, coaxial cable, fiber, or even wireless. Instead we'll start with the Ethernet interface itself.

Ethernet

Ethernet was originally developed at Xerox's famous Palo Alto Research Center (PARC), by Bob Metcalfe. It is a network that is largely defined by how it deals with Carrier Sense Multiple Access/Collision Detection (CSMA/CD). That is, by how adapters can share a medium.

It is impossible to communicate everything that you might need to know about Ethernet in a single column. Unfortunately, it is also impossible to provide you with a website that contains the complete, open specification for Ethernet protocols to fill in the inevitable gaps in the review below. This situation exists because "Ethernet" is defined not by the Request For Comments (RFCs) that constitute the truly open specification for TCP/IP but rather by the IEEE 802 family of documents, in particular 802.3. These documents are not free. In order to read them, you must pay (and pay quite a lot) for them.

Much as I would like to spend a few thousand words indicating how annoying and evil I find this, I will refrain. IEEE is more than a professional society, it is a business, and a fairly successful one at that. They provide a valuable service, no doubt, but when I compare this definition of "open specification" with that of the RFC, I find it somewhat lacking.

Fortunately, in spite of this barrier to the actual core technical documentation it is not terribly difficult to find websites that provide fairly complete technical reviews of Ethernet. These sites provide most of the critical information without violating the actual copyright on the technical documents themselves. Since we are not interested in actually engineering an Ethernet adapter (a process that would almost certainly require the detailed specification) but rather in understanding how they work in general terms, they are all that we need. Some of these are collected in Useful Links below, others can be googled up in short order on your own.

To remain concrete, we'll stick to 10/100 Mbps (Megabits per second) Ethernet and describe its "Media Access Control (MAC) Frame" (Ethernet packet, IEEE 802.3 and 802.3u). The details are different for gigabit Ethernet (802.3z) and for jumbo frames (a vendor extension that is not part of the IEEE 802.3 standard), but the idea is the same and the differences are minor.

An Ethernet packet has the following format:

Figure One: Ethernet frame (packet)
  +----------+------------+-----------------+-------+
  |  7    1  |  6   6   2 |      46-1500    |   4   |
  | PRE  SFD | DST SRC LT |     Data/Pad    |  FCS  |
  +----------+------------+-----------------+-------+
    CSMA/CD     Header          Message        CRC     

The frame is preceded by a preamble of seven bytes of alternating 1's and 0's that is not part of the actual packet -- it is used to grab the line and give receiving adapters time to synchronize to the incoming bit stream. The preamble is followed by a one-byte start-of-frame delimiter, for eight bytes in all.

Think of this metaphorically as ringing a little bell before beginning to talk, with a rule that if you hear a ringing bell, you must remain silent until the bell-ringer is done talking. If you start ringing your bell but (because of e.g. speed of sound delays) you hear somebody else ringing theirs before the mandated interval of bell-ringing ends, you and the other bell-ringer (who has presumably heard yours as well) must both remain silent for a randomly chosen but short period of time before again trying to ring your bell and speaking your message. In Ethernet parlance, this is known as a "collision".

Although it sounds excessively polite and convoluted enough for a Swift dystopia, this collision resolving mechanism is robust and is actually a lovely way to stick dozens of people into a large room with messages to shout to one another at random times and ensure that no two messages are ever shouted out at the same time (which would garble information irretrievably). If only my three boys and all their friends would consent to carry a little bell...
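The "randomly chosen but short period of time" can be made a bit more concrete. The column does not spell out the rule, so take what follows as a minimal sketch of the truncated binary exponential backoff usually described for CSMA/CD, and not as text lifted from the 802.3 documents. The names `backoff_slots` and `backoff_microseconds` are my own, and the constants (a 512 bit-time slot, a cap of ten doublings, sixteen attempts) are the commonly quoted values for 10/100 Mbps Ethernet.

```python
import random

SLOT_BIT_TIMES = 512   # one slot time for 10/100 Mbps Ethernet, in bit times
BACKOFF_CAP = 10       # the waiting range stops doubling after ten collisions
MAX_ATTEMPTS = 16      # after this many collisions the frame is abandoned


def backoff_slots(collisions, rng=random):
    """Slot times to stay silent after the n-th collision in a row."""
    if collisions < 1:
        raise ValueError("collisions must be at least 1")
    if collisions >= MAX_ATTEMPTS:
        raise RuntimeError("too many collisions, frame is dropped")
    k = min(collisions, BACKOFF_CAP)
    # Pick uniformly from 0 .. 2**k - 1, so the range doubles per collision.
    return rng.randrange(2 ** k)


def backoff_microseconds(collisions, mbps=10, rng=random):
    """The same delay in microseconds (one bit time is 1/mbps microseconds)."""
    return backoff_slots(collisions, rng) * SLOT_BIT_TIMES / mbps
```

After a first collision each bell-ringer waits either zero or one slot, so there is an even chance that they collide again. Every further collision doubles the range they draw from, which is what keeps a busy room from deadlocking.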

The frame itself actually begins with a mandatory header that contains the six-byte destination address, the six-byte source address, and a two byte length/type descriptor that can be used to describe the type or length of the contents.

The addresses referred to here are Ethernet addresses, also known as MAC addresses, and are supposed to be unique at the hardware level across all Ethernet adapters in the world. In practice, they often aren't and can sometimes even be specified in software. Altering them in this way can lead (and has led in my direct experience) to spectacular networking failures. The fact that they can be altered at all creates exploitable security holes if they are used (as they often are) as a means of host identification.

Two tools that you might find useful for determining MAC addresses are /sbin/ifconfig (for your own) and /sbin/arp (to determine the address associated with other systems on your network that have sent packets to your system). Read the man pages to see what they do.
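To see how little there is to the 14-byte header, here is a small sketch that pulls it apart. The helper names `parse_header` and `format_mac` are hypothetical, and the rule used to tell a length from a type (values up to 1500 are lengths, values of 0x0600 and above are types) is the usual convention, not something stated in this column.

```python
import struct

# Destination, source, length/type; "!" selects network (big-endian) byte order.
HEADER = struct.Struct("!6s6sH")


def format_mac(raw):
    """Render six raw bytes in the usual colon-separated hexadecimal form."""
    return ":".join(f"{b:02X}" for b in raw)


def parse_header(frame):
    """Split the first 14 bytes of a frame into its three header fields."""
    if len(frame) < HEADER.size:
        raise ValueError("frame is shorter than the 14-byte header")
    dst, src, lt = HEADER.unpack_from(frame)
    if lt <= 1500:
        kind = "length"      # number of data bytes that follow
    elif lt >= 0x0600:
        kind = "type"        # protocol carried inside, e.g. 0x0800 for IPv4
    else:
        kind = "undefined"
    return {"dst": format_mac(dst), "src": format_mac(src), "lt": lt, "kind": kind}
```

Feeding it a frame whose first bytes are `ff ff ff ff ff ff 02 00 00 00 00 01 08 00` (a broadcast destination, a made-up source address, and the IPv4 type) yields `FF:FF:FF:FF:FF:FF`, `02:00:00:00:00:01`, and kind `"type"`.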

The Ethernet header is followed by the packet contents, the actual data you wish to send. Note that there is a 46 byte minimum message length. If your actual data is shorter than this, it must be padded e.g. with zeros out to this minimum length.

Finally, the frame terminates with a frame check sequence, a 32 bit cyclic redundancy check (CRC) computed by the sending hardware and recomputed by the receiving hardware. They must match or the frame is rejected as damaged.

If one adds things up, one will see that the minimum length of an Ethernet packet is 64 bytes (plus the preamble, which is usually ignored) and the maximum length is 1518 bytes -- 14 header, 1500 data, and 4 CRC. The 8 byte bell-ringing period is not counted -- it is part of the minimum interpacket latency. Often the Ethernet header itself isn't counted in discussions as it is fixed for all encapsulated protocols. The 1500 byte maximum data component is called the maximum transmission unit (MTU).
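The arithmetic above is easy to check by assembling a frame in software. This is only a sketch: real adapters do the padding and the CRC in hardware, `build_frame` and `fcs_ok` are hypothetical names, and I am assuming that `zlib.crc32` (which uses the same CRC-32 polynomial as Ethernet) packed in little-endian order gives the check bytes in the order they usually show up in packet captures.

```python
import struct
import zlib

MIN_DATA = 46     # shorter payloads are padded up to this length
MAX_DATA = 1500   # the MTU


def build_frame(dst, src, length_type, data):
    """Header + padded data + 4-byte frame check sequence (no preamble)."""
    if len(dst) != 6 or len(src) != 6:
        raise ValueError("addresses must be six bytes each")
    if len(data) > MAX_DATA:
        raise ValueError("data exceeds the 1500-byte MTU")
    header = struct.pack("!6s6sH", dst, src, length_type)
    body = header + data.ljust(MIN_DATA, b"\x00")   # pad with zeros if short
    fcs = struct.pack("<I", zlib.crc32(body))        # assumed wire byte order
    return body + fcs


def fcs_ok(frame):
    """Recompute the CRC the way a receiver would and compare."""
    body, fcs = frame[:-4], frame[-4:]
    return struct.pack("<I", zlib.crc32(body)) == fcs
```

A two-byte payload comes out as a 64-byte frame (14 + 46 + 4) and a 1500-byte payload as 1518 bytes. Flipping any single bit of the result makes `fcs_ok` return `False`, which is the "rejected as damaged" case.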

The actual upper bound of the number of bytes that can be safely checked with the CRC is around 12000, a number much greater than the standard Ethernet MTU. It costs system resources to build a packet, and the more packets a message has to be broken into, the greater the overhead of sending the message. This potential scaling efficiency has led to an extension of Ethernet called jumbo frames that have a larger MTU. These larger packets can transmit data in pages or blocks that match those used by the kernel or important applications such as NFS (for example, in chunks of 4096 or 8192 bytes plus protocol header length) which increases speed and lowers the overhead. Jumbo frames are not supported by all hardware, but they should be able to coexist within reason with normal MTU frames.
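A quick back-of-the-envelope calculation shows where the savings come from. The sketch below counts only Ethernet-level bytes (preamble, header, and CRC) and ignores the IP and TCP headers and the gap between frames. The 9000-byte jumbo MTU is a common choice that I am assuming for illustration; the column does not name a value.

```python
PREAMBLE = 8          # preamble plus start-of-frame delimiter
HEADER_AND_FCS = 18   # 14-byte header + 4-byte CRC
MIN_DATA = 46         # minimum data field, short payloads are padded


def frames_needed(message_bytes, mtu=1500):
    """How many frames it takes to carry a message (at least one)."""
    return max(1, (message_bytes + mtu - 1) // mtu)


def wire_bytes(message_bytes, mtu=1500):
    """Total bytes on the wire, counting per-frame overhead and padding."""
    total = 0
    remaining = message_bytes
    for _ in range(frames_needed(message_bytes, mtu)):
        chunk = min(remaining, mtu)
        total += PREAMBLE + HEADER_AND_FCS + max(chunk, MIN_DATA)
        remaining -= chunk
    return total
```

An 8192-byte block needs six standard frames (8348 bytes on the wire) but fits in a single 9000-byte jumbo frame (8218 bytes). The byte savings are modest. The bigger win is that the system builds and handles one packet instead of six.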

An Ethernet packet could be used directly to transmit raw data. However, it rarely is. The reason is that Ethernet addresses are not hierarchically organized and hence are not routable in and of themselves over wide area networks -- you have to "be in the same room" and have to have an ARP (address resolution protocol, RFC 826) table that maps particular hosts to their Ethernet addresses. For example, the MAC address of the wireless adapter of my laptop is 00:04:5A:CE:7F:9B (six bytes, each represented as a hexadecimal number). This knowledge will not help you send it a packet, though, unless you happen to be on the same network as my laptop, which for all practical purposes means "inside my house". And you aren't.

Now, I'd really like for my laptop to be able to send packets to machines that are far away on different networks. To accomplish that, we need an "address" that:

  • is hierarchically organized, so addresses can be catenated into a network that is maskable. MAC addresses aren't, as a LAN might have lots of Ethernet adapters with completely different and unrelated addresses.
  • is hierarchically routable via a concatenation of hops between networks directed by local routing tables (don't worry, routing is beyond the scope of this article).
  • is name-resolvable, as I don't want to call my laptop 192.168.1.129 -- this is too hard to remember and might change if I carry my laptop from one place to another and connect it to different networks.

Basically, networks and systems on those networks should have human-recognizable, hierarchically organized names that resolve into machine maskable, hierarchically organized addresses that otherwise will function much like an Ethernet address functions. Also, we'd like to be able to establish reliable connections with certain abstractions. These requirements are the basis of the Internet Protocol (and the Internet itself) and are the subject of many RFCs.
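"Maskable" is easier to see than to describe. Here is a minimal sketch using Python's standard `ipaddress` module and my laptop's address from the list above. The /24 prefix (netmask 255.255.255.0) is an assumption typical of a home network, and `same_network` and `mask_by_hand` are hypothetical helper names.

```python
import ipaddress


def same_network(a, b, prefix=24):
    """True if address b falls in the network that address a belongs to."""
    net = ipaddress.ip_network(f"{a}/{prefix}", strict=False)
    return ipaddress.ip_address(b) in net


def mask_by_hand(address, prefix):
    """Do the masking explicitly: keep the top `prefix` bits, zero the rest."""
    addr = int(ipaddress.ip_address(address))
    mask = (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF
    return str(ipaddress.ip_address(addr & mask))
```

With a /24 prefix, `mask_by_hand("192.168.1.129", 24)` gives `192.168.1.0`, the network my laptop lives on, and `same_network("192.168.1.129", "192.168.2.7")` is `False`. A router can make that decision with a single bitwise AND, which is exactly what a flat MAC address does not allow.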


    Creative Commons License
    ©2005-2012 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.