Clos networks, Numactl and Multi-Core chips, and Parallel Storage


Vincent thanked Stuart but also wanted to know why the memory latency for dual-core CPUs was worse than for single-core CPUs. Stuart responded that, "The first thing to note is that as you add cpu's the cost of the cache snooping goes up dramatically. The latency of a 4 cpu (single core) Opteron system is (if my memory serves me correctly) around 120ns. Which is significantly higher than the latency of a dual processor system (I think it scales roughly as O(n^2) where n is the number of cpu's)." Many people are not familiar with the "cache snooping" that Stuart mentions. The term describes the cores all telling each other what they have in cache, so that they can share cached data and keep it consistent (getting data from a cache is faster than getting it from main memory, even if the cache is on another chip). Cache snooping is one reason you don't see Opteron systems with large socket counts much anymore (I've been told that SGI uses hardware to help their cache snooping).

Stuart went on to say, "Now, with a dual core system, you are effectively halving the bandwidth/cpu over the hyper transport AND increasing the cpu count, thus increasing the amount of cache snooping required. The end result is drastically blown-out latencies." This is a very good explanation of what is going on in the system. Mikhail Kuzminsky went on to discuss some details about cache snooping (or cache coherency) on the Opteron. In his estimation, the 30% increase in memory latency that Vincent was seeing on the dual-core compared to the single-core is due to cache coherency effects.

Unfortunately, Vincent has an "abrupt" email style. He disagreed with Stuart's explanation and accused him of mixing up the concepts of latency and bandwidth. Stuart replied with a further explanation of cache snooping and, in particular, discussed how latency and bandwidth can be linked together. A number of other people chimed in to say that Stuart was correct.

This was a very good discussion, albeit brief, about the importance of thinking about the placement of numerically intensive tasks on multi-core systems, particularly NUMA architectures. And guess what, this problem is only going to get more important as the number of cores per socket increases. In the case of MPI codes, some libraries have the ability to schedule jobs on relatively lightly loaded CPUs (Scali MPI Connect is one that comes to mind, but I know there are more). Also, Doug Eadline has talked about this problem in a number of his columns in Linux Magazine.
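To make the placement idea concrete, here is a minimal sketch of pinning a process to the cores of one socket from Python on Linux. It is not what any of the posters used; it is only an illustration. The helper names (cores_for_rank, pin_to_cores) are hypothetical, and the sketch assumes cores are numbered socket by socket, which is not guaranteed (check the real layout with numactl --hardware or lscpu). It also only sets CPU affinity and does not bind memory; from the command line, something like numactl --cpunodebind=0 --membind=0 ./app handles both.

```python
import os


def cores_for_rank(rank, cores_per_socket, n_sockets):
    """Return the core IDs of the socket that a given rank is assigned to.

    Assumption: core IDs are contiguous per socket (socket 0 owns cores
    0..cores_per_socket-1, and so on). Real machines may number cores
    differently, so verify the topology before relying on this.
    """
    socket = rank % n_sockets  # spread ranks round-robin over the sockets
    first = socket * cores_per_socket
    return set(range(first, first + cores_per_socket))


def pin_to_cores(cores):
    """Pin the calling process to the requested cores, if possible.

    Returns the resulting affinity set, or None when pinning is not
    supported on this platform or none of the requested cores are
    available to this process (affinity is then left untouched).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None  # os.sched_setaffinity is not available on every platform
    allowed = os.sched_getaffinity(0)  # cores this process may use right now
    usable = set(cores) & allowed
    if not usable:
        return None
    os.sched_setaffinity(0, usable)  # pid 0 means the calling process
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    # Hypothetical machine: 2 sockets with 2 cores each; this is rank 1.
    wanted = cores_for_rank(rank=1, cores_per_socket=2, n_sockets=2)
    print("requested cores:", sorted(wanted))
    print("affinity after pinning:", pin_to_cores(wanted))
```

In an MPI job the rank would come from the MPI library rather than being hard-coded, and many MPI launchers offer their own binding options that make a hand-rolled approach like this unnecessary.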

Parallel Storage

On Aug 11, 2005, Brian R. Smith asked about using a parallel file system as a centralized file system for their clusters. He went on to mention that they wanted a common /home for all of their clusters to limit the copying of data. Brian also mentioned that they had been using PVFS2 with great success. In particular, he mentioned that it had been very stable (no unscheduled outages in 8 months as of the posting of the email).

Joe Mack responded that he thought AoE (ATA over Ethernet) was cheap (although he had no experience with it).

Then one of the PVFS2 developers, Rob Latham, jumped on the thread. He pointed out that the typical access pattern of /home is quite different from the workloads PVFS2 handles well, so PVFS2 is probably not appropriate for it. But, then again, PVFS2 wasn't designed for that. He suggested an NFS-mounted /home on the clusters and then a /scratch using PVFS2.


Mark Hahn also responded that he forced users on his clusters to put their "real" data in appropriate places. They gave each user a small quota for /home but lots of space in other file systems: /work, which is shared across clusters, /scratch, which is local to the nodes, etc. Mark also pointed out that you should choose RAID schemes based upon the usage pattern intended for the file system.

Then David S. and Guy Coates started talking about creating a "storage cluster", that is, a set of disparate hardware with some sort of cluster file system on top. David initially thought about using NBD, iSCSI, or something else to tie everything together and then using GFS on top of it. Guy responded that you could indeed do this and that they had done it using GPFS instead of GFS. He said that they have had good reliability with this system.

This thread is very short, but it does point out some interesting things. First, cluster users and administrators, even 2 years ago, were starting to think about a centralized file system for their clusters. This makes a lot of sense because as data files grow, it becomes impractical to move large amounts of data to and from various clusters, file systems, and even the desktops of the users. It makes much more sense to have a centralized pool of storage and use that for the clusters. Since you have one pool of storage, it also makes sense for it to be a high-performing file system that can scale in terms of capacity and performance.

I think one of the more interesting ideas in this thread was how to organize your data. Mark Hahn pointed out that there is more than one way to skin the storage cat and his way seems to work well for his users. There have been other discussions on the Beowulf list about organizing your data. There is really something to be said for sitting down, determining what most of the data looks like (size, type, etc.), and then matching the file system layout to your data pattern.
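As a starting point for that kind of survey, here is a minimal sketch that walks a directory tree and tallies files by size. It is only an illustration, not something from the thread: the bucket boundaries and the function names (bucket_for, survey) are arbitrary assumptions, and on a large file system you would want to run something like this during a quiet period, since it stats every file.

```python
import os
import sys
from collections import Counter

# Hypothetical size buckets: (upper bound in bytes, label).
# Pick boundaries that are meaningful for your own file systems.
BUCKETS = [
    (4 * 1024, "<= 4 KiB"),
    (1024 ** 2, "<= 1 MiB"),
    (1024 ** 3, "<= 1 GiB"),
]
LARGEST = "> 1 GiB"


def bucket_for(size):
    """Return the label of the first bucket whose upper bound fits size."""
    for limit, label in BUCKETS:
        if size <= limit:
            return label
    return LARGEST


def survey(root):
    """Walk root and return (file count per bucket, total bytes per bucket)."""
    counts = Counter()
    total_bytes = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat so that symbolic links are not followed
                size = os.lstat(path).st_size
            except OSError:
                continue  # file vanished or is unreadable; skip it
            label = bucket_for(size)
            counts[label] += 1
            total_bytes[label] += size
    return counts, total_bytes


if __name__ == "__main__":
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    counts, total_bytes = survey(root)
    for _limit, label in BUCKETS + [(None, LARGEST)]:
        print(f"{label:>10}: {counts[label]:8d} files, {total_bytes[label]} bytes")
```

A tree dominated by small files points toward something like the NFS-mounted /home that Rob suggested, while a tree where most of the bytes sit in a few very large files is a better candidate for a parallel /scratch.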

This article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux you may wish to visit Linux Magazine.


Dr. Jeff Layton hopes to someday have a 20 TB file system in his home computer (donations gladly accepted) so he can store all of the postings to all of the mailing lists he monitors. He can sometimes be found lounging at a nearby Fry's, dreaming of hardware and drinking coffee (but never during working hours).


    Creative Commons License
    ©2005-2012 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.