
Hadoop clusters have been growing in datacenters at a rapid pace. Is Hadoop the new corporate HPC?

Apache Hadoop has been generating a lot of headlines lately. For those who are not aware, Hadoop is an open source project that provides a distributed file system and a MapReduce framework for processing massive amounts of data. The primary hardware used for Hadoop is clusters of commodity servers. File sizes can easily reach the petabyte range, and a single job can easily use hundreds or thousands of compute servers.

Hadoop also has many components that live on top of the core Hadoop Distributed File System (HDFS) and the MapReduce mechanism. There are some interesting similarities between an HPC cluster and a Hadoop cluster, but how much crossover there is between the two disciplines depends on the application. Hadoop's strength lies in the sheer size of the data it can process. It is also highly redundant and can tolerate node failures without halting user jobs.

Who Uses Hadoop

There are many organizations that use Hadoop on a daily basis, including Yahoo, Facebook, American Airlines, eBay, and many others. Hadoop is designed to allow users to manipulate large unstructured or unrelated data sets. It is not intended to be a replacement for an RDBMS. For example, Hadoop can be used to scan weblogs, on-line transaction data, or web content, all of which are growing each year.

MapReduce

To many HPC users, MapReduce is a methodology used by Google to process large amounts of web data. Indeed, the now famous Google MapReduce paper was the inspiration for Hadoop.

The MapReduce idea is quite simple and, when used in parallel, can provide extremely powerful search and compute capabilities. There are two major steps in the MapReduce process. If you have not figured it out, there is a "Map" step followed by a "Reduce" step. Some are surprised to learn that mapping is done all the time in the *nix world. For instance, consider:

grep "the" file.txt

In this simple example we are "mapping" all the occurrences, by line, of the word "the" in a text file. Though the task seems somewhat trivial, suppose the file were one terabyte in size. How could we speed up the "mapping" step? The answer is also simple: break the file into chunks and put each chunk on a separate computer. The results can be combined when the job is finished, as there are no dependencies within the map step. The popular mpiBLAST takes the same approach by breaking the human genome file into chunks and performing the "BLAST mapping" on separate cluster nodes.
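
To make the split-and-map idea concrete, the same pattern can be sketched on a single machine with ordinary shell tools. This is only an illustration of the concept, not how Hadoop is run; the chunk count and file names are arbitrary, and the -n l/4 option assumes GNU split.

split -n l/4 file.txt chunk.            # four chunks, lines kept intact (GNU split)
for c in chunk.*; do
    grep "the" "$c" > "map.$c" &        # each "map" task scans only its own chunk
done
wait                                    # the map tasks are independent, so just wait for all of them
cat map.chunk.* > matches.txt           # gather the per-chunk results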

Suppose we wanted to calculate the total number of lines with "the" in them. The simple answer is to pipe the results into wc (word count):

grep "the" file.txt | wc -l

We have just introduced a "Reduce" step. In our large-file parallel version, each computer would perform the above step (grep and wc) on its own chunk and send the count to a master node. That, in a nutshell, is how MapReduce works. Of course, there are more details, like key-value pairs and "the shuffle," but for the purposes of this discussion, MapReduce can be that simple.
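
Reusing the chunks from the sketch above, the parallel line count looks like the following. Each loop iteration stands in for a worker node, and the final awk line plays the role of the master summing the partial counts; again, this is a single-machine sketch of the idea, not how Hadoop itself is invoked.

for c in chunk.*; do
    grep "the" "$c" | wc -l > "count.$c" &       # each "worker" reduces its chunk to a partial count
done
wait
cat count.chunk.* | awk '{total += $1} END {print total}'    # final reduce: sum the partial counts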

With Hadoop, large files are placed in HDFS, which automatically breaks them into chunks and spreads the chunks across the cluster (usually in a redundant fashion). In this way, parallelizing the map process is trivial: all that needs to happen is to place a separate map process on each node that holds a file chunk. The results are then sent to reduce processes, which also run on the cluster. As you can imagine, large files produce large amounts of intermediate data, so multiple reducers help keep things moving. There are some aspects of the MapReduce process worth noting (a concrete Hadoop sketch follows the list below):

  • MapReduce can be transparently scalable. The user does not need to manage data placement or the number of nodes used for a job, and there is no dependency on the underlying hardware.
  • Data flow is highly defined and moves in one direction, from the map to the reduce; there is no communication between independent mapper or reducer processes.
  • Since processing is independent, failover is trivial: a failed process can simply be restarted, provided the underlying file system is redundant, as HDFS is.
  • MapReduce, while powerful, does not fit all problem types.
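
To connect this back to Hadoop itself, the Hadoop Streaming utility lets ordinary executables act as the map and reduce steps, so the grep/wc example above can be run over a file stored in HDFS. The following is a hedged sketch rather than a definitive recipe: the streaming jar path and the HDFS paths are placeholders, the single-reducer setting is an assumption, and older releases use -file and mapred.reduce.tasks instead of the options shown here.

mapper.sh (emits every input line containing "the"; exits 0 even when a chunk has no matches, since a nonzero mapper exit status is treated as a task failure by default):

#!/bin/sh
grep "the"
exit 0

reducer.sh (counts the lines handed to this reducer):

#!/bin/sh
wc -l

A possible job submission, with a single reducer so the output is one total:

hadoop jar /path/to/hadoop-streaming.jar \
    -D mapreduce.job.reduces=1 \
    -files mapper.sh,reducer.sh \
    -input /data/file.txt -output /data/the-count \
    -mapper "sh mapper.sh" -reducer "sh reducer.sh"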

In order to understand the differences between Hadoop and a typical HPC cluster, let's compare several aspects of both systems.

Hardware

Many modern HPC clusters and Hadoop clusters use commodity hardware, primarily x86-based servers. Hadoop clusters usually include a large amount of local disk (used by HDFS), while many HPC clusters rely on NFS or a parallel file system for cluster-wide storage. In the case of HPC, there are diskless and diskful nodes, but in terms of data storage, a separate group of hardware is often used for global file storage. HDFS daemons run on all nodes and store data chunks locally; HDFS does not support the POSIX standard. Hadoop is designed to move the computation to the data, hence the need for HDFS to be distributed throughout the cluster.
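
As a small illustration of the non-POSIX point, HDFS is reached through its own shell and tools rather than an ordinary mount. The paths below are placeholders, and the exact commands and report format vary somewhat between Hadoop versions:

hdfs dfs -put file.txt /data/file.txt                 # copy a local file into HDFS; it is split into blocks automatically
hdfs dfs -ls /data                                    # list files with the HDFS shell, not the local one
hdfs fsck /data/file.txt -files -blocks -locations    # show which DataNodes hold each block and its replicas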

In terms of networking, Hadoop clusters almost exclusively use Gigabit Ethernet (GigE). As the price continues to fall, newer systems are starting to adopt 10 Gigabit Ethernet (10GigE). Although there are many GigE and 10GigE HPC clusters, InfiniBand is often the preferred network.

Many new HPC clusters are using some form of acceleration hardware on the nodes. These additions come primarily from NVIDIA (Kepler GPUs) and Intel (Xeon Phi). In some cases they require additional programming, and they can provide substantial speed-ups for certain applications.


