The Cranky Sysadmin
A world of technology, fun, and ignorant rants.

March 19, 2010

Hadoop? What kind of a name is Hadoop?

Filed under: Programming, System Administration — Cranky Sysadmin @ 8:03 am

Briefly, as I understand it, Hadoop is a distributed system that lets one aggregate data by processing it across a bunch (more than two) of computers. The name apparently comes from the name the lead developer’s son gave to his stuffed toy elephant. It’s not just the name that I dislike, though.

I’ve read that Hadoop was inspired by Google’s MapReduce architecture. A YouTube search turns up, as the first five hits, a Google class on how MapReduce works. So what is Hadoop? It seems to be a collection of Java processes that implement MapReduce across many servers. The components seem to include a distributed job scheduler, a distributed filesystem, and a framework for running generic jobs.
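
To make the moving parts concrete, here is roughly what a job looks like from the programmer’s side: the canonical word-count example, sketched against the Java MapReduce API of this era (adapted from the standard Hadoop tutorial, so treat it as an illustration rather than anything I run in production). The mapper emits a (word, 1) pair for every word it sees, and the reducer sums the pairs per word:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map phase: runs on many nodes, one task per input split.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one); // emit (word, 1)
          }
        }
      }

      // Reduce phase: receives all the counts for a given word, sums them.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

You package that into a jar and hand it to the job scheduler, which farms the map and reduce tasks out across the cluster.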

Why don’t I like Hadoop? For the record, I have only a passing familiarity with Hadoop, gained from reviewing the docs and a modest amount of administration. My opinions are mostly uninformed.

Hadoop includes a distributed filesystem… written in Java… This sends up a big red flag in my head. Filesystems are an OS-level concern in my experience. Writing a distributed filesystem in Java with no OS hooks seems, on its face, to start from an inefficient model. It’s also not a general-purpose filesystem, so the usual Unix (or Windows) tools for dealing with filesystems can’t be used on HDFS. Many current supercomputers use the Lustre filesystem. Lustre has been around for more than a decade and is a very mature product. I’m not suggesting that Lustre is right for Hadoop, but the lists of claimed features for the two are pretty similar. Since Lustre presents a POSIX filesystem interface, I can use all of my Unix tools on it.
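
To illustrate the tooling gap: even an ls or a cat against HDFS has to go through Hadoop’s own user-space client API (or the hadoop fs shell wrapper) instead of the kernel’s open/read path. A minimal Java sketch, with made-up paths purely for illustration:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsPeek {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml
        FileSystem fs = FileSystem.get(conf);     // HDFS client, all in user space

        // The HDFS equivalent of "ls /user/cranky" (path is hypothetical).
        for (FileStatus status : fs.listStatus(new Path("/user/cranky"))) {
          System.out.println(status.getPath());
        }

        // The HDFS equivalent of "cat /user/cranky/data.txt".
        BufferedReader in = new BufferedReader(
            new InputStreamReader(fs.open(new Path("/user/cranky/data.txt"))));
        String line;
        while ((line = in.readLine()) != null) {
          System.out.println(line);
        }
        in.close();
      }
    }

Lustre, by contrast, just mounts, so grep, find, and rsync all work on it without anyone writing a line of Java.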

Job scheduling is another well-traveled path that the Hadoop developers decided to walk again. They have arguments against the MPI and PVM schemes used in most Beowulf clusters. Maybe the arguments are valid, but I don’t see why one couldn’t build on such a mature technology to produce an adequate scheduler. In fact, people have built such schedulers.

On the other hand, HBase and the actual MapReduce components are novel. My main problem with Hadoop is that its developers decided to build components which are readily available elsewhere, are probably more efficiently written, are certainly more mature, and integrate well into the OS. One of the advantages of open source is that a project can borrow from other good projects. I think the interesting parts of Hadoop would have advanced further if they had used the well-studied portions of Beowulf clusters and other technologies as a base instead of rebuilding them.

1 Comment »

  1. >My opinions are mostly uninformed.

    Yup. :)

    Most systems (MPI, etc.) are built around the idea that the data is small, so moving the data to where the code executes is not a huge cost. One of the key points of Hadoop is that it solves the opposite problem: it moves the code to where the data resides. Doing this requires significant changes up and down the stack.

    Comment by Anonymous — March 19, 2010 @ 8:33 pm
