My own personal cluster

I am now the proud operator of a three-node Hadoop cluster. Hadoop is a Java-based MapReduce implementation, an open-source version of the technology that allows Google to perform massive parallel computing on cheap, commodity hardware. I set it up with my OS X laptop running the NameNode (the distributed filesystem coordinator) and JobTracker (the distributed computing coordinator) and two Debian servers as the slaves.

The first problem I ran into was that Hadoop expects its install directory to sit at the exact same path on every machine. Since OS X and Debian lay out their filesystems very differently, I ended up creating extra directories on the Linux nodes to mirror my OS X path. Very sketchy, but I couldn’t find a better solution. Once I got that figured out, the cluster started humming along nicely.

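The next problem was an error message: java.io.IOException: Incompatible namespaceIDs. This is a known issue that typically shows up when the NameNode has been reformatted while a DataNode still holds data stamped with the old namespaceID. It was quickly resolved by following some simple directions.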
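For the curious, here is a rough sketch of how one could check for this mismatch by hand. It assumes the usual layout where each storage directory keeps a small properties-style VERSION file containing a namespaceID line; the sample file contents and the function names below are made up for illustration and are not from my cluster.

```python
# Hypothetical VERSION file contents; the IDs and dates are invented.
NAMENODE_SAMPLE = """#Thu Jan 01 00:00:00 UTC 2009
namespaceID=111
cTime=0
storageType=NAME_NODE
layoutVersion=-18
"""

DATANODE_SAMPLE = """#Thu Jan 01 00:00:00 UTC 2009
namespaceID=222
storageID=DS-example
cTime=0
storageType=DATA_NODE
layoutVersion=-18
"""


def read_namespace_id(version_text):
    """Return the namespaceID found in VERSION file text, or None if absent."""
    for line in version_text.splitlines():
        line = line.strip()
        # Skip comment lines and anything that is not a key=value pair.
        if line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip() == "namespaceID":
            return int(value.strip())
    return None


def namespace_ids_match(namenode_text, datanode_text):
    """True only if both texts carry a namespaceID and the two are equal."""
    name_id = read_namespace_id(namenode_text)
    data_id = read_namespace_id(datanode_text)
    return name_id is not None and name_id == data_id


if __name__ == "__main__":
    if not namespace_ids_match(NAMENODE_SAMPLE, DATANODE_SAMPLE):
        print("namespaceID mismatch between NameNode and DataNode")
```

In practice you would read the text from the VERSION files on the NameNode and on each DataNode instead of using the sample strings.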

I installed the brilliant Dumbo module to help me deal with the Python side of things. Now I can write Python MapReduce programs and run them on my mini-cluster. Why would I ever want to do this? I’m not really sure, but darn do I ever feel cool.
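To give a flavor of what a Python MapReduce program looks like, here is a minimal word-count sketch written in the mapper/reducer generator style that Dumbo uses. It is not the code I ran on the cluster: the run_locally helper is a hypothetical stand-in that fakes the shuffle step in memory, so the example works without Hadoop or Dumbo installed.

```python
from collections import defaultdict


def mapper(key, value):
    """Emit (word, 1) for every whitespace-separated word in one input line."""
    for word in value.split():
        yield word, 1


def reducer(key, values):
    """Sum all the counts emitted for a single word."""
    yield key, sum(values)


def run_locally(lines):
    """Hypothetical helper: simulate map, shuffle, and reduce in memory."""
    grouped = defaultdict(list)
    # Map phase: the key is just the line index here.
    for index, line in enumerate(lines):
        for word, count in mapper(index, line):
            grouped[word].append(count)
    # Reduce phase: each key is processed with all of its values.
    result = {}
    for word in sorted(grouped):
        for out_key, out_value in reducer(word, grouped[word]):
            result[out_key] = out_value
    return result


if __name__ == "__main__":
    # On a real cluster, my understanding is that dumbo.run(mapper, reducer)
    # would take the place of this local simulation.
    print(run_locally(["a b a", "b c"]))
```

The appeal is that the mapper and reducer are ordinary Python generators, so they can be tested on a laptop before being handed to the cluster.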
