Collaborative Filtering, Hadoop and the Hazards of Copy-Paste


I've been working on a new app idea lately: a recommender for Android programs. Basically, it looks at what you have installed (and possibly your ratings) and recommends other applications you might like, using other people's preferences in the same way as Amazon or the various music services do. In short: collaborative filtering.

There are different ways to do collaborative filtering, but they all get expensive when you have a lot of records to sort through. Two common approaches are 1) calculate the similarity of users, and recommend apps liked by similar users, or 2) calculate the similarity of apps, and recommend apps similar to the ones the user likes. I am trying the second way, known as item-based collaborative filtering or the model-based approach, which allows for fast queries at the cost of an expensive offline step that re-computes the item similarities every once in a while.
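To make the second approach concrete, here is a minimal sketch in Python. It is not the code from my project: the install data is made up, and Jaccard similarity over install sets is just one assumed choice of similarity measure among many. The function names `item_similarities` and `recommend` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Toy data (made up for illustration): user -> set of installed apps.
installs = {
    "alice": {"maps", "mail", "chess"},
    "bob": {"maps", "mail"},
    "carol": {"mail", "chess", "sudoku"},
    "dave": {"chess", "sudoku"},
}

def item_similarities(installs):
    """Offline step: Jaccard similarity for every pair of apps that share a user."""
    users_of = defaultdict(set)
    for user, apps in installs.items():
        for app in apps:
            users_of[app].add(user)
    sims = defaultdict(dict)
    for a, b in combinations(sorted(users_of), 2):
        shared = len(users_of[a] & users_of[b])
        if shared:
            score = shared / len(users_of[a] | users_of[b])
            sims[a][b] = score
            sims[b][a] = score
    return sims

def recommend(user_apps, sims, top_n=3):
    """Fast query: sum the similarities of apps the user does not have yet."""
    scores = defaultdict(float)
    for app in user_apps:
        for other, score in sims.get(app, {}).items():
            if other not in user_apps:
                scores[other] += score
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda app: (-scores[app], app))[:top_n]

sims = item_similarities(installs)
bob_recs = recommend(installs["bob"], sims)
print(bob_recs)
```

The pairwise loop in `item_similarities` is the expensive part that grows quickly with the number of apps and users, while `recommend` only touches the precomputed table.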

My initial tests in Python, based on the very interesting book "Programming Collective Intelligence," quickly became too slow with just a few thousand users and apps. Because there are already around 5,000 apps and a few million users of Android (with many more every day), there's no way the script would be able to handle the future growth of the platform.

Enter MapReduce and Hadoop. The explanation is better left to the pros, but simply put, MapReduce is a way of parallelizing certain types of computations across many computers and then merging the final results. With the availability of Amazon Web Services, which lets you rent a cluster of computers by the hour, it becomes possible to run a prohibitively expensive computation once every few days for just a couple of dollars. There are several different MapReduce frameworks out there, but I chose to try Hadoop, which is available on Amazon's services and used heavily by Yahoo and many others.
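For a feel of the programming model, here is a tiny single-process simulation in plain Python. It is only a sketch of the map, shuffle and reduce phases and does not use Hadoop at all; the names `map_words`, `reduce_counts` and `map_reduce` are made up for illustration.

```python
from collections import defaultdict
from itertools import chain

def map_words(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def reduce_counts(word, counts):
    """Reduce phase: merge all values that were emitted for one key."""
    return word, sum(counts)

def map_reduce(records, mapper, reducer):
    """Run the mapper on every record, group values by key, then reduce each group."""
    grouped = defaultdict(list)  # stands in for the shuffle/sort step
    for key, value in chain.from_iterable(mapper(r) for r in records):
        grouped[key].append(value)
    return dict(reducer(key, values) for key, values in grouped.items())

lines = ["the quick brown fox", "the lazy dog", "the fox"]
word_counts = map_reduce(lines, map_words, reduce_counts)
print(word_counts)
```

On a real cluster the mapper calls run on different machines and the grouping step moves data across the network, but the contract between the two functions is the same.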

There will be a lot more to say about Hadoop as I gain more experience. But all in all, it is pretty fun to re-think an algorithm, even just a little bit, to make it suitable for MapReduce. I *think* I have a correct implementation of item-based collaborative filtering running on my tiny 2-node cluster, and it's pretty cool!

I ran into one snag while trying to get my cluster running with the ubiquitous WordCount example for Hadoop. Like most people, I copy-pasted the source from the Hadoop tutorial and tried to run it. It ran, great! So then, instead of reading the rest of the documentation, I immediately tried to modify it. Eventually, I ended up trying to make the simplest change, returning Text instead of IntWritable from the Map operation, and -- WTF!?! I spent HOURS trying to figure out why there was a ClassCastException. So for other poor souls trying to modify the WordCount example, there are three things you need to do:

First, get the method signatures right. The Mapper has to output Text and the Reducer has to consume Text (Eclipse will help with that, of course).

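Second, add the lines that declare the Mapper's output types to the main() method. With the JobConf object the example uses, that means a call such as conf.setMapOutputValueClass(Text.class), plus conf.setMapOutputKeyClass(...) if the key type changes too. These tell Hadoop that the Mapper is not using the default for its output, which in the example is IntWritable.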

Third, and crucially important, remove the line "conf.setCombinerClass(Reduce.class);". Discovering that I needed to remove that single line took me about half a day of digging through the logs and Googling everything I could think of until I discovered this thread. Because it was part of the example, I assumed it was essential Hadoop boilerplate. It's not; it's an optimization. The Combiner is kind of like a pre-Reduce phase that runs on the Mapper's side and saves time by combining in-memory results instead of writing them to disk and combining them later. The Combiner has to accept the output types of the Mapper and emit those same types, so that its output is still suitable as input to the Reducer. Otherwise, it chokes.
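The Combiner contract is easier to see in a small simulation. The sketch below is plain Python, not Hadoop code, and every name in it is hypothetical. The mapper emits string values, the reducer turns them into an int, and reusing that reducer as the combiner feeds ints back into the reducer, which is roughly the mismatch behind my ClassCastException.

```python
from collections import defaultdict

def map_docs(doc_id, text):
    """Mapper: emit (word, doc_id) with a string value, like Text in Hadoop."""
    for word in text.lower().split():
        yield word, doc_id

def count_docs(word, doc_ids):
    """Reducer: consumes string values, emits an int (number of distinct docs)."""
    for d in doc_ids:
        if not isinstance(d, str):
            raise TypeError(f"reducer expected str, got {type(d).__name__}")
    return word, len(set(doc_ids))

def reducer_as_combiner(word, doc_ids):
    """The mistake: the reducer's int output leaks into the reducer's input."""
    yield count_docs(word, doc_ids)

def dedupe_combiner(word, doc_ids):
    """A valid combiner: string values in, string values out."""
    for d in sorted(set(doc_ids)):
        yield word, d

def group(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def run_job(splits, combiner=None):
    """Map each split, optionally combine per split, then shuffle and reduce."""
    shuffled = []
    for split in splits:
        local = [kv for doc_id, text in split for kv in map_docs(doc_id, text)]
        if combiner is not None:
            local = [kv for k, vs in group(local).items() for kv in combiner(k, vs)]
        shuffled.extend(local)
    return dict(count_docs(k, vs) for k, vs in group(shuffled).items())

splits = [
    [("d1", "hadoop map reduce"), ("d1", "hadoop again")],
    [("d2", "hadoop combiner")],
]
no_combiner = run_job(splits)
with_valid = run_job(splits, dedupe_combiner)
try:
    run_job(splits, reducer_as_combiner)
    broken_error = None
except TypeError as exc:
    broken_error = str(exc)
print(no_combiner, with_valid, broken_error)
```

Because a combiner is only an optimization, the job should produce the same result with or without it, which is what the `no_combiner` and `with_valid` runs check.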

Such is the peril of the copy-paster who runs code without really understanding all of it.