Principal Data Scientist, Practice Lead for Data Science, AI, Big Data Technologies at Teradata. O’Reilly author on distributed computing and machine learning.

Natalino leads the definition, design and implementation of data-driven financial and telecom applications. He has previously served as Enterprise Data Architect at ING in the Netherlands, focusing on fraud prevention/detection, SoC, cybersecurity, customer experience, and core banking processes.

Prior to that, he worked as a senior researcher at Philips Research Laboratories in the Netherlands, on the topics of system-on-a-chip architectures, distributed computing, and compilers. All-round technology manager, product developer, and innovator with a 15+ year track record in research, development, and management of distributed architectures, scalable services, and data-driven applications.

Monday, March 24, 2014

Counting words in Hadoop

Counting word occurrences in documents seems to be the new pastime for software engineers and data scientists when they need to lean back a little. There are many ways to do it, and each language has its own pros and cons.

In this article, I illustrate basic word counting as executed in single-node, single-threaded environments in various languages.

Next to that, I illustrate how to compute word counts as a distributed process, using Hadoop as the distributed run-time system.

Within Hadoop itself there are many tools to choose from: Pig, Hive, Cascading, Scalding, and so on. Things are getting even more interesting now that Tez, Spark, and Storm are pushing the boundaries of Hadoop beyond the original file-based MapReduce paradigm.
Feel free to clone/download the project at

It was quite a journey to discover how to do word count in so many languages and frameworks. I was definitely standing on the shoulders of giants here, so big kudos and due credit to the great teams and programmers who came up with the original examples and tutorials.

The file used for the word count, named lorem.txt, contains the "Far far away" lorem-ipsum sample text, produced with a text generator.

Build, Compile, and Run on Hadoop

My findings on installing and running MapReduce on Hadoop: I installed the Hadoop core, single-node installation, version 2.3.0, running with Oracle Java 7 and Scala 2.10.3 on elementary OS Luna with Linux kernel 3.2.0-60-generic.

Hadoop MapReduce, as well as streaming with Python, was easy to reproduce. To run Pig I needed to recompile it against my Java 7 install. Cascading with Gradle worked once I understood that I needed to compile the jar using Java 6.

Scalding took me a bit longer, since at first I did not fully understand the mechanics of the "fat jar" build. After digging a bit into the Scala assembly plugin, that demo went through as well.


R, shell, Scala, and Python on local files (the non-distributed versions): those were a breeze.

Hadoop: Pig
  • pros: concise, easy to start with, good set of built-in functions
  • cons: not easy to debug; the mix of case-sensitive and case-insensitive keywords can be confusing

Hadoop: Hive
  • pros: excellent for SQL-like tables and operations
  • cons: not ideal for text processing and word counting

Hadoop: MapReduce in Java
  • pros: very structured, no need for high-level tools
  • cons: verbose; you need to understand the fine print of mappers, reducers, and combiners

Hadoop: Cascading
  • pros: clear design of a dataflow with pipes, sources, and sinks; concise Java programming
  • cons: introduces an extra level of abstraction on top of MapReduce

Hadoop: Scalding
  • pros: very concise, efficiently binds Scala functions to distributed Cascading pipelines
  • cons: not all type bindings are available; requires writing idiomatic functional code

Tez, Spark, and Storm: I haven't had time to include them, but they would definitely be interesting to add to the mix, in particular now that Hadoop is moving to YARN.

Most of the scripts below require the input file to be available in HDFS.
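Uploading the input file could look like this; a sketch, assuming the `wordcount` directory name, which is just an illustration:

```shell
# copy lorem.txt from the local file system into an HDFS directory
# (the directory name "wordcount" is illustrative)
hdfs dfs -mkdir -p wordcount
hdfs dfs -put lorem.txt wordcount/lorem.txt
hdfs dfs -ls wordcount
```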

MapReduce examples jar

This one is a bit of a warm-up: not really coding, but a quick check to verify that the Hadoop installation is actually working, and that the bundled $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0.jar toolbox works.
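A minimal run could look as follows, assuming lorem.txt was uploaded to a `wordcount` directory in HDFS (the input and output paths are illustrative):

```shell
# run the bundled wordcount example, then inspect the result
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0.jar \
    wordcount wordcount/lorem.txt wordcount-out
hdfs dfs -cat wordcount-out/part-r-00000
```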

MapReduce using Pig
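The word count fits in a handful of lines of Pig Latin; a minimal sketch, with illustrative input and output names:

```pig
-- load each line of the input as a single chararray field
lines   = LOAD 'wordcount/lorem.txt' AS (line:chararray);
-- split every line into words, one word per record
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- group identical words together and count each group
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS count;
STORE counts INTO 'wordcount-pig';
```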

MapReduce using Hive
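In Hive, the same computation is expressed as SQL over a one-column table of lines; a sketch, with illustrative table and path names (note that LOAD DATA INPATH moves the file into the Hive warehouse):

```sql
CREATE TABLE docs (line STRING);
LOAD DATA INPATH 'wordcount/lorem.txt' OVERWRITE INTO TABLE docs;

-- explode each line into one word per row, then count per word
SELECT word, COUNT(1) AS count
FROM (SELECT explode(split(line, '\\s+')) AS word FROM docs) w
GROUP BY word;
```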

MapReduce using Java and the Hadoop core libraries

This is probably the most famous MapReduce example: the mother of all MapReduce programs, using just the core Hadoop map and reduce libraries.
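It boils down to a tokenizing mapper, a summing reducer (reused as combiner), and a driver that wires them into a job; a sketch along the lines of the classic example, with illustrative class names:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // emits a (word, 1) pair for every token in the input line
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // sums the counts for each word; also usable as a combiner
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```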

Streaming using Python

An easy way to deal with MapReduce is to use the streaming jar, also part of the standard installation, and define the map and reduce steps as Python scripts.
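Streaming runs the mapper over stdin, sorts its output by key, and feeds it to the reducer. A sketch of both phases in one script; the file name and the map/reduce command-line flag are my own convention, not part of Hadoop:

```python
#!/usr/bin/env python
# word count for hadoop streaming: a sketch assuming whitespace tokenization
import sys


def mapper(lines):
    """Emit a tab-separated (word, 1) pair for every token."""
    for line in lines:
        for word in line.strip().split():
            yield "%s\t1" % word


def reducer(pairs):
    """Sum counts per word; streaming sorts the mapper output,
    so equal keys arrive on adjacent lines."""
    current, count = None, 0
    for pair in pairs:
        word, n = pair.strip().rsplit("\t", 1)
        if word == current:
            count += int(n)
        else:
            if current is not None:
                yield "%s\t%d" % (current, count)
            current, count = word, int(n)
    if current is not None:
        yield "%s\t%d" % (current, count)


if __name__ == "__main__":
    # pick the phase from the command line: `wordcount.py map` or `wordcount.py reduce`
    phase = sys.argv[1] if len(sys.argv) > 1 else "map"
    step = mapper if phase == "map" else reducer
    for out in step(sys.stdin):
        print(out)
```

The job would then be submitted with the streaming jar, along the lines of `-files wordcount.py -mapper 'wordcount.py map' -reducer 'wordcount.py reduce'` plus the input and output paths (the jar lives under share/hadoop/tools/lib in the 2.3.0 layout).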

MapReduce using Cascading and Java
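A sketch of the classic Cascading 2.x flow, with illustrative names and paths: a pipe reads lines from a source tap, a RegexGenerator splits them into words, and a GroupBy/Every pair counts each group.

```java
import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.hadoop.TextLine;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class WordCountFlow {
  public static void main(String[] args) {
    String inputPath = args[0], outputPath = args[1];

    // source and sink taps over HDFS, one line of text per tuple
    Tap source = new Hfs(new TextLine(new Fields("line")), inputPath);
    Tap sink = new Hfs(new TextLine(), outputPath, SinkMode.REPLACE);

    // split each line into words, group by word, count each group
    Pipe assembly = new Pipe("wordcount");
    assembly = new Each(assembly, new Fields("line"),
        new RegexGenerator(new Fields("word"), "\\S+"));
    assembly = new GroupBy(assembly, new Fields("word"));
    assembly = new Every(assembly, new Count(new Fields("count")));

    // connect the dataflow and run it on Hadoop
    Flow flow = new HadoopFlowConnector(new Properties())
        .connect("wordcount", source, sink, assembly);
    flow.complete();
  }
}
```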

MapReduce using Scalding and Scala
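The Scalding version fits in a few lines; a sketch using the fields-based API, with illustrative argument names, packaged as a fat jar and submitted via com.twitter.scalding.Tool:

```scala
import com.twitter.scalding._

// word count with Scalding's fields-based API:
// read lines, flatten them into words, group by word, and count
class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))
    .flatMap('line -> 'word) { line: String => line.split("""\s+""") }
    .groupBy('word) { _.size }
    .write(Tsv(args("output")))
}
```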