Upcoming talks and demos:

Jupyter Con - New York 23-25 Aug

View Natalino Busa's profile on LinkedIn

Principal Data Scientist, Director for Data Science, AI, Big Data Technologies. O’Reilly author on distributed computing and machine learning.

Natalino leads the definition, design and implementation of data-driven financial and telecom applications. He has previously served as Enterprise Data Architect at ING in the Netherlands, focusing on fraud prevention/detection, SoC, cybersecurity, customer experience, and core banking processes.

​Prior to that, he had worked as senior researcher at Philips Research Laboratories in the Netherlands, on the topics of system-on-a-chip architectures, distributed computing and compilers. All-round Technology Manager, Product Developer, and Innovator with 15+ years track record in research, development and management of distributed architectures, scalable services and data-driven applications.

Tuesday, February 11, 2014

Hadoop 2.0 beyond mapreduce: A distributed generic application OS for data crunching

Central on the original concept of Hadoop is the map-reduce paradigm/architecture. Mapreduce is  based on two entities: one Job Tracker and a series of  Task tracker (mostly one per data worker). This paradigm is powerful but this is only one way to accomplish distributed computing. This approach is batch oriented and is targeting crunching large files making the most use of data locality. However good mapreduce is for a specific class of distributed computing tasks, it is not a general pattern that applies well for all applications. 

 Rather than forcing the mapreduce paradigm on each application running on the cluster hadoop 2.0 and yarn focus on the idea how to separate the hadoop application (mapreduce) from a more general problem of resource monitoring and management (yarn).

 Central to hadoop map-reduce v1 is the job tracker. This is a single cluster agent which has to take care of several functions. It monitors tasks, it schedules tasks to task trackers, re-schedule them if the task tracker fails to execute the task.

 Hadoop 2.0 re-invents the concept of distributed computing on hadoop. Hadoop is more than mapreduce, in encompasses real time distributed computing (Tez, Storm), in-memory computing (e.g. Spark), graph database (Giraph), bigtable (Hbase), and more.

 In a way, moving forward from the traditional job-task tracker and map-reduce paradigm has been achieved by splitting the higher level function (map-reduce) from the resource manager/negotiator. In a nutshell, the idea of hadoop 2.0 is that the resource manager is redesign to work with any arbitrary higher level app. The new resource negotiator is introduced in the architecture as YARN (yet another resource negotiator).

from http://hortonworks.com/hadoop/yarn/

In a way YARN provides a data functional lake. A lake of distributed processes and functions. A data distributed scheduler which in facts transform the data lake in a process lake. A data distributed process OS.

 In YARN, when you submit a job you push the job to a scheduler which allocates resources according to a resource constraints scheduling algorithm of choice. The Resource Manager manages the Application Managers, which conversely have full control on the distributed task management: start, stop, and monitoring your function.

As you can see, map reduce can be described as a specific application manager on hadoop 2.0.
On each worker node, in the new architecture consists of:

  • A global Resource Manager
    global resource allocation
  • A Node Manager
    common to all the applications on that node
  • An Application Managera per-application manager per cluster, this is the core of  YARN
  • A Containera per-application collection of resources allocated for an application manager,
    and running on a node manager (you might have more one node running your application)

source: http://hadoop.apache.org/docs/current2/hadoop-yarn/hadoop-yarn-site/YARN.html

When the client wants to execute a new job. It will start by negotiating resources with the global  resource manager. The resource manager collects the information about available resources (containers) from the node managers.

After the resources are negotiated, the job is delegated to an instance of the application manager which "understand" the client job type. The Application manager spawns and monitors a number of containers, on a number of nodes.

The monitoring of container is globally coordinated by the application manager instance. However to node managers can locally tune and control the node resources, and the local containers.


Hadoop 2.0 and YARN will unleash a much wider range of distributed applications, compared to what has been available so far under the map-reduce file-based batch-base paradigm.

While everybody is free to package a new hadoop application, by writing a new application manager for that, the road map looks promising and many of the current application (storm,  giraph, spark, Tez) might be soon available as standard yarn 2.0 applications.

As of today only the map reduce application is available as a yarn application. When other components will migrate to yarn / hadoop v2 is still unclear.

References, Acknowledgments

Thanks to Rich Raposa from hortonworks for kindly answering so many question about YARN, HA, and the hadoop 2.0 architecture during strata 2014.