Upcoming talks and demos:

Jupyter Con - New York 23-25 Aug

View Natalino Busa's profile on LinkedIn

Principal Data Scientist, Director for Data Science, AI, Big Data Technologies. O’Reilly author on distributed computing and machine learning.

Natalino leads the definition, design and implementation of data-driven financial and telecom applications. He has previously served as Enterprise Data Architect at ING in the Netherlands, focusing on fraud prevention/detection, SoC, cybersecurity, customer experience, and core banking processes.

​Prior to that, he had worked as senior researcher at Philips Research Laboratories in the Netherlands, on the topics of system-on-a-chip architectures, distributed computing and compilers. All-round Technology Manager, Product Developer, and Innovator with 15+ years track record in research, development and management of distributed architectures, scalable services and data-driven applications.

Thursday, February 13, 2014

Beyond hadoop: in-memory map reduce with the Berkeley Data Analytics Stack (BDAS)

One of the strongest points of the original hadoop is an immutable distributed file system. However, this strength is also one of the biggest limitations of hadoop. Everything you do on hadoop must be read from a file and must be written back to a file.

The Berkeley Data Analytics Stack (BDAS) by AMPlab offers an alternatives to this scenario. In a nutshell, it provides a mechanism to promote hadoop data to in-memory data structures which can be evaluated and transformed used a distributed in-memory run-time system.

At the core of the stack, there is spark. It converts the data structures into a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop HDFS files or by transforming other RDDs.

Spark works essentially as a data lineage tool, it transform the RDD's into new structures. The transformation and the action of the in-memory data structures are currently available with python and scala bindings. These two languages are a great match for map reduce in-memory operations because they provide native map and list comprehensions function and immutable transformations, which can be bound directly on the RDD in-memory data structures.

Spark works as follows: first a chain of transformations is defined (think of a filter on a specific field, or a arithmetic operation on the elements of the dataset), then the results are extracted. Since the evaluations are lazy, no computation in memory is triggered up to the last moment that result data is actually consumed.

Shark is an extension on the base spark stack, and it provides an in-memory distributed sql. Shark can be positioned as an in-memory alternative to hive, and roughly competing with proprietary solutions such as memsql. By using the underlying Spark layer, Shark can leverage in memory sql queries across a large number of nodes with a considerable run-time speed compared to the currewnt solutions based on hive and hadoop map-reduce.

Other components have been developed lately around the core idea of RDD and spark, a distributed in-memory machine learning library MLlib, and a streaming extension of the spark base library.

BDAS stack as of Feb 2014: https://amplab.cs.berkeley.edu/software/

Spark streaming can periodically schedule operations and transformations on stream like RDD collected in memory across the nodes and produce streaming / rolling results. The package makes use of in-memory computing. It is reactive, lazy, in-memory and distributed.


The berkeley stack pushes hadoop beyond the limits of file based processing, exposing a set of libraries and components which can operate directly on in-memory distributed data structures. In particular, distributed streaming computing, in-memory sql, and distributed machine learning could significantly revolutionize the way the hadoop ecosystem in the coming years, and unleash the potential of distributed in-memory computing in hadoop clusters. So far BDAS looks like a great open-source alternative to proprietary solutions such as memsql, and paraccel and the like and definitely a very welcome extension on the current hadoop yarn stack.