The Berkeley Data Analytics Stack (BDAS) from the AMPLab offers an alternative to this scenario. In a nutshell, it provides a mechanism to promote Hadoop data to in-memory data structures, which can be evaluated and transformed using a distributed in-memory runtime system.
At the core of the stack is Spark. It converts the data into a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop HDFS files or by transforming other RDDs.
Spark works essentially as a data lineage tool: it transforms RDDs into new RDDs. Transformations and actions on the in-memory data structures are currently available through Python and Scala bindings. These two languages are a great match for in-memory map-reduce operations because they provide native map functions, list comprehensions, and immutable transformations, which can be applied directly to the RDD in-memory data structures.
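To make the lineage idea concrete, here is a minimal single-process sketch in plain Python. It is a toy model, not Spark's API: the `ToyRDD` class and the `parallelize` helper are hypothetical names introduced only for illustration. Transformations (`map`, `filter`) only record a step in the lineage, and nothing is computed until an action (`collect`, `reduce`) is called.

```python
from functools import reduce


class ToyRDD:
    """Toy, single-process stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self, source, lineage=()):
        self._source = source    # zero-argument callable returning an iterator over the base items
        self._lineage = lineage  # tuple of (kind, function) steps, applied in order

    def map(self, fn):
        # Transformation: return a new ToyRDD with one more lineage step; nothing runs yet.
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, pred):
        # Transformation: same idea, the existing ToyRDD is left untouched.
        return ToyRDD(self._source, self._lineage + (("filter", pred),))

    def _compute(self):
        # Replay the recorded lineage on top of the base data.
        items = self._source()
        for kind, fn in self._lineage:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    def collect(self):
        # Action: materialize the result as a list.
        return list(self._compute())

    def reduce(self, fn):
        # Action: fold the result into a single value.
        return reduce(fn, self._compute())


def parallelize(data):
    # Assumption: a local tuple stands in for data spread across nodes.
    snapshot = tuple(data)
    return ToyRDD(lambda: iter(snapshot))


numbers = parallelize(range(1, 11))
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

print(even_squares.collect())                   # [4, 16, 36, 64, 100]
print(even_squares.reduce(lambda a, b: a + b))  # 220
```

Because each transformation returns a new object and only stores how to derive it, the original `numbers` collection stays unchanged and any derived collection can be recomputed from its lineage, which is the property the "resilient" part of the RDD name refers to.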
Shark is an extension of the base Spark stack that provides distributed in-memory SQL. Shark can be positioned as an in-memory alternative to Hive, roughly competing with proprietary solutions such as MemSQL. By using the underlying Spark layer, Shark can run in-memory SQL queries across a large number of nodes, with a considerable run-time speedup compared to current solutions based on Hive and Hadoop MapReduce.
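As a rough intuition for how an aggregate SQL query can be spread over in-memory partitions, the sketch below evaluates the equivalent of SELECT page, SUM(hits) FROM views GROUP BY page in plain Python. It is not how Shark is implemented: the `views` data, the three-way partitioning, and the helper names are all assumptions made for illustration. Each partition computes a partial aggregate locally, and the partials are then merged.

```python
from collections import Counter

# Hypothetical (page, hits) rows, already split across three "nodes".
partitions = [
    [("home", 1), ("docs", 1), ("home", 1)],
    [("docs", 1), ("blog", 1)],
    [("home", 1), ("blog", 1), ("blog", 1)],
]


def partial_counts(rows):
    """Local step: aggregate one partition entirely in memory."""
    counts = Counter()
    for page, hits in rows:
        counts[page] += hits
    return counts


def merge_counts(partials):
    """Global step: combine the per-partition aggregates."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts rather than replacing them
    return total


result = merge_counts(partial_counts(p) for p in partitions)
print(dict(result))  # {'home': 3, 'docs': 2, 'blog': 3}
```

Only the small per-partition summaries need to be combined, so the raw rows never have to be gathered in one place.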
Other components have recently been developed around the core ideas of RDDs and Spark: MLlib, a distributed in-memory machine learning library, and Spark Streaming, a streaming extension of the base Spark library.
Figure: BDAS stack as of Feb 2014 (https://amplab.cs.berkeley.edu/software/)
Spark Streaming can periodically schedule operations and transformations on stream-like RDDs collected in memory across the nodes and produce streaming or rolling results. It is reactive, lazy, in-memory, and distributed.
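The idea of periodic batches producing rolling results can be sketched in a few lines of plain Python. This is a single-process illustration under stated assumptions, not Spark Streaming's API: the function name `rolling_word_counts`, the `window` parameter, and the sample batches are hypothetical. Each incoming batch is summarized, and the result emitted for every tick covers only the most recent `window` batches.

```python
from collections import Counter, deque


def rolling_word_counts(batches, window=3):
    """Yield word counts over the last `window` batches, one result per batch."""
    recent = deque(maxlen=window)  # the oldest batch summary drops out automatically
    for batch in batches:
        # Summarize the new batch of text lines.
        recent.append(Counter(word for line in batch for word in line.split()))
        # Combine the summaries that are still inside the window.
        total = Counter()
        for counts in recent:
            total.update(counts)
        yield dict(total)


# Hypothetical stream: each inner list is the data collected in one interval.
stream = [["a b"], ["b c"], ["c c"], ["a"]]

for result in rolling_word_counts(stream, window=2):
    print(result)
# {'a': 1, 'b': 1}
# {'a': 1, 'b': 2, 'c': 1}
# {'b': 1, 'c': 3}
# {'c': 2, 'a': 1}
```

The function is a generator, so results are produced lazily, one per batch, as the consumer asks for them.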
Conclusions
The Berkeley stack pushes Hadoop beyond the limits of file-based processing, exposing a set of libraries and components that operate directly on distributed in-memory data structures. In particular, distributed stream processing, in-memory SQL, and distributed machine learning could significantly change the Hadoop ecosystem in the coming years and unleash the potential of distributed in-memory computing in Hadoop clusters. So far, BDAS looks like a great open-source alternative to proprietary solutions such as MemSQL, ParAccel, and the like, and it is definitely a very welcome extension of the current Hadoop YARN stack.