Upcoming talks and demos:

Codemotion - Amsterdam - 16 May
DevDays - Vilnius - 17 May
Strata - London - 22 May



View Natalino Busa's profile on LinkedIn
Principal Data Scientist, Director for Data Science, AI, Big Data Technologies. O’Reilly author on distributed computing and machine learning. ​

Natalino leads the definition, design and implementation of data-driven financial and telecom applications. He has previously served as Enterprise Data Architect at ING in the Netherlands, focusing on fraud prevention/detection, SoC, cybersecurity, customer experience, and core banking processes.

​Prior to that, he had worked as senior researcher at Philips Research Laboratories in the Netherlands, on the topics of system-on-a-chip architectures, distributed computing and compilers. All-round Technology Manager, Product Developer, and Innovator with 15+ years track record in research, development and management of distributed architectures, scalable services and data-driven applications.

Thursday, August 4, 2016

Streaming Analytics: A story of many tales

Streaming Analytics is a data processing paradigm which is gaining much traction lately, mainly because more and more data is available as events through web services and real-time sources rather than being collected and packaged in data batches.

Over the past years, we have seen a number of projects which are tackling this problem. Some are stemming from batch analysis and achieve streaming computing by scheduling data processing at a high rate (micro-batching). Other projects started by dealing with event processing and moved up to process bigger data collections.
Streaming analytics can be applied to both relational, NoSQL and document oriented data stores. The underlying datastore and persistency model has a profound effect on the nature of the streaming analytics solution devised.

Solutions, Tools, Engines ...

So, how can we classify streaming computing initiatives?
We can divide streaming computing into three different categories:
  • Streaming Data Solutions (operation and business users)
  • Tools (data analysts, data scientists)
  • Engines,  (data engineers, data scientists and statisticians)


Streaming processing engines

From here on, let's focus on the engines, as streaming computing engines are the foundation for any domain-specific streaming application. Streaming engines can be seen as distributed data OS'es. 
Streaming processing engines are best understood if grouped according to the following five dimensions:
  1. Generation:
    Development epoch and influences. Which came first, which ideas and lesson learned have influenced later projects?
  2. State:How is the state preserved? Is there a memory to disk spilling strategy? How are snapshots managed? What is the cost of a full state restore? Can events be replayed?
  3. Resiliency:How well can the distributed processing engine deal with failure? What about replaying incoming events? Does the architecture provide any guarantees about exactly-once data processing?
  4. Paradigm:
    Streaming computing paradigm. Event- or Batch- based? What is the sweet spot for a given engine in terms of streaming processing latency and throughput?
  5. Orchestration:
    Underlying resource orchestration, think for instance of Mesos, Yarn, etc.
Here below, a very opinionated chart of a number of bespoken streaming processing technologies which have been developed in the last years. Do please take the following picture with a good pinch of salt as it's more inspirational rather than being thorough or accurate in any ways.
The following list is roughly ordered by generation (a mix of both chronologically as in terms of evolution from previous paradigms/engines)

Flink

orchestrators: Yarn, Standalone, Cloud (EC2, GCE)
paradigm: event-driven
language: Java 47.2% C++ 24.7% Python 13.9%
community: github forks: 286
community: github first commit: Dec 13, 2015
url: https://flink.apache.org/
Apache Flink is a community-driven open source framework for distributed big data analytics, like Hadoop and Spark. The core of Apache Flink is a distributed streaming dataflow engine written in Java and Scala. It aims to bridge the gap between MapReduce-like systems and shared-nothing parallel database systems. Therefore, Flink executes arbitrary dataflow programs in a data-parallel and pipelined manner. Flink’s pipelined runtime system enables the execution of bulk/batch and stream processing programs. Furthermore, Flink’s runtime supports the execution of iterative algorithms natively.
Flink programs can be written in Java or Scala and are automatically compiled and optimized into dataflow programs that are executed in a cluster or cloud environment. Flink does not provide its own data storage system, input data must be stored in a distributed storage system like HDFS or HBase. For data stream processing, Flink consumes data from (reliable) message queues like Kafka.

Spark Streaming

orchestrators: Mesos, Yarn, Standalone, Cloud (EC2)
paradigm: batch
languages: Scala 77.2% Java 10.6%
community: github forks: 8261
community: github first commit: Mar 28, 2010
url: http://spark.apache.org/streaming/
Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards. In fact, you can apply Spark’s machine learning and graph processing algorithms on data streams.
Internally, it works as follows. Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches. Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data. DStreams can be created either from input data streams from sources such as Kafka, Flume, and Kinesis, or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs.
This guide shows you how to start writing Spark Streaming programs with DStreams. You can write Spark Streaming programs in Scala, Java or Python (introduced in Spark 1.2), all of which are presented in this guide. You will find tabs throughout this guide that let you choose between code snippets of different languages.
Heronorchestrators: Standalone
paradigm: event-driven
language: Java 47.2% C++ 24.7% Python 13.9%
community: github forks: 286
community: github first commit: Dec 13, 2015
url: http://twitter.github.io/heron/
Essentially Storm 2.0, compatible with the storm API, but with better support for state and resources management. Heron is designed to be fully backward compatible with existing Apache Storm projects, which means that you can migrate an existing Storm topology to Heron by making just a few adjustments to the topology’s pom.xml Maven configuration file.

Kafka streams

orchestrators: Standalone
paradigm: event-driven
language: Java 57.3% Scala 36.7%
community: github forks: 1987
community: github first commit: Jul 31, 2011
url: http://docs.confluent.io/3.0.0/streams/
Embedded distributed computing in kafka queues. Provides standalone “processor” client besides the existing producer and consumer clients for processing data consumed from Kafka and storing results back to Kafka. A processor computes on a stream of messages, with each message composed as a key-value pair. Processor receives one message at a time and does not have access to the whole data set at once. The k-stream processor api supports per-message processing and time-triggered processing. Multiple processors should be able to chained up to form a DAG (i.e. the processor topology) for complex processing logic.
Users can define such processor topology in a exploring REPL manner: make an initial topology, deploy and run, check the results and intermediate values, and pause the job and edit the topology on-the-fly. Users can create state storage inside a processor that can be accessed locally. For example, a processor may retain a (usually most recent) subset of data for a join, aggregation / non-monolithic operations.

Samza

orchestrators: Yarn, Standalone
languages: Scala 48.0% Java 47.8%
paradigm: event-driven
maturity: medium
community: github forks: 93
community: github first commit: Aug 11, 2013
url: https://samza.apache.org/
Apache Samza is a distributed stream processing framework. It uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management. Samza provides a callback-based “process message” API comparable to MapReduce. Samza manages snapshotting and restoration of a stream processor’s state. When the processor is restarted, Samza restores its state to a consistent snapshot. Samza is built to handle large amounts of state (many gigabytes per partition). Samza works with YARN to transparently migrate your tasks to another machine.

Akka Streams

orchestrators: Standalone
paradigm: event-driven
language: Scala 80.5% Java 17.4%
community: github forks: 1762
community: github first commit: Feb 15, 2009
url: http://doc.akka.io/docs/akka/2.4.9-RC1/scala/stream/
The purpose is to offer an intuitive and safe way to formulate stream processing setups such that we can then execute them efficiently and with bounded resource usage—no more OutOfMemoryErrors. In order to achieve this our streams need to be able to limit the buffering that they employ, they need to be able to slow down producers if the consumers cannot keep up. This feature is called back-pressure and is at the core of the Reactive Streams initiative of which Akka is a founding member.
For you this means that the hard problem of propagating and reacting to back-pressure has been incorporated in the design of Akka Streams already, so you have one less thing to worry about; it also means that Akka Streams interoperate seamlessly with all other Reactive Streams implementations (where Reactive Streams interfaces define the interoperability SPI while implementations like Akka Streams offer a nice user API).

Gearpump

orchestrators: Standalone, Yarn
paradigm: event-driven
language: Scala 83.0% JavaScript 9.0%
community: github forks: 17
community: github first commit: Jul 20, 2014
url: http://www.gearpump.io/overview.html

Gearpump is a real-time big data streaming engine. It is inspired by recent advances in the Akka framework and a desire to improve on existing streaming frameworks. Gearpump is event/message based and featured as low latency handling, high performance, exactly once semantics, dynamic topology update, Apache Storm compatibility, etc. Smaller project with a community at Intel.

Akka

orchestrators: Standalone
paradigm: event-driven
language: Scala 80.5% Java 17.4%
community: github forks: 1762
community: github first commit: Feb 15, 2009
url: http://akka.io/
Akka deliver the Actor Model as a programming abstraction to to build scalable, resilient and responsive applications. For fault-tolerance we adopt the "let it crash" model which the telecom industry has used with great success to build applications that self-heal and systems that never stop. Actors also provide the abstraction for transparent distribution and the basis for truly scalable and fault-tolerant applications.
Logstash / Elastic Searchorchestrators: Standaloneparadigm: event-driven (logstash), batch (elasticsearch)language: Ruby 83.8% Java 10.8%community: github forks: 1794community: github first commit: Aug 2, 2009url: https://www.elastic.co/
Log stash can also be used for streaming processing and aggregations. In particular log stash has a large collection of plugins for source extraction, data transformation, and output connectors. Elastic search can also be used to provide a micro batch analysis of the results. This approach is however weaker in terms of management of the resources and the guarantees for resiliency and durability.
Stormorchestrators: Standalone
paradigm: event-driven
language: Java 78.0% Clojure 10.9% Python 7.9%
community: github forks: 2428
community: github first commit: Sep 11, 2011
url: http://storm.apache.org/
Apache Storm is a distributed computation framework written predominantly in the Clojure programming language. Originally created by Nathan Marz and team at BackType the project was open sourced after being acquired by Twitter. It uses custom created “spouts” and “bolts” to define information sources and manipulations to allow batch, distributed processing of streaming data. The initial release was on 17 September 2011.
Storm application is designed as a “topology” in the shape of a directed acyclic graph (DAG) with spouts and bolts acting as the graph vertices. Edges on the graph are named streams and direct data from one node to another. Together, the topology acts as a data transformation pipeline. At a superficial level the general topology structure is similar to a MapReduce job, with the main difference being that data is processed in real time as opposed to in individual batches. Additionally, Storm topologies run indefinitely until killed, while a MapReduce job DAG must eventually end.
Trident is a high-level abstraction for doing realtime computing on top of Storm. It allows you to seamlessly intermix high throughput (millions of messages per second), stateful stream processing with low latency distributed querying. Trident has support for joins, aggregations, grouping, functions, and filters. In addition to these, Trident adds primitives for doing stateful, incremental processing on top of any database or persistence store. Trident has consistent, exactly-once semantics, so it is easy to reason about Trident topologies.