Upcoming talks and demos:

Codemotion - Amsterdam - 16 May
DevDays - Vilnius - 17 May
Strata - London - 22 May

View Natalino Busa's profile on LinkedIn
Principal Data Scientist, Director for Data Science, AI, Big Data Technologies. O’Reilly author on distributed computing and machine learning. ​

Natalino leads the definition, design and implementation of data-driven financial and telecom applications. He has previously served as Enterprise Data Architect at ING in the Netherlands, focusing on fraud prevention/detection, SoC, cybersecurity, customer experience, and core banking processes.

​Prior to that, he had worked as senior researcher at Philips Research Laboratories in the Netherlands, on the topics of system-on-a-chip architectures, distributed computing and compilers. All-round Technology Manager, Product Developer, and Innovator with 15+ years track record in research, development and management of distributed architectures, scalable services and data-driven applications.

Wednesday, April 1, 2015

Scaling Streaming Computing

A write-up on streaming computing. Consider it another attempt to provide a listing of issues, tools, algorithms, protocols and programming paradigms specific for scaling and distributing streaming computing over multiple nodes/resources.


The last two years we have seen more and more signs that the world is shifting from aggregated data to streaming data processing. Regardless the domain, marketing, security, operations fast on-the-fly data analytics pipelines are replacing systems loaded with pre-computed analytics results.

Three main factors are driving this change.

- Actionable analytics
- Faster model training and tuning
- A unified analytics paradigm

Actionable analytics

So far most of the web applications are just catalogs.We are entering an era where the timing and the context at times are important to provide relevant result. Most of the interactions are becoming immediate, real-time, and require adequate and timely responses.

Faster model training

Data analysis tends to lose its value as the get dated. Models tend to decrease in power and accuracy as they age. Moreover, recent data in many cases keeps the complexity of the model low, since model fitting using data lagging behind the facts have usually to trade precision (low variance) for model complexity. Notifications, next best actions and fast responses are winning on batch, offline analytics and other delayed data analysis strategies.

A unified data analytics for batch and real-time

When Google introduced the map-reduce paper 2003 they were executing some hefty data processing on batches. While this paradigm works well for off line processing, it cannot be easily extended to fast data transiting through the system via low-latency APIs. This separation of operational/transactional  systems and analytical/batch systems has created a number of barriers, when it comes to analyzing data. Today it’s quite hard to migrate algorithms and models from the batch processing clusters to the reactive low latency layer. Streaming computing seems to offer a unified programming and data analytics paradigm for both batch as well as event-driven processing.

Some definitions

"Streaming" is defined as a process characterised by handling data items one by one. For instance, a python program reading a file line by line, matching some text and printing it out is a streaming application. Linux pipes and io redirects are also good example of streaming.

"Batch" is defined as a process where data items are handled as aggregated collections. In a batch system, the input would first be collected and then processed in a single go. Some examples of batched data applications are for instance file based processing. You read a file with all the records in it, at the end of the processing you have your result, for instance a new file.

Batch and streaming applications can be in many cases converted the one into the other. I can for instance stream out application events via syslog agents and push them to a kafka log queue. I can process these events one at a time via Storm (or Akka, or Samza for instance) and send out events to a web service via some API calls. This would be streaming in and streaming out a typical CEP (Complex Event Processing) scenario.


But I could also (2nd row in the picture above) process my log every hour with Spark and collect the results in a Cassandra table. Hence batch hourly events process and prepare a batch of hourly results.

I can also (3rd row in the picture) do the same by collecting my kafka logs to hourly files, e.g. in hdfs and run hive queries and produce hourly results back in hive/hdfs.

Throughput and latency

Remember that so far time was not our concern. When we bring time in the picture, we have to reason about throughput and latency.

Let's define "time budget" as the amount of time that you have available to finish your job. So, if you are within your time budget you are "on time". We can then say that this process is real-time. It does not have to mean milliseconds or seconds, just that it's appropriate to the use and within the given time budget.

"Hard real-time" systems are those system where the time budget constraint is guaranteed regardless the system load, "soft" or "near" real-time system cannot guarantee that but are meant to keep up on average with a given load within the expected latency and throughput.


Unfortunately, most of people use it as a synonym for "fast". And of course, fast is a very personal/contextual notion. Instead of talking about real-time or fast, I prefer to talk about reactive systems and divide them in groups according to their typical latency when computing data.

In the diagram here above you can see my way of looking at various distributed analytical systems, categorized by typical job latencies.

Batch vs Streaming computing

In practice, from an architectural perspective, some tools are fitter for batch/aggregated processing while others are more suited for streaming/event-based processing.


Batch oriented

Analytical tasks and OLAP jobs which require to aggregate over a large amount of data across a large window of time are typical batch tasks. This applications are best served by MPP systems and/or more recently by Hadoop systems, possibly using nodes with large amounts of memory and disk spindles.

Take, for instance, the map/reduce paradigm. I can scan 1 billion records and calculate the average of a given numerical field in those records. Hadoop map-reduce will start a job that will read in the input files, then apply the typical map shuffle sort reduce pattern and provide a result. To repeat this process of every new incoming record will be of course possible but terribly inefficient.

Hadoop, Spark, and MPPs such as Netezza, and Vertica have large job setup times (it varies from a few 100 msec to sever seconds), but once they start they can efficiently scan the given dataset.

Incidentally, Hadoop and Spark has also a similar batch oriented architecture albeit Spark is leveraging data flow transformations in memory rather than disks. Still, the problem that Hive, Tez, Spark, Hadoop MR solve is essentially the same and could be stated as follows: how can I perform a given query on a large amount of data in the most efficient way?

So, to summarize, batch systems are meant to query large amounts of data, and are not a good match for transactional workloads and fast reactive high throughput applications with latency budgets in the order of milliseconds.

Micro-Batches oriented

Beware that according to the above definition Spark Streaming is a batch system. Although a very fast one (some refer to that as micro-batching). Recent data analytics can be  more relevant than large historical datasets, especially when you want to predict behaviour in the near future. This analysis usually is meant to define trends in the past minute, hour, day.

Micro-batching is batch processing where the setup times are smaller and usually only the most recent data is analyzed. If applied continuously, this setup effectively create a stream of query results, where each element is the outcome of the given query applied to a given subset of the data (usually a temporal window).

Streaming oriented

A streaming oriented architecture is optimized to handle high throughput of incoming events (10000s events per second) with a low latency (usually in the milliseconds range).

In order to achieve this numbers usually these architecture interacts are transactional and interact with a limited amount of data per event. When performing analytical tasks this architectures are mostly using iterative algorithms where each new event is used to re-evaluate the given statistical functions in an incremental way.

Scratching the surface:

open source

proprietary solutions

Which architecture?

Big Data Analytics

Latency is not an issue. Throughput is low. Data volume is huge. Welcome Big data solutions.  Hadoop but also MPP systems. These systems are ideal for reporting, after the facts analysis, model testing/building, off-line analysis. OLAP systems are batch analytical solutions which can scan large amount of data on large aggregating queries, if you can afford to wait for minutes and potentially hours.

Event Processing

Also known as CEP / Stream Computing / Event Computing. High Throughput and low latency are the thing. High velocity but you don't need to compute large data sets. Welcome to akka, spray, reactive asynchronous programming. This is the domain of  high speed APIs and event processing.

Streaming Processing

You might say that streaming computing is a stateful implementation of Event Processing. By adding state you can for instance have analytics which are updated extremely fast. This is for instance necessary if your model varies significantly when a new event comes in the system and when event scoring must be always performed using an up-to-date model. Typically used in trading platforms and high speed marketing ads bidding platforms


Transactional Processing

In this category, we find the current state-of-the-art in distributed OLTP. Some of them sport ACID transactions (such as voltDB) favoring Consistency over Availability in terms of CAP. These new breed of systems focus on a transaction data-centric approach, offering user defined procedures and triggers while performing high-throughput SQL(-like) transactions of the events. They usually rely on some distributed consensus protocol in order to reach serializability and transactionality on the given data transformation (paxos, raft). If you don’t want/can’t to bring the data to the code, you can also bring the code to the data. Some modern OLTP ACID systems, such as VoltDB, solve the problem of transactional operations and trigger custom processing in the form of in-place databased user-defined stored procedures.

Streaming analytics works well for batch data

While batch oriented architectures are not fit for event streaming scenario, the opposite is not true. You can in fact use streaming analytics to stream in data collected in batches.

Provided that the given queries can be performed in a streaming fashion, which conversely means that operations such as joins, filters, counts and unique count, grouping have to be implemented as incremental distributed operations.

This is the reason why streaming architectures are all at rage now. They offer the opportunity to have a uniform processing pipeline for both fast real-time data /and/ batch big data use cases.

Streaming computing, can provide a common programming framework for both real-time paths as well as batch analytics. This is a big win when compared to other frameworks such as the lambda architecture where batch and real-time subsystems require very different programming techniques and are hard to keep in sync, when models, algorithms and applications change.




Having the cake and eating it.

What if you want and to be able to process large amount of data as well as being up-to-date and keep track of the last events? The reason for going here is to keep the latency low while processing large amount of historical data. Several architectural patterns can be helpful there.

Lambda architecture

Advantages: it fills in short term / recent analytics with historical batch analytics. It requires  a merge layer to combine the query results of the slow and fast path. Furthermore it usually requires two different technologies and logics and algorithms for the fast and the slow layer, which must be kept in sync. One attempt to do so, with a unified programming paradigm has been tried at twitter with the algebird project.


Streaming Analytics

Streaming computing and Transactional Processing are somewhat similar, although stemming from two different schools. One is extending event processing with distributed stateful computation, while the other is expanding a distributed OLTP with embedded transactional user defined procedures.

Some call streaming/transactional processing, streaming analytics when par of the streaming computing data-flow is meant to collect and incrementally update some analytics (histograms, statistics, correlations) of the incoming data events.


It offers a uniform computing paradigm. cons it's not compatible with current big data solutions. Hard problem to tackle is that of fault-tolerant state replication.

Some very well designed streaming analytics solutions: