Principal Data Scientist, Practice Lead for Data Science, AI, Big Data Technologies at Teradata. O’Reilly author on distributed computing and machine learning. ​

Natalino leads the definition, design and implementation of data-driven financial and telecom applications. He has previously served as Enterprise Data Architect at ING in the Netherlands, focusing on fraud prevention/detection, SoC, cybersecurity, customer experience, and core banking processes.

​Prior to that, he had worked as senior researcher at Philips Research Laboratories in the Netherlands, on the topics of system-on-a-chip architectures, distributed computing and compilers. All-round Technology Manager, Product Developer, and Innovator with 15+ years track record in research, development and management of distributed architectures, scalable services and data-driven applications.

Tuesday, July 1, 2014

Big & Fast: A quest for relevant and real-time analytics

The retail market demands now more than ever to stay close to our customers, and to carefully understand what services, products, and wishes are relevant for each customer at any given time.

This sort of marketing research is often beyond the capacity of traditional BI reporting frameworks. In this talk, we illustrate how we team up data scientists and big data engineers in order to create and scale distributed analyses on a big data platform. Big Data is important but speed and the capacity of the system of reacting in a timely fashion is becoming also increasingly important. Which components and tools can help us creating a big data platform which is also fast enough to keep up with the events affecting the customers' behaviour?

When the marketing goes from push to ask (permission marketing), it's the user the one who grants the interaction. Permission marketing is the user's grant of being heard. In order to be effective and lead to conversions, it's important to provide the right suggestions, at the right time. This is largely determined by the user's context when the interaction happens.

The context has some temporal scope, for instance for a given person it is at the same time: slow changing: the defining characteristic of a person, his/her personality, and memories, and past actions fast changing: events which influence the persons behaviour and life, trends, ads, news, fast paced information from friends, family and co-workers

A Distributed Data OS. 

The given users' context is increasing in size and complexity. Thanks to the cluster computing power, what could be done once a month on a single server, can be done now everyday.

Distributed computing come in all sort of flavours but hadoop has now become the affirmed de-facto open-source platform for distributed data processing. Why? It's convenient, resilient, and offers a good trade-off between costs (recurring and one-off) vs resources (both computing as well as storage)

However this does not take into account the user's recent events fast path in the analysis. Hadoop, currently can operate on 100's of terabytes of data, but it requires time to process this information and this big data slow path does not match the latency of responsive/reactive web applications and APIs.

The fast data path: how to process events

For that a good component could be an Akka cluster. It's a reactive and distributed near real-time framework which can process millions of events event in modest sized clusters

Advantages: it scales horizontally (can run in cluster mode), it makes maximum use of the avaliable cores/memory, the processing is non-blocking, thread is re-used if computation cannot proceed because of I/O of other blocking operations, the computation can be parallelized across many actors and therefore reduce the overall latency of the system.

Cassandra: a low latency data store

How to connect the two systems: Cassandra as a distributed memory key value store. Why? it's a low latency data store the system is resiliant, with no single points of failure and distributed across multiple nodes and data center for high availability Cassandra can be used as a "latency impedance" between the fast path and the slow path.

This sort of architecture is often referred in the literature as the lambda architecture (although the original version proposed by Natahn Marz refer to the combination hadoop / storm, while here I am describing a system based on hadoop / cassandra / akka). Cassandra can be used to store models parameters, preliminary results from hadoop, as well as fast data and events.