This sort of marketing research is often beyond the capacity of traditional BI reporting frameworks. In this talk, we illustrate how we team up data scientists and big data engineers to create and scale distributed analyses on a big data platform. Big data volume matters, but the system's capacity to react in a timely fashion is becoming increasingly important as well. Which components and tools can help us build a big data platform that is also fast enough to keep up with the events affecting customers' behaviour?
The context has a temporal scope. For a given person it is, at the same time, slow changing (the defining characteristics of the person: his/her personality, memories, and past actions) and fast changing (events which influence the person's behaviour and life: trends, ads, news, fast-paced information from friends, family and co-workers).
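The two temporal horizons above can be sketched as a split context record, one part updated rarely and one part fed by the event stream. This is a minimal illustration; all names (`SlowContext`, `FastContext`, `merged_view`) are hypothetical, not part of any real system described in the talk.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical split of a user's context into its slow-changing and
# fast-changing halves (names are illustrative only).
@dataclass
class SlowContext:
    user_id: str
    personality_traits: List[str] = field(default_factory=list)  # rarely updated
    past_actions: List[str] = field(default_factory=list)        # grows slowly

@dataclass
class FastContext:
    user_id: str
    recent_events: List[str] = field(default_factory=list)       # news, ads, trends

def merged_view(slow: SlowContext, fast: FastContext) -> dict:
    """Combine both horizons into one view an analysis could consume."""
    return {
        "user_id": slow.user_id,
        "traits": slow.personality_traits,
        "history": slow.past_actions,
        "recent": fast.recent_events,
    }
```

The point of the split is that the two halves can be refreshed on very different schedules: the slow half by periodic batch jobs, the fast half by a streaming component.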
A Distributed Data OS
The users' context is increasing in size and complexity. Thanks to cluster computing power, what could once be done only monthly on a single server can now be done every day.
However, this does not take the fast path, the user's recent events, into account in the analysis. Hadoop can currently operate on hundreds of terabytes of data, but it needs time to process this information, and this big data slow path does not match the latency of responsive/reactive web applications and APIs.
The fast data path: how to process events
A good component for this is an Akka cluster: a reactive, distributed, near-real-time framework which can process millions of events even on modestly sized clusters.
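The core idea Akka builds on is the actor model: each actor owns a mailbox and processes one message at a time, so senders never block waiting for the receiver. The sketch below illustrates that idea in plain Python with a queue and a worker thread; it is not the Akka API, just a minimal model of the pattern.

```python
import threading
import queue

# Minimal sketch of the actor idea behind Akka: an actor owns a mailbox
# (a queue) and processes messages one at a time on its own thread, so
# "tell" is fire-and-forget and never blocks the sender.
class Actor:
    def __init__(self, handler):
        self._mailbox = queue.Queue()
        self._handler = handler
        self.processed = []  # exposed here only to make the example observable
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def tell(self, message):
        """Fire-and-forget send: enqueue the message and return immediately."""
        self._mailbox.put(message)

    def _run(self):
        while True:
            msg = self._mailbox.get()
            if msg is None:  # poison pill stops the actor
                break
            self.processed.append(self._handler(msg))

    def stop(self):
        """Stop after draining the mailbox, then wait for the worker thread."""
        self._mailbox.put(None)
        self._thread.join()
```

A caller would write something like `actor.tell(event)` from a request handler and move on; the event is handled asynchronously, which is what keeps the fast path responsive.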
Advantages: it scales horizontally (it can run in cluster mode); it makes maximum use of the available cores and memory; processing is non-blocking, so a thread is re-used whenever a computation cannot proceed because of I/O or other blocking operations; and computation can be parallelized across many actors, reducing the overall latency of the system.
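The last advantage, lower latency through parallelism, can be demonstrated with a toy example: twenty independent event computations, each taking about 10 ms, finish far sooner when dispatched to a pool of workers than when run one after another. In Akka this dispatching role is played by routers in front of a pool of actors; the sketch below uses a plain Python thread pool purely to show the effect.

```python
import concurrent.futures
import time

# Hypothetical per-event computation; the sleep stands in for I/O or
# model scoring that would otherwise serialize the whole batch.
def score_event(event: int) -> int:
    time.sleep(0.01)
    return event * 2

events = list(range(20))

# Dispatch the independent computations across 8 workers, the way an
# Akka router would spread messages across a pool of actors.
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(score_event, events))
```

Sequentially this workload takes roughly 20 × 10 ms; with 8 workers the wall-clock time drops to a few tens of milliseconds, which is the latency win the actor-based fast path is after.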