Talks, Conferences and Workshops

10 April, Amsterdam: Emerce Conversion
11 April, Amsterdam: Workshop Big Data
5 May, Berlin: API days
29 June, Rotterdam: International Symposium on Forecasting

Tuesday, April 15, 2014

It's Significant! 100 years of FUD

What's wrong with statistics?

FUD stands for "fear, uncertainty and doubt". This fantastic comic by xkcd illustrates all the confusion and uncertainty that arise when researchers and statisticians communicate with managers and PR advisors about concepts such as statistical significance, repeated testing, and the sizing of statistical experiments.

Misunderstandings and Criticisms

(extracted from Wikipedia)

Despite the ubiquity of p-value tests, this particular test for statistical significance has been criticized for its inherent shortcomings and the potential for misinterpretation.

The data obtained by comparing the p-value to a significance level will yield one of two results: either the null hypothesis is rejected, or the null hypothesis cannot be rejected at that significance level (which however does not imply that the null hypothesis is true). In Fisher's formulation, there is a disjunction: a low p-value means either that the null hypothesis is true and a highly improbable event has occurred, or that the null hypothesis is false.

However, people interpret the p-value in many incorrect ways, and try to draw other conclusions from p-values, which do not follow.

The p-value does not in itself allow reasoning about the probabilities of hypotheses; this requires multiple hypotheses or a range of hypotheses, with a prior distribution of likelihoods between them, as in Bayesian statistics, in which case one uses a likelihood function for all possible values of the prior, instead of the p-value for a single null hypothesis.

The p-value refers only to a single hypothesis, called the null hypothesis, and does not make reference to or allow conclusions about any other hypotheses, such as the alternative hypothesis in Neyman–Pearson statistical hypothesis testing. In that approach one instead has a decision function between two alternatives, often based on a test statistic, and one computes the rate of Type I and type II errors as α and β. However, the p-value of a test statistic cannot be directly compared to these error rates α and β – instead it is fed into a decision function.

There are several common misunderstandings about p-values.

The p-value is not the probability that the null hypothesis is true, nor is it the probability that the alternative hypothesis is false – it is not connected to either of these. In fact, frequentist statistics does not, and cannot, attach probabilities to hypotheses. Comparison of Bayesian and classical approaches shows that a p-value can be very close to zero while the posterior probability of the null is very close to unity (if there is no alternative hypothesis with a large enough a priori probability and which would explain the results more easily). This is Lindley's paradox. But there are also a priori probability distributions where the posterior probability and the p-value have similar or equal values.

The p-value is not the probability that a finding is "merely a fluke." As calculating the p-value is based on the assumption that every finding is a fluke (that is, the product of chance alone), it cannot be used to gauge the probability of a finding being true. The p-value is the chance of obtaining the findings we got (or more extreme) if the null hypothesis is true.
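This definition can be made concrete by simulation. The sketch below estimates a one-sided p-value for a coin-fairness experiment: under the null hypothesis of a fair coin, how often do we see a result at least as extreme as the one observed? (The function name and the choice of 60 heads out of 100 flips are invented for illustration.)

```python
import random

random.seed(42)

def simulated_p_value(observed_heads, n_flips=100, n_sims=10_000):
    """Estimate the one-sided p-value for seeing `observed_heads` or more
    in n_flips tosses of a fair coin (the null hypothesis)."""
    extreme = 0
    for _ in range(n_sims):
        heads = sum(random.random() < 0.5 for _ in range(n_flips))
        if heads >= observed_heads:
            extreme += 1
    return extreme / n_sims

# 60 heads out of 100 is fairly unlikely under a fair coin
p = simulated_p_value(60)
```

Note that this number says nothing about the probability that the coin is fair; it only says how surprising the data would be if it were.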

The p-value is not the probability of falsely rejecting the null hypothesis. This error is a version of the so-called prosecutor's fallacy.

The p-value is not the probability that replicating the experiment would yield the same conclusion. Quantifying the replicability of an experiment was attempted through the concept of p-rep. The significance level, such as 0.05, is not determined by the p-value. Rather, the significance level is decided by the person conducting the experiment (with the value 0.05 widely used by the scientific community) before the data are viewed, and is compared against the calculated p-value after the test has been performed. (However, reporting a p-value is more useful than simply saying that the results were or were not significant at a given level, and allows readers to decide for themselves whether to consider the results significant.)

The p-value does not indicate the size or importance of the observed effect. The two do vary together, however: the larger the effect, the smaller the sample size required to obtain a significant p-value (see effect size).
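The interplay between effect size and sample size can be shown with a quick sketch of a two-sided z-test (assuming a known standard deviation of 1, purely for illustration; the function name is invented): a tiny effect becomes "significant" given a large enough sample, while a large effect may not reach significance with a small one.

```python
import math

def two_sided_p(effect_size, n):
    """Two-sided p-value for a one-sample z-test of a standardized
    effect (mean shift in units of sigma=1) with sample size n."""
    z = effect_size * math.sqrt(n)
    # P(|Z| > z) for a standard normal Z
    return math.erfc(abs(z) / math.sqrt(2))

tiny_effect_huge_sample = two_sided_p(0.05, 10_000)  # "significant"
large_effect_small_sample = two_sided_p(0.8, 10)     # also significant
tiny_effect_small_sample = two_sided_p(0.05, 10)     # far from significant
```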

How to keep Data Science "Scientific"

We talk a lot these days about data science, and how it will pave our paths with beautiful insights and unexpected new relations and connections in our given datasets, and even across datasets.

But how to maintain the "Science" part in "Data Science"? After some time working in this field I appreciate more and more the critical thinking which has characterized the progress in science.

Below is some food for thought, in a SlideShare presentation.

Hypothesis, facts, proving and/or disproving the thesis: this is how science has progressed over the past centuries. This method was formalized by Popper, who categorized as non-science all disciplines whose statements cannot be falsified. In other words, if a statement cannot be disproved, we cannot speak of science, since there is no mechanism left to verify the solution or to prove it wrong.

When that happens the argument can still be accepted, but not scientifically accepted. Ways of accepting or refuting a non-falsifiable statement are, for instance, based on aesthetic, authority-based, pragmatic or philosophical considerations. All valid, but not scientific. This applies, for instance, to statements in the disciplines of politics, theology, ethics, etc.

Science has definitely progressed since then. For instance, Bayesian networks and statistical induction are currently part of the (data) scientist's arsenal. But no matter how the baseline is set, critical thinking and a rigorous method are definitely helpful for understanding the results produced by science, in particular when it is based on large amounts of data and computational in nature, rather than formula/model driven.

Data Science currently has many different connotations. On one side it praises "artistry": the genius of laying out connections between disciplines and concepts. This is a truly great aspect of scientists, and creativity is definitely very welcome in all data science profiles.

Along with the fun of creating new insights and new data golden eggs, a data scientist has to put up with those annoying criteria of reproducibility, falsifiability and peer review. Sometimes these elements are postponed or left behind in the name of artistry. Granted, it's just hard to find metrics and baselines for comparing models and data science solutions. But the scientific method has proven solid over the centuries: it allows factual scientific discussion between scientists, and it allows selection between models based on objective, agreed criteria.

Big & Fast: A quest for relevant and real-time analytics

Now more than ever, the retail market demands that we stay close to our customers and carefully understand which services, products, and wishes are relevant for each customer at any given time.

This sort of marketing research is often beyond the capacity of traditional BI reporting frameworks. In this talk, we illustrate how we team up data scientists and big data engineers to create and scale distributed analyses on a big data platform. Big data is important, but speed, the capacity of the system to react in a timely fashion, is becoming increasingly important too. Which components and tools can help us create a big data platform that is also fast enough to keep up with the events affecting our customers' behaviour?

When marketing goes from push to ask (permission marketing), it is the user who grants the interaction. Permission marketing is the user's grant to be heard. To be effective and lead to conversions, it's important to provide the right suggestions at the right time. This is largely determined by the user's context when the interaction happens.

The context has some temporal scope; for a given person it is, at the same time:
  • slow changing: the defining characteristics of a person, his/her personality, memories, and past actions
  • fast changing: events which influence the person's behaviour and life, trends, ads, news, fast-paced information from friends, family and co-workers

A Distributed Data OS

The user's context is increasing in size and complexity. Thanks to cluster computing power, what could once be done once a month on a single server can now be done every day.

Distributed computing comes in all sorts of flavours, but Hadoop has now become the de facto open-source platform for distributed data processing. Why? It's convenient, resilient, and offers a good trade-off between costs (recurring and one-off) and resources (both computing and storage).

However, this does not bring the fast path of the user's recent events into the analysis. Hadoop can currently operate on hundreds of terabytes of data, but it requires time to process this information, and this big data slow path does not match the latency of responsive/reactive web applications and APIs.

The fast data path: how to process events

For that, a good component could be an Akka cluster: a reactive, distributed, near real-time framework which can process millions of events even on modest-sized clusters.

Advantages:
  • it scales horizontally (it can run in cluster mode)
  • it makes maximum use of the available cores/memory
  • the processing is non-blocking: a thread is re-used if a computation cannot proceed because of I/O or other blocking operations
  • the computation can be parallelized across many actors, reducing the overall latency of the system
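The core idea, parallelizing independent per-event work to cut overall latency, can be sketched in plain Python with a worker pool (a rough stand-in for a pool of actors, not Akka itself; the event fields and the scoring function are invented for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

def score_event(event):
    """A stand-in for per-event work, e.g. updating a user profile."""
    return {"user": event["user"], "score": event["clicks"] * 2}

events = [{"user": f"u{i}", "clicks": i} for i in range(100)]

# Fan the events out across a worker pool, akin to spreading messages
# over many actors; map preserves the input order of the results.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(score_event, events))
```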

Cassandra: a low latency data store

How to connect the two systems? Cassandra, as a distributed key-value store. Why? It's a low-latency data store; the system is resilient, with no single point of failure, and distributed across multiple nodes and data centers for high availability. Cassandra can be used as a "latency impedance" between the fast path and the slow path.

This sort of architecture is often referred to in the literature as the lambda architecture (although the original version proposed by Nathan Marz refers to the combination Hadoop/Storm, while here I am describing a system based on Hadoop/Cassandra/Akka). Cassandra can be used to store model parameters and preliminary results from Hadoop, as well as fast data and events.
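The serving side of such a lambda architecture can be sketched in a few lines of Python. Here plain dicts are a toy in-memory stand-in for the Cassandra store, and the view names and metrics are invented for illustration: a query merges the precomputed batch view with the realtime delta from the speed layer.

```python
# Batch view: precomputed on the slow path (e.g. a nightly Hadoop job)
batch_view = {"user_42": {"page_views": 1200}}

# Speed layer: recent events from the fast path (e.g. Akka),
# not yet absorbed into the batch view
realtime_view = {"user_42": {"page_views": 7}}

def serve(user_id):
    """Answer a query by merging the batch view with the realtime delta."""
    batch = batch_view.get(user_id, {})
    fast = realtime_view.get(user_id, {})
    return {k: batch.get(k, 0) + fast.get(k, 0)
            for k in set(batch) | set(fast)}
```

When the next batch job lands, its output replaces the batch view and the absorbed events are dropped from the speed layer.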

Monday, March 24, 2014

Counting words in and outside Hadoop

Counting word occurrences in documents seems to be the new pastime for software engineers and data scientists when they need to lay back a little. There are many ways to do it, and each language has its own pros and cons.

In this article, I illustrate basic word count when executed in single-node, single-threaded environments in various languages.
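For instance, a minimal single-node, single-threaded word count in Python (the regex tokenizer is just one possible choice of word boundary):

```python
import re
from collections import Counter

def word_count(text):
    """Lower-case the text, tokenize on letter runs, count occurrences."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

counts = word_count("Far far away, behind the word mountains, far from it")
```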

Next to that, I illustrate how to compute word count as a distributed process using Hadoop as a distributed run-time system.

Even within Hadoop itself there are many tools, such as Pig, Hive, Cascading, Scalding, etc. Things are getting even more interesting now that Tez, Spark, and Storm are pushing the boundaries of Hadoop beyond the original file-based map-reduce paradigm.
Feel free to clone/download the project at

It was quite a journey to discover how to do word count in so many languages and frameworks. I was definitely standing on the shoulders of giants here, so big kudos and due references to the great teams and programmers who came up with both the scenarios and the tutorials.

The file used for the word count, named lorem.txt, is "Far far away" filler text produced by an online lorem ipsum generator.

Build, Compile, and Run on Hadoop

My findings on installing and running map reduce on Hadoop: I installed Hadoop core, single-node installation, version 2.3.0, running with Oracle Java 7 and Scala 10.3 on elementary OS Luna with Linux kernel 3.2.0-60-generic.

Hadoop map reduce as well as streaming with Python was easy to reproduce. To run Pig I needed to recompile it against my Java 7 install. Cascading with Gradle worked once I understood that I needed to compile the jar using Java 6.

Scalding took me a bit longer, since at first I did not understand the mechanics of the "fat jar" compile. After digging a bit into the Scala assembly plugin, that demo went through as well.


R, shell, Scala, and Python on local files (the non-distributed versions): those were a breeze.

Hadoop: Pig
  • pros: concise, easy to start, good set of functions
  • cons: not easy to debug; the mixed case-sensitive/case-insensitive capitalization can be confusing

Hadoop: Hive
  • pros: excellent for sql like tables and operations
  • cons: not ideal for text processing and wordcount

Hadoop: mapreduce in Java
  • pros: very structured, no need for high-level tools
  • cons: verbose; you need to understand the fine print of mappers, reducers, and combiners

Hadoop: cascading
  • pros: clear design of a dataflow with pipes, sources, and sinks; concise Java programming
  • cons: introduces an extra level of abstraction on top of mapreduce

Hadoop: scalding
  • pros: very concise, efficiently binds scala function to distributed cascading
  • cons: not all type bindings are available; requires writing idiomatic functional programming

Tez, Spark and Storm: I haven't had time to include them, but these would definitely be interesting to put in the mix, in particular now that Hadoop is moving to YARN.






Most of the scripts here below require that you have the input file available in hdfs.

Mapreduce example jar

This one is a bit of a warm-up. Not really coding, but a quick check to verify that the Hadoop installation is actually working, and that the available $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0.jar toolbox works.

Mapreduce using Pig

Mapreduce using Hive

Mapreduce using Java and the Hadoop core libraries

This is probably the most famous mapreduce example: the mother of all mapreduce programs, using just the map and reduce Hadoop core libraries.

Streaming using python

An easy way to deal with mapreduce is to use the streaming jar package, also part of the installation, and define a map and a reduce action in Python.
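A rough local sketch of that idea, written as plain functions over line iterators rather than actual stdin/stdout scripts so it can be tried without a cluster (the Hadoop shuffle phase is simulated with a sort):

```python
from itertools import groupby

def mapper(lines):
    """Emit tab-separated (word, 1) pairs, as Hadoop streaming expects."""
    for line in lines:
        for word in line.strip().split():
            yield f"{word}\t1"

def reducer(sorted_pairs):
    """Sum the counts per word; the input must be sorted by key,
    which the shuffle phase guarantees on a real cluster."""
    for word, group in groupby(sorted_pairs,
                               key=lambda kv: kv.split("\t")[0]):
        total = sum(int(kv.split("\t")[1]) for kv in group)
        yield f"{word}\t{total}"

# Simulate the streaming pipeline locally: map -> sort -> reduce
pairs = sorted(mapper(["far far away", "far behind"]))
result = dict(line.split("\t") for line in reducer(pairs))
```

On a cluster the same two scripts would be passed to the streaming jar as the `-mapper` and `-reducer` commands, reading stdin and writing stdout.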

Mapreduce using cascading and java

Mapreduce using scalding and scala

Wednesday, March 12, 2014

Data Science: Try this at home!

Getting started

Big Data and APIs

Big Data: more users, more interactions, more data science
photo: by m_namkung
Many complex data science problems, such as sparse matrix factorization, have very concrete applications today. These algorithms constitute the foundation of machine learning engines which help us, the users, to filter, categorize and tag information. Two approaches can be taken to process the data: analysing large chunks of data (which mostly translates into batch, lengthy operations) or per-request, event-driven processing (which mostly translates into responsive, reactive APIs and streaming computation in real time).

These two approaches can be combined in a unified architecture where large inferences and broad analyses are performed in batch mode on a Hadoop YARN framework, while faster, reactive analytics are performed by actor- and event-based distributed computing in Akka.

Those two run-time frameworks have very different latency, throughput, data and compute characteristics. No/New SQL distributed database technologies can successfully be used to bridge those two run-time frameworks.



Parallel systems

Libraries and Tools

Data science and machine learning

Tuesday, March 4, 2014

Big Data: why you cannot pass on this offer any more.

Why big data? Traditional marketing is failing. People today are proficient at shunning intrusive marketing: via phone, and via email of course, but also in-app, in-video, and in-game unsolicited marketing is rejected time after time.

As Seth Godin has predicted, this revolution is called permission marketing. The king is not the seller but, more than ever before, the user. And this marketing revolution is centered on understanding the user: his history, his dreams and goals, his desires, his weaknesses, his feelings.

This is indeed possible, but it requires much more data. More importantly, it requires data from many different sources, and it requires correlating data across profiles, matching and comparing one user profile to all other profiles. This is quite different from the traditional aggregated analytics performed so far in BI departments. Personalized permission marketing requires your business to extract targeted analytical features which define the user's sentiment and intentions among hundreds of less relevant characteristics. And this is, in a nutshell, a big data analytical approach to marketing.

Failing to understand the customer, and failing to implement a data-driven, customer-centric, analytical big data engine, is equivalent to surrendering your business and declaring failure. It's not a matter of if you are going to provide personalized, big data-driven marketing and products. It's only a matter of when it will happen, and of hoping that your business can catch up with the aggressive crowd of young companies and their data-driven products and services.


Does your business need big data?
I would go for "yes" ...


Infographics porn

Infographics is definitely the new spreadsheet porn. Seriously, the pun is intended.

After the initial scan, I would suggest you take a closer look at a well-done analytics infographic. Jon Millward has data-mined more than 10,000 profiles of porn stars in the Internet Adult Film Database. The result is the big data of porn, a set of facts that may surprise you and challenge your perceptions about sex films on the Internet. This infographic also shows how much insight you can get from such a collection: the evolution of the industry, the target audience, and the evolution of the products and the demands.

You might also wonder how big is porn on the Internet. Take a site like YouPorn for instance. Quoting ExtremeTech's very insightful article:
To put that 800Gbps figure into perspective, the internet only handles around half an exabyte of traffic every day, which equates to around 50Tbps — in other words, a single porn site accounts for almost 2% of the internet’s total traffic. There are dozens of porn sites on the scale of YouPorn. The Internet really is for porn.
A very penetrating infographic by Jon Millward

Monday, March 3, 2014

Unsupervised Learning: Model Selection and Evaluation

In terms of model validation and selection, supervised learning is easy. You know what the outcome is; you devise the model, you train it, you test it, maybe several times, then you cross-validate it. And there you go: you know how good your model is. You can create model competitions, you can improve your model. Great.

But, wait a second. How do you achieve the same for unsupervised learning?

How to interpret unsupervised learning: from scikit
In unsupervised learning, there is no reference result. How good is the model at determining classes? Are those classes what you actually wish for as a user? How to interpret the quality of a model; that is, when is one model more discriminating than another?

The term "unsupervised" refers to the fact that there is no "target" to predict, and thus nothing resembling an accuracy measure to guide the selection of a best model.

This means that there is no objectively "correct" clustering algorithm; as has been noted, "clustering is in the eye of the beholder." The most appropriate clustering algorithm for a particular problem often needs to be chosen experimentally, unless there is a mathematical reason to prefer one cluster model over another.

Unsupervised learning can be looked at through several equivalent definitions:
  • Finding groups in data
  • Finding patterns in data
  • A form of data compression 
  • A form of multi-dimensional reduction
Regardless of the definition we choose, one central matter when dealing with unsupervised learning is how to measure the quality of the clustering. What does confidence mean in the context of feature/group mappings for unsupervised learning?

Unsupervised learning can be done in many ways, the most common being neural networks, clustering (k-means, etc.), and dimensionality reduction techniques such as PCA. Let us call the result a clustering, regardless of how the features were extracted.

Clustering classification

Clusterings can be roughly distinguished as: 
  • hard clustering: each object belongs to one cluster only. It is an onto mapping: each sample is associated with one and only one cluster.
  • soft clustering: also known as fuzzy clustering. Each object belongs to each cluster to a certain degree (e.g. a likelihood of belonging to the cluster) 
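The distinction can be made concrete with a toy 1-D example (the centroids are fixed by hand, and the softmax-over-distances membership is just one possible way to produce soft assignments):

```python
import math

centroids = [0.0, 10.0]

def hard_assign(x):
    """Hard clustering: one and only one cluster per sample."""
    return min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))

def soft_assign(x, temperature=2.0):
    """Soft clustering: a degree of membership for every cluster,
    here via a softmax over negative distances; memberships sum to 1."""
    weights = [math.exp(-abs(x - c) / temperature) for c in centroids]
    total = sum(weights)
    return [w / total for w in weights]
```

A point halfway between the centroids gets a 50/50 soft membership, whereas the hard assignment must pick a single side.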

Measuring quality

Internal evaluation methods

One way is to create some internal evaluation method. Such a method does not rely on any external knowledge; it simply describes a set of desired characteristics of the mapping:
  • by the definition of an optimization function (for instance, minimize SSE in k-means)
  • by creating an error metric

SSE method
  • plot the sum of squared errors for different numbers of clusters
  • SSE will monotonically decrease as we increase the number of clusters
  • the knee points on the curve suggest good candidates for an optimal number of clusters
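This elbow procedure can be sketched with a toy 1-D k-means (all names and the synthetic data are invented for illustration; the best SSE over a few random restarts is kept, since a single initialization can get stuck in a local optimum):

```python
import random

random.seed(0)
# Synthetic 1-D data with three well-separated clusters
data = [random.gauss(mu, 0.5) for mu in (0, 10, 20) for _ in range(30)]

def sse(points, centroids):
    """Sum of squared errors of each point to its nearest centroid."""
    return sum(min((x - c) ** 2 for c in centroids) for x in points)

def kmeans_sse(points, k, iters=20, restarts=5):
    """Minimal 1-D Lloyd's k-means; return the best SSE over restarts."""
    best = float("inf")
    for _ in range(restarts):
        centroids = random.sample(points, k)
        for _ in range(iters):
            clusters = [[] for _ in range(k)]
            for x in points:
                i = min(range(k), key=lambda j: (x - centroids[j]) ** 2)
                clusters[i].append(x)
            centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        best = min(best, sse(points, centroids))
    return best

# SSE drops sharply until k reaches the true number of clusters (3),
# then flattens: the knee of the curve
elbow = {k: kmeans_sse(data, k) for k in (1, 2, 3, 4)}
```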

Spectral clustering
Measure/maximize the eigengap

Penalty Method
Bayesian Information Criterion

Stability based method

• Stability: repeatedly produce similar clusterings on data originating from the same source.
• A high level of agreement among a set of clusterings indicates that the clustering model (k) is appropriate for the data.
• Evaluate multiple models, and select the model resulting in the highest level of stability.

External evaluation methods

If true class labels (ground truth) are known, the validity of a clustering can be verified by comparing the class labels with the clustering labels.

Convert it to a supervised model

  • by means of a panel (most of the time of humans / experts)
  • by means of ground truth (for instance accessing other data which classify the samples)

Some methods have been developed for evaluating unsupervised models against ground truth: refer to the Rand index and the adjusted Rand index. Purity and the normalized mutual information index can also be used to assess the quality of the model.
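As a sketch, both the Rand index and purity take only a few lines of Python: pair-counting for the former, majority class per cluster for the latter.

```python
from collections import Counter
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Fraction of sample pairs on which the two labelings agree
    (same cluster in both, or different cluster in both)."""
    n = len(labels_true)
    agree = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in combinations(range(n), 2)
    )
    return agree / (n * (n - 1) / 2)

def purity(labels_true, labels_pred):
    """Each cluster counts its most frequent true class;
    sum those counts and normalize by the number of samples."""
    clusters = {}
    for t, p in zip(labels_true, labels_pred):
        clusters.setdefault(p, []).append(t)
    return sum(Counter(members).most_common(1)[0][1]
               for members in clusters.values()) / len(labels_true)
```

Note that both are invariant to cluster relabeling: a clustering that matches the ground truth with permuted labels still scores 1.0.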



Clustering methodologies

Density estimation

Kernels separation

Slides and presentations