Principal Data Scientist, Director for Data Science, AI, Big Data Technologies. O’Reilly author on distributed computing and machine learning.

Natalino leads the definition, design and implementation of data-driven financial and telecom applications. He has previously served as Enterprise Data Architect at ING in the Netherlands, focusing on fraud prevention/detection, SoC, cybersecurity, customer experience, and core banking processes.

Prior to that, he worked as a senior researcher at Philips Research Laboratories in the Netherlands, on the topics of system-on-a-chip architectures, distributed computing, and compilers. He is an all-round technology manager, product developer, and innovator with a 15+ year track record in research, development, and management of distributed architectures, scalable services, and data-driven applications.

Tuesday, April 15, 2014

It's Significant! 100 years of FUD

What's wrong with statistics?

FUD stands for "fear, uncertainty and doubt". This fantastic comic by xkcd illustrates the confusion and uncertainty that arise when researchers and statisticians communicate with managers and PR advisors about concepts such as statistical significance, repeated testing, and the sizing of statistical experiments.

Misunderstandings and Criticisms

(adapted from Wikipedia)

Despite the ubiquity of p-value tests, this particular test for statistical significance has been criticized for its inherent shortcomings and the potential for misinterpretation.

The data obtained by comparing the p-value to a significance level will yield one of two results: either the null hypothesis is rejected, or the null hypothesis cannot be rejected at that significance level (which however does not imply that the null hypothesis is true). In Fisher's formulation, there is a disjunction: a low p-value means either that the null hypothesis is true and a highly improbable event has occurred, or that the null hypothesis is false.
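To make that binary decision rule concrete, here is a minimal sketch (a hypothetical fair-coin example, not from the post) that computes an exact two-sided binomial p-value and compares it to a pre-chosen significance level:

```python
from math import comb

def binom_pmf(k, n, p=0.5):
    """Probability of exactly k successes in n Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def two_sided_p(k, n):
    """Two-sided p-value under the null p = 0.5: sum the probabilities of
    all outcomes at least as unlikely as the observed one."""
    pk = binom_pmf(k, n)
    return sum(binom_pmf(i, n) for i in range(n + 1)
               if binom_pmf(i, n) <= pk + 1e-12)

alpha = 0.05                     # significance level, fixed before seeing data
p_value = two_sided_p(61, 100)   # e.g. 61 heads in 100 coin flips
print(p_value, "reject H0" if p_value < alpha else "cannot reject H0")
```

Note that "cannot reject H0" is the only other possible outcome; it never licenses the conclusion that the null hypothesis is true.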

However, people interpret the p-value in many incorrect ways, and try to draw other conclusions from p-values, which do not follow.

The p-value does not in itself allow reasoning about the probabilities of hypotheses; this requires multiple hypotheses or a range of hypotheses, with a prior distribution of likelihoods between them, as in Bayesian statistics, in which case one uses a likelihood function for all possible values of the prior, instead of the p-value for a single null hypothesis.
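As a contrast, here is a minimal Bayesian sketch (hypothetical numbers: two candidate coin biases with equal prior weight) that uses the likelihood function to attach posterior probabilities directly to the hypotheses, which the p-value alone cannot do:

```python
from math import comb

def likelihood(k, n, p):
    """Binomial likelihood of observing k heads in n flips given bias p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

k, n = 60, 100                    # observed data (hypothetical)
prior = {0.5: 0.5, 0.6: 0.5}      # two hypotheses, equal prior probability
unnorm = {p: prior[p] * likelihood(k, n, p) for p in prior}
z = sum(unnorm.values())
posterior = {p: w / z for p, w in unnorm.items()}
print(posterior)                  # probabilities of the hypotheses themselves
```

With 60 heads in 100 flips, the posterior favours the p = 0.6 hypothesis, a statement about hypotheses that has no frequentist counterpart.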

The p-value refers only to a single hypothesis, called the null hypothesis, and does not make reference to or allow conclusions about any other hypotheses, such as the alternative hypothesis in Neyman–Pearson statistical hypothesis testing. In that approach one instead has a decision function between two alternatives, often based on a test statistic, and one computes the rates of Type I and Type II errors, α and β. However, the p-value of a test statistic cannot be directly compared to these error rates; instead, it is fed into a decision function.
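A sketch of the Neyman–Pearson view, under assumed numbers (a one-sided z-test of H0: μ = 0 against the specific alternative H1: μ = 0.5, with σ = 1 and n = 25): α is fixed in advance, and β follows from the alternative.

```python
from math import erf, sqrt

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(x / sqrt(2)))

n, sigma = 25, 1.0
alpha = 0.05
z_alpha = 1.6449                       # 95th percentile of the standard normal
c = z_alpha * sigma / sqrt(n)          # reject H0 when the sample mean exceeds c

mu1 = 0.5                              # the specific alternative under H1
beta = Phi((c - mu1) / (sigma / sqrt(n)))   # Type II error rate: accept H0 | H1
power = 1 - beta
print(f"alpha={alpha}, beta={beta:.3f}, power={power:.3f}")
```

Both error rates are properties of the decision procedure, chosen and computed before any data arrive; no single observed p-value appears in them.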

There are several common misunderstandings about p-values.

The p-value is not the probability that the null hypothesis is true, nor is it the probability that the alternative hypothesis is false; it is not connected to either of these. In fact, frequentist statistics does not, and cannot, attach probabilities to hypotheses. Comparison of Bayesian and classical approaches shows that a p-value can be very close to zero while the posterior probability of the null is very close to unity (if there is no alternative hypothesis with a large enough a priori probability that would explain the results more easily). This is Lindley's paradox. But there are also a priori probability distributions in which the posterior probability and the p-value have similar or equal values.
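A numeric sketch of Lindley's paradox, under an assumed setup (5,100 heads in 10,000 flips; a point null p = 0.5 holding half the prior mass, the other half spread uniformly over all biases): the p-value falls below 0.05 while the posterior probability of the null exceeds 0.9.

```python
from math import lgamma, log, exp, erf, sqrt

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def log_binom_pmf(k, n, p):
    """Log of the binomial pmf, computed in log space to avoid underflow."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1 - p))

k, n = 5100, 10000
# Frequentist: two-sided p-value via the normal approximation (z = 2 here).
z = (k - n / 2) / sqrt(n / 4)
p_value = 2 * (1 - Phi(z))

# Bayesian: likelihood under the point null vs the marginal likelihood under
# a uniform prior on p, which integrates exactly to 1/(n+1) (a Beta integral).
lik_null = exp(log_binom_pmf(k, n, 0.5))
lik_alt = 1 / (n + 1)
posterior_null = 0.5 * lik_null / (0.5 * lik_null + 0.5 * lik_alt)
print(f"p-value={p_value:.4f}, P(H0 | data)={posterior_null:.3f}")
```

The same data are "significant at the 5% level" and simultaneously strong evidence for the null under this prior, which is exactly the paradox.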

The p-value is not the probability that a finding is "merely a fluke." As the calculation of the p-value assumes that every finding is a fluke (that is, the product of chance alone), it cannot be used to gauge the probability of a finding being true. The p-value is the chance of obtaining findings at least as extreme as those observed, given that the null hypothesis is true.
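That definition can be checked by simulation, using an assumed toy example (60 heads in 100 flips): generate data repeatedly under the null and count how often a result at least as extreme appears.

```python
import random

random.seed(42)

def simulated_p(k_obs, n, trials=20000):
    """Estimate the one-sided p-value P(heads >= k_obs | fair coin)
    by simulating the null hypothesis many times."""
    hits = 0
    for _ in range(trials):
        heads = sum(random.random() < 0.5 for _ in range(n))
        if heads >= k_obs:
            hits += 1
    return hits / trials

p_hat = simulated_p(60, 100)
print(p_hat)   # Monte Carlo estimate of the tail probability, roughly 0.03
```

The estimate converges on the exact tail probability; nothing in the computation says anything about how probable the null itself is.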

The p-value is not the probability of falsely rejecting the null hypothesis. This error is a version of the so-called prosecutor's fallacy.
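The fallacy is easy to see in a simulation with assumed numbers (90% of studied effects are truly null, the rest carry a real effect of μ = 0.5, tested with n = 25 at α = 0.05): far more than 5% of the "significant" results turn out to be false rejections.

```python
import random
from math import sqrt

random.seed(7)

n, z_alpha = 25, 1.6449           # one-sided z-test at alpha = 0.05
rejections, false_rejections = 0, 0

for _ in range(20000):
    null_is_true = random.random() < 0.9   # 90% of hypotheses tested are null
    mu = 0.0 if null_is_true else 0.5
    xbar = sum(random.gauss(mu, 1.0) for _ in range(n)) / n
    z = xbar * sqrt(n)
    if z > z_alpha:                        # "statistically significant"
        rejections += 1
        if null_is_true:
            false_rejections += 1

share = false_rejections / rejections
print(f"P(H0 was true | rejected) = {share:.2f}")   # far above 0.05
```

α bounds P(reject | H0), but the quantity people actually care about, P(H0 | reject), also depends on how often the null is true to begin with; conflating the two is the prosecutor's fallacy.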

The p-value is not the probability that replicating the experiment would yield the same conclusion. Quantifying the replicability of an experiment was attempted through the concept of p-rep. The significance level, such as 0.05, is not determined by the p-value. Rather, the significance level is decided by the person conducting the experiment (with the value 0.05 widely used by the scientific community) before the data are viewed, and is compared against the calculated p-value after the test has been performed. (However, reporting a p-value is more useful than simply saying that the results were or were not significant at a given level, and allows readers to decide for themselves whether to consider the results significant.)

The p-value does not indicate the size or importance of the observed effect. The two do vary together, however: the larger the effect, the smaller the sample size required to obtain a significant p-value (see effect size).
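A sketch under assumed numbers: the same observed effect (55% heads instead of 50%) is not significant at n = 100 yet overwhelmingly significant at n = 10,000, so a small p-value by itself says nothing about how large the effect is.

```python
from math import erf, sqrt

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def p_value(phat, n):
    """Two-sided p-value for H0: p = 0.5, normal approximation."""
    z = abs(phat - 0.5) / sqrt(0.25 / n)
    return 2 * (1 - Phi(z))

print(p_value(0.55, 100))     # same 5-point effect, not significant
print(p_value(0.55, 10000))   # same effect, effectively zero p-value
```

Only the sample size changed between the two calls; the effect size, 0.05, is identical, which is why the effect should always be reported alongside the p-value.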