Natalino Busa: Principal Data Scientist, Director for Data Science, AI, Big Data Technologies. O’Reilly author on distributed computing and machine learning.

Natalino leads the definition, design and implementation of data-driven financial and telecom applications. He has previously served as Enterprise Data Architect at ING in the Netherlands, focusing on fraud prevention/detection, SoC, cybersecurity, customer experience, and core banking processes.

Prior to that, he worked as a senior researcher at Philips Research Laboratories in the Netherlands, on the topics of system-on-a-chip architectures, distributed computing and compilers. All-round Technology Manager, Product Developer, and Innovator with a 15+ year track record in research, development and management of distributed architectures, scalable services and data-driven applications.

Tuesday, October 15, 2013

Sample statistics using R, ggplot2 and Monte Carlo simulations

Given N samples drawn from a normally distributed population, we can define the sample mean as:

$$ \bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i $$

In mathematical terms, given a random variable X with distribution F, a random sample of length N is a set of N independent, identically distributed (iid) random variables with distribution F.

In our case, provided that we select our N samples randomly, each of these samples is itself a normally distributed random variable. This means that the sample mean is itself a random variable.

Monte Carlo simulation

We can use the sample mean as an estimator of the true mean of the population. Let's repeatedly draw sets of 4 samples and calculate the statistics of the sample mean as a random variable:
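A minimal sketch of such an experiment in base R, assuming a standard normal population. The variable names (`n_samples`, `n_trials`, `sample_means`), the seed, and the choice of 10,000 repetitions are illustrative assumptions, not values taken from the post.

```r
# Monte Carlo sketch: distribution of the mean of 4 standard normal samples.
# n_samples, n_trials and the seed are illustrative assumptions.
set.seed(42)

n_samples <- 4      # size of each sample
n_trials  <- 10000  # number of simulated samples

# Each trial draws n_samples values from N(0, 1) and keeps their mean.
sample_means <- replicate(n_trials, mean(rnorm(n_samples)))

# Summary statistics of the sample mean, treated as a random variable.
cat("mean of sample means:", mean(sample_means), "\n")
cat("sd of sample means:  ", sd(sample_means), "\n")
```

The vector `sample_means` can then be passed to `hist()` or to a ggplot2 histogram to inspect the shape of the distribution.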

Sample mean

The sample mean is a random variable, and its outcome can be used as an estimator of the actual underlying mean. Why does the sample mean follow a normal distribution with a standard deviation of 1/2?

In this case, we know that the population follows a standard normal distribution, so $\mu = 0$ and $\sigma = 1$. We also know that each sample mean is built from a set of 4 samples. The sample mean is a random variable whose mean and variance are given by the following formulas:

${ Mean }\left( \overline { X }  \right) ={ E }\left[ \frac { 1 }{ N } \sum _{ i=1 }^{ N } X_{ i } \right] =\frac { N }{ N } \mu =\mu $

$ \operatorname{Var}\left(\overline{X}\right) = \operatorname{Var}\left(\frac {1} {N}\sum_{i=1}^N X_i\right) = \frac {1} {N^2}\sum_{i=1}^N \operatorname{Var}\left(X_i\right) = \frac {\sigma^2} {N} $


The bias, defined as the expected value of the sample mean minus the true mean, is zero.

$$ \begin{aligned} Bias(\bar{x}) &= E[\bar{x} - \mu] \\ &= E\left[\frac{1}{N}\sum_{i=1}^{N} x_i - \mu\right] = E\left[\frac{1}{N}\sum_{i=1}^{N} x_i\right] - \mu \\ &= \frac{N}{N}\mu - \mu = 0 \end{aligned} $$

Standard Error

The variance formula above was discovered by Bienaymé in 1853. It states that the variance of the sample mean decreases in proportion to 1/N, so its standard deviation decreases with the square root of the number of samples taken to build the estimator. Since in our case N=4 and the population standard deviation is 1, the standard deviation of the mean is 1/sqrt(4), hence 0.5.

$$ s\quad =\quad \frac { \sigma  }{ \sqrt { N }  }  $$

The standard error of the sample mean is indeed the square root of the variance of the sample mean:

$$ Precision(\bar{x}) = SE(\bar{x}) = \sqrt{Var(\bar{x})} = \sqrt{E[(\bar{x} - E[\bar{x}])^2]} = \frac{\sigma}{\sqrt{N}} = s $$
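To see the $\sigma/\sqrt{N}$ scaling numerically, here is a small base R sketch that compares the simulated standard deviation of the sample mean with the theoretical value for a few sample sizes. The sample sizes in `sizes`, the seed, and the number of trials are assumptions chosen for illustration.

```r
# Sketch: empirical vs. theoretical standard error for several sample sizes.
# sizes, n_trials and the seed are illustrative assumptions.
set.seed(1)

sigma    <- 1                 # population standard deviation
n_trials <- 10000             # simulated sample means per sample size
sizes    <- c(1, 4, 16, 64)   # sample sizes N to compare

# For each N, simulate n_trials sample means and take their standard deviation.
empirical_se <- sapply(sizes, function(n) {
  sd(replicate(n_trials, mean(rnorm(n, mean = 0, sd = sigma))))
})

# Theoretical standard error: sigma / sqrt(N).
theoretical_se <- sigma / sqrt(sizes)

print(data.frame(N = sizes,
                 empirical = empirical_se,
                 theoretical = theoretical_se))
```

Each fourfold increase in N should roughly halve the empirical standard error, in line with the formula.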

Estimation of the mean

If the true mean is unknown, we can use the sample mean to estimate it. In this case we rely on the statistics of the sample to assess the true mean. We have just seen that the sample mean is unbiased, but we have also seen that our estimate of the mean carries a certain error (the standard error).

In general, the mean squared error that we incur when estimating the mean is:

$$ \begin{aligned} MSE(\bar{x}) &= E[(\bar{x} - \mu)^2] = E[\bar{x}^2 - 2\bar{x}\mu + \mu^2] \\ &= Var(\bar{x}) + (E[\bar{x} - \mu])^2 \\ &= SE(\bar{x})^2 + Bias(\bar{x})^2 \end{aligned} $$
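The split of the mean squared error into a variance term and a squared bias term can be checked numerically. The base R sketch below assumes a standard normal population and samples of size 4. The seed, the number of trials, and all variable names are illustrative assumptions.

```r
# Sketch: check MSE = variance + bias^2 for the simulated sample mean.
# mu, sigma, n_samples, n_trials and the seed are illustrative assumptions.
set.seed(7)

mu        <- 0
sigma     <- 1
n_samples <- 4
n_trials  <- 10000

sample_means <- replicate(n_trials, mean(rnorm(n_samples, mean = mu, sd = sigma)))

# Mean squared error of the estimator against the true mean.
mse <- mean((sample_means - mu)^2)

# Bias: average estimate minus the true mean.
bias <- mean(sample_means) - mu

# Variance with a 1/n denominator (not var(), which divides by n - 1),
# so that the decomposition holds exactly on the simulated data.
variance <- mean((sample_means - mean(sample_means))^2)

cat("MSE:              ", mse, "\n")
cat("variance + bias^2:", variance + bias^2, "\n")
```

Because the sample mean is unbiased, the bias term is close to zero and the mean squared error is close to $\sigma^2/N = 0.25$.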

Given the statistics of the sample mean, there is a probability of about 95% (2 sigmas) that the mean of four samples falls in the range:

$$ Mean(\bar { X } )\quad \pm \quad 2\quad SD(\bar { X } ) $$

See the estimate below, obtained from the Monte Carlo simulation above:
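As a rough check of the 2-sigma range, the base R sketch below counts how often a simulated mean of four samples lands inside it. It assumes a standard normal population. The seed, the number of trials, and the variable names are illustrative assumptions.

```r
# Sketch: fraction of simulated sample means within mean +/- 2 standard errors.
# mu, sigma, n_samples, n_trials and the seed are illustrative assumptions.
set.seed(123)

mu        <- 0
sigma     <- 1
n_samples <- 4
n_trials  <- 10000

sample_means <- replicate(n_trials, mean(rnorm(n_samples, mean = mu, sd = sigma)))

# Theoretical standard error of the mean and the 2-sigma range around mu.
se    <- sigma / sqrt(n_samples)
lower <- mu - 2 * se
upper <- mu + 2 * se

# Share of simulated sample means that fall inside the range.
coverage <- mean(sample_means >= lower & sample_means <= upper)

cat("range:   [", lower, ",", upper, "]\n")
cat("coverage:", coverage, "\n")
```

For a normal distribution the exact 2-sigma coverage is about 95.4%, so the simulated fraction should land near that value.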