Standard error of a difference of sample means

The variance of the difference of two independent random variables is the sum of the two variances. We sample 30 observations (usually large enough, according to the CLT) from the winter and not-winter absolute daily price changes. The absolute daily changes themselves appear log-normal, but according to the Central Limit Theorem the distribution of the sample means will be approximately normal, assuming the sample size is large enough:
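A minimal sketch of that sampling step, assuming the daily changes live in a hypothetical data frame ng with date and change columns, and that winter means November through March (both assumptions):

```r
set.seed(1)

# ng is a hypothetical data frame of Henry Hub daily price changes
# since 2010, with columns `date` and `change`
winter_months <- c(11, 12, 1, 2, 3)  # assumed definition of winter
winter <- as.numeric(format(ng$date, "%m")) %in% winter_months

# Absolute daily changes for the two populations
abs_winter    <- abs(ng$change[winter])
abs_notwinter <- abs(ng$change[!winter])

# Sample 30 observations from each population
N <- 30
w_sample  <- sample(abs_winter, N)
nw_sample <- sample(abs_notwinter, N)
```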

Now calculate the standard error of the difference of the sample means (or the sum of the sample means):
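Continuing the sketch above, the variances of the two sample means add, so:

```r
# Standard error of the difference (or sum) of the sample means:
# the variances of the two sample means add
se <- sqrt(var(w_sample) / N + var(nw_sample) / N)
```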

Calculate the difference in sample means and then calculate the t-statistic. Dividing a random variable (diff) by its own standard error gives a new random variable with a standard error of 1:
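Continuing the sketch:

```r
# Difference in sample means, divided by its own standard error
diff  <- mean(w_sample) - mean(nw_sample)
tstat <- diff / se
```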

tstat should be approximately normal with mean 0 and standard error 1. Assuming tstat is normal, how often would a normally distributed random variable exceed it? The p-value is .107, which is not statistically significant at the common .05 level:
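Continuing the sketch, the two-sided tail probability under the standard normal gives the p-value:

```r
# How often would a standard normal exceed |tstat|? (two-sided)
2 * pnorm(-abs(tstat))
```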

Central Limit Theorem simulation

Again, reworking examples from the great Data Analysis for the Life Sciences using commodities data. This example is interesting because the authors show how to use sapply, where earlier examples used for loops. One of the great benefits of this book is its use of more advanced functional programming and iteration techniques, both of which appear in this example.

Here we take samples of different sizes from the absolute daily changes of winter and not-winter (!winter) natural gas prices and compare the differences. We know that the 2010-to-today difference in the means is 0.0233, and that the standard deviation of the sampling distribution should decrease as the sample size increases.

This code uses sapply to apply a user-defined function to each of the four sample sizes. It then uses a for loop to chart the results for each sample size, adding the sample average and standard deviation to each title.
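A sketch of that pattern, with hypothetical sample sizes and simulation count (the original values are not shown here), reusing abs_winter and abs_notwinter from the previous section:

```r
Ns <- c(5, 10, 30, 100)  # hypothetical sample sizes
B  <- 1000               # hypothetical number of simulations per size

# For each N, simulate B differences of sample means
res <- sapply(Ns, function(n) {
  replicate(B, mean(sample(abs_winter, n)) - mean(sample(abs_notwinter, n)))
})

# One histogram per sample size, with the sample average and
# standard deviation in each title
par(mfrow = c(2, 2))
for (i in seq_along(Ns)) {
  hist(res[, i],
       main = sprintf("N = %d, mean = %.4f, sd = %.4f",
                      Ns[i], mean(res[, i]), sd(res[, i])),
       xlab = "difference in sample means")
}
```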

Central Limit Theorem

Definition: when the sample size is large enough, the average of a random sample from a population follows a normal distribution centered at the population average, with a standard deviation (the standard error) equal to the population standard deviation divided by the square root of the sample size.

Equivalently, the sample average standardized by its standard error is approximated by a standard normal distribution:
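Written out, with $\bar{Y}$ the sample average, $\mu$ the population average, $\sigma$ the population standard deviation, and $N$ the sample size:

$$\frac{\bar{Y} - \mu}{\sigma / \sqrt{N}} \sim N(0, 1)$$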

In the natural gas example, we have two populations (winter and not winter), and we can define the null hypothesis as stating that the average daily price change of the two populations is the same. The null implies that the difference between the two average daily price changes is approximated by a normal distribution centered at zero, with a standard deviation equal to the square root of the sum of the two population variances, each divided by the sample size. This ratio is approximately N(0, 1):
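With $\bar{Y}$ and $\bar{X}$ the winter and not-winter sample averages, $\sigma_Y^2$ and $\sigma_X^2$ the two population variances, and $N$ the sample size:

$$\frac{\bar{Y} - \bar{X}}{\sqrt{\dfrac{\sigma_Y^2}{N} + \dfrac{\sigma_X^2}{N}}} \sim N(0, 1)$$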

This allows us to compute the percentage of standardized differences that should fall within or outside a given number of standard deviations.
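For example, under the standard normal approximation:

```r
# Proportion of standardized values expected within 1, 2, and 3
# standard deviations of zero
pnorm(1) - pnorm(-1)  # ~ 0.68
pnorm(2) - pnorm(-2)  # ~ 0.95
pnorm(3) - pnorm(-3)  # ~ 0.997
```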

If we compare a p-value computed by repeatedly sampling from the populations and taking the sample means with a p-value computed from the normal approximation, the results are very similar:
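One way to run that comparison (a sketch; the exact procedure here is an assumption), reusing the populations, diff, and tstat from the earlier sketches:

```r
# Null distribution by simulation: draw both samples from a pooled
# population, so any difference in means is due to chance alone
pooled <- c(abs_winter, abs_notwinter)
null_diffs <- replicate(10000,
  mean(sample(pooled, N)) - mean(sample(pooled, N)))

# Simulation-based p-value vs. the normal approximation
mean(abs(null_diffs) >= abs(diff))
2 * pnorm(-abs(tstat))
```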

The ratio above assumes we know the population standard deviations, but in practice we usually don't. The sample variances are defined as:
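With $M$ and $N$ the two sample sizes and $\bar{X}$, $\bar{Y}$ the sample averages:

$$s_X^2 = \frac{1}{M-1}\sum_{i=1}^{M}\left(X_i - \bar{X}\right)^2, \qquad s_Y^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(Y_i - \bar{Y}\right)^2$$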

This means that the ratio above is now (if the two sample sizes are the same):
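With both sample sizes equal to $N$:

$$\frac{\bar{Y} - \bar{X}}{\sqrt{\dfrac{s_X^2}{N} + \dfrac{s_Y^2}{N}}}$$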

The Central Limit Theorem says that when N is large enough this ratio is also approximated by a standard normal distribution, N(0, 1).

Population vs. Sample variance

Continuing with the natural gas data, we define our population as all Henry Hub daily prices since Jan 1, 2010. The formula for population variance is
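With $\mu$ the population average and $n$ the number of daily prices:

$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \mu\right)^2$$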

When using the "var" R function to calculate a population variance, we have to correct the result by multiplying by (n-1)/n. This is because "var" calculates the sample variance.

The standard deviation is just the square root of the variance.
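A minimal sketch of the correction, assuming the population of prices lives in a hypothetical vector ng$price:

```r
x <- ng$price  # hypothetical vector of all daily prices since Jan 1, 2010
n <- length(x)

# var() computes the sample variance (divides by n - 1),
# so multiply by (n - 1) / n to get the population variance
pop_var <- var(x) * (n - 1) / n

# Equivalent direct calculation of the population variance
mean((x - mean(x))^2)

# The standard deviation is just the square root
pop_sd <- sqrt(pop_var)
```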