
Central Limit Theorem

Definition: when the sample size is large enough, the average of a random sample from a population is approximately normally distributed, centered at the population average, with a standard deviation (the standard error) equal to the population standard deviation divided by the square root of the sample size.

In other words, the difference between the sample average and the population average, divided by the standard error, is approximated by a standard normal distribution:

$$\frac{\bar{X} - \mu}{\sigma / \sqrt{N}} \sim N(0, 1)$$
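
A quick simulation makes the point; the sketch below uses a made-up skewed population rather than the natural gas data (pop, N, and the replication counts are illustrative):

# Sketch: sample means from a skewed population still standardize to roughly N(0, 1)
set.seed(1)
pop <- rexp(100000, rate = 2)    # a decidedly non-normal population
N <- 50                          # sample size
sample.means <- replicate(10000, mean(sample(pop, N)))
z <- (sample.means - mean(pop)) / (sd(pop) / sqrt(N))
mean(abs(z) > 1.96)              # should be close to 0.05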

In the natural gas example, we have two populations (winter and not winter), and the null hypothesis states that the average daily price change is the same in both. Under the null, the difference between the two sample averages is approximately normal, centered at zero, with a standard error equal to the square root of the sum of each population variance divided by its sample size. Standardizing gives a ratio that is approximately N(0, 1):

$$\frac{\bar{X}_{winter} - \bar{X}_{not\,winter}}{\sqrt{\sigma_{winter}^2 / N_{winter} + \sigma_{not\,winter}^2 / N_{not\,winter}}} \sim N(0, 1)$$

This allows us to compute the percentage of price changes that should be within or outside certain standard deviations.

If we calculate a p-value two ways, first by simulation (sampling from the two populations and comparing sample means) and then with the normal approximation, the results are very similar:
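
Here is a sketch of what that comparison could look like, assuming ng.winter$ng1 and ng.not.winter$ng1 hold the daily price changes defined in the simulation post further down (the sample size N is illustrative):

# Sketch: p-value for a difference in sample means, by simulation and by the CLT
set.seed(1)
N <- 25
obs.diff <- mean(sample(ng.winter$ng1, N)) - mean(sample(ng.not.winter$ng1, N))

# simulation under the null: draw both samples from the non-winter population
null.diff <- replicate(10000, {
  mean(sample(ng.not.winter$ng1, N)) - mean(sample(ng.not.winter$ng1, N))
})
mean(null.diff >= obs.diff)              # simulated p-value

# normal approximation: the difference of means is ~ N(0, sqrt(var1/N + var2/N))
se <- sqrt(var(ng.winter$ng1)/N + var(ng.not.winter$ng1)/N)
1 - pnorm(obs.diff, mean = 0, sd = se)   # CLT-based p-value, should be similar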

The ratio above assumes we know the population standard deviations, but we usually don't. Instead we use the sample standard deviations, defined as:

$$s = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} \left(X_i - \bar{X}\right)^2}$$

This means that the ratio above becomes (if the two sample sizes are both N):

$$\frac{\bar{X}_{winter} - \bar{X}_{not\,winter}}{\sqrt{s_{winter}^2 / N + s_{not\,winter}^2 / N}}$$

The Central Limit Theorem says that when N is large enough this ratio is also approximated by a standard normal distribution, N(0, 1).
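
In practice this ratio is essentially what a two-sample t-test computes. A small sketch with simulated data (x, y, and N are illustrative, not the natural gas data):

# Sketch: the ratio with sample standard deviations is the two-sample t statistic
set.seed(1)
N <- 50
x <- rnorm(N, mean = 0.1)    # illustrative "winter" sample
y <- rnorm(N, mean = 0.0)    # illustrative "not winter" sample
ratio <- (mean(x) - mean(y)) / sqrt(sd(x)^2/N + sd(y)^2/N)
2 * (1 - pnorm(abs(ratio)))  # two-sided p-value from the normal approximation
t.test(x, y)$p.value         # close to the above when N is large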

Population vs. Sample variance

Continuing with the natural gas data, we define our population as all Henry Hub daily prices since Jan 1, 2010. The formula for the population variance is:

$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} \left(x_i - \mu\right)^2$$

When using the "var" R function to calculate a population variance we have to correct it by multiplying by (n-1)/n. This is because "var" calculates the sample variance, which divides by n-1 rather than n.

The standard deviation is just the square root of the variance.
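
A sketch of that correction, assuming the daily price changes are in a column called ng1 (the column name is an assumption):

# Sketch: var() divides by (n - 1), so rescale to get the population variance
x <- ng1$ng1                      # assumed column of daily price changes
n <- length(x)
pop.var <- var(x) * (n - 1) / n   # population variance
pop.sd  <- sqrt(pop.var)          # population standard deviation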

Natural gas winter volatility p-value simulation

Natural gas prices are more volatile during the winter heating season. I’ve adapted the code from Data Analysis for the Life Sciences to use natural gas daily price change and simulate a p-value.

Actual monthly natural gas volatility since 2010:

ng1 %>% group_by(month) %>% summarize(st.dev = sd(ng.diff)) # find the st. dev. of each month's price changes

Separate daily natural gas prices into winter and not winter:

ng.not.winter <- ng1 %>% dplyr::filter(!month %in% c("1", "2")) # control group
ng.winter <- ng1 %>% dplyr::filter(month %in% c("1", "2")) # treatment group

Find the actual difference in price volatility (standard deviation of daily price changes):

month.diff <- sd(ng.winter$ng1) - sd(ng.not.winter$ng1)

Create a null distribution by sampling:

set.seed(1)
n <- 10000
samp.size <- 12
null <- vector("numeric", n) # vector for the null distribution: differences between the standard deviations of samples drawn from the same population

for(i in 1:n) {
ng.not.winter.sample <- sample(ng.not.winter$ng1, samp.size) #sampling from population of control
ng.winter.sample <- sample(ng.not.winter$ng1, samp.size) #sampling from population of control
null[i] <- sd(ng.winter.sample) - sd(ng.not.winter.sample)
}

mean(null >= month.diff) #p-value

Not below 5%, but not bad. The p-value would likely decrease if we increased the sample size.

About 10% of the null differences (both samples drawn from the non-winter population) are greater than or equal to the actual difference. That proportion is the p-value.
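
One way to check the sample-size point is to rerun the simulation with a larger samp.size (a sketch reusing the objects defined above; the new size is arbitrary):

# Sketch: same null simulation with a larger sample size per group
set.seed(1)
samp.size.2 <- 50
null.2 <- replicate(n, {
  sd(sample(ng.not.winter$ng1, samp.size.2)) -
    sd(sample(ng.not.winter$ng1, samp.size.2))
})
mean(null.2 >= month.diff)   # expect a smaller p-value than above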

hist(null, freq=TRUE)
abline(v=month.diff, col="red", lwd=2)

The histogram looks roughly normal, so a normal approximation should give a similar answer:

1 - pnorm(month.diff, mean(null), sd(null))

Return calculations in R

Discrete returns the long way. This assumes the prices are in an xts object. stats::lag is specified since dplyr can mask the stats lag function with its own lag function:

head(diff(prices.xts)[-1]/stats::lag(prices.xts, 1)) # drop the leading NA from diff

The easy way, with the PerformanceAnalytics package:

library(PerformanceAnalytics)
head(Return.calculate(prices.xts, method = "discrete"))

Check they’re the same:
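
One way to do that check (a sketch; ret.long and ret.easy are just illustrative names for the two results above):

# Sketch: confirm the manual calculation matches PerformanceAnalytics
ret.long <- na.omit(diff(prices.xts)/stats::lag(prices.xts, 1))
ret.easy <- na.omit(Return.calculate(prices.xts, method = "discrete"))
all.equal(coredata(ret.long), coredata(ret.easy), check.attributes = FALSE)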

Log returns the long way, this time removing NAs:

na.omit(diff(log(prices.xts)))

With PerformanceAnalytics:

prices.ret <- na.omit(Return.calculate(prices.xts, method = "log"))

Convert to a data.frame, rename date column, remove NAs if any:

data.frame(index(prices.ret), coredata(prices.ret)) %>% dplyr::rename(date = index.prices.ret.) %>% na.omit()