Variance

We are interested in differences in people: human variability. The whole of statistics is built from that observation. So we start by seeing how to talk about variability. We want to find a way of saying how much variability there is in a set of numbers. We want the measure of variability to:

  • reflect all of the differences in the data
  • get bigger as the variability increases

We will use a simple maths convention to help us here. We have n data points. We will use the symbol x_i to refer to the ith data point, and i can be any number between 1 and n. So x_1 is the first data point.

Step 1: just look at all the differences we have
Let’s start with the first data point, x_1. A good measure of the difference between it and any other data point, x_i, is given by:
(x_1 -x_i)^2
So, the sum of all the differences between x_1 and all the other points can be written as:
\sum\limits_{i=1}^{n}(x_1-x_i)^2
Now we do a bit of algebra: we expand the expression and get:
\sum\limits_{i=1}^{n}(x_1^2-2x_1x_i+x_i^2)
and then a little further:
\sum\limits_{i}(x_1^2)-\sum\limits_{i}(2x_1x_i)+\sum\limits_{i}(x_i^2)
which can be simplified as:
nx_1^2-2x_1\sum\limits_{i}x_i+\sum\limits_{i}x_i^2
This is the sum of all the differences between x_1 and other points.

Now we do the same thing for all the other data points. We are going to end up with the sum of all the possible pairwise squared differences in our data. We will replace the subscript 1 by the subscript j, to indicate that we are now doing this for all data points:
\sum\limits_{j}(nx_j^2-2x_j\sum\limits_{i}x_i+\sum\limits_{i}x_i^2)
And we do the same routine, expand and then simplify to get this:
n\sum\limits_{j}x_j^2-2\sum\limits_{j}x_j\sum\limits_{i}x_i+n\sum\limits_{i}x_i^2
Since \sum\limits_{j}x_j=\sum\limits_{i}x_i, we can simplify further:
2n\sum\limits_{i}x_i^2-2(\sum\limits_{i}x_i)^2

One last step and we are done. We have counted every difference twice, once from i to j and once from j to i, so we must divide by 2. We must also divide by n, the number of points, so that we get an average rather than a total that just grows as we add more points; there is one summation over i and one over j, so we divide by n twice. Doing this gives us a simple formula for the variability in our data:
\frac{\sum\nolimits x^2}{n}-\left(\frac{\sum\nolimits x}{n}\right)^2

This formula calculates a quantity known as the Variance. It has some amazingly useful properties.
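The algebra above can be checked numerically. Here is a minimal sketch in Python (the data values are made up purely for illustration): it adds up every pairwise squared difference directly, divides by 2 and by n twice as described, and compares the result with the shortcut formula.

```python
# Check: the sum of every pairwise squared difference, divided by 2 and by n twice,
# equals the shortcut formula sum(x^2)/n - (sum(x)/n)^2.
x = [4.0, 7.0, 1.0, 9.0, 5.0]   # made-up data, purely for illustration
n = len(x)

# every pairwise squared difference (each pair is counted twice: i to j and j to i)
pairwise = sum((xj - xi) ** 2 for xj in x for xi in x)

variance_from_pairs = pairwise / (2 * n * n)
variance_shortcut = sum(v ** 2 for v in x) / n - (sum(x) / n) ** 2

print(variance_from_pairs, variance_shortcut)   # the two numbers agree
```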

Step 2: differences as deviations
We can also think of each data point as being a deviation from some typical value. The only hard part of this is finding what to use as the typical value. As luck would have it, it is going to turn out to be the mean of the original numbers.

We will use the symbol X for the typical value. The squared deviation is then:
\left(x_i-X\right)^2
and the average squared deviation, which we will call a, is:
a=\frac{1}{n}\sum\limits_{i=1}^{n}\left(x_i-X\right)^2

Now we want to make sure this average isn’t arbitrarily big, which it might be if we set X to some arbitrary number. So we are going to find what we should set X to be, so that the overall average squared deviation is at its smallest. We do this with calculus, which has a standard method for finding the minimum of a function like this.
1. We differentiate the formula:
\frac{\partial a}{\partial X}=\frac{1}{n}\left(0-2\sum\nolimits x _i+ 2\sum X\right) 
2. We equate \frac{\partial a}{\partial X} to zero to find the turning point (as a function of X, a is a parabola that opens upwards, so this turning point is the minimum of a):
\frac{\partial a}{\partial X}=\frac{1}{n}\left(0-2\sum\nolimits x _i+ 2\sum X\right)=0 
3. We solve this for X:
i. we rearrange:
\sum\nolimits x _i= \sum\nolimits X 
ii. we replace ΣX by nX:
\sum\nolimits x _i= nX 
iii. and simplify:
X=\frac{\sum\nolimits x _i}{n} 

So the typical value that we calculate deviations from turns out to be what we normally call the mean:
mean(x_i)=\frac{\sum\nolimits x _i}{n} 
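The calculus can be double-checked numerically: the sketch below (Python again, with the same made-up data) evaluates the average squared deviation for a grid of candidate values of X and confirms that the smallest value occurs at the mean.

```python
# Check: the average squared deviation (1/n) * sum((x_i - X)^2) is smallest
# when X is the mean of the data.
x = [4.0, 7.0, 1.0, 9.0, 5.0]          # made-up data, purely for illustration
n = len(x)
mean_x = sum(x) / n                    # 5.2 for this data

def avg_sq_deviation(X):
    return sum((xi - X) ** 2 for xi in x) / n

# try a grid of candidate values of X spanning the data
candidates = [i / 100 for i in range(0, 1001)]   # 0.00, 0.01, ..., 10.00
best_X = min(candidates, key=avg_sq_deviation)

print(mean_x, best_X)   # the best candidate is the mean
```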

The average squared deviation about the mean is the variance that we have already found, just written differently:
\frac{1}{n}\sum\limits_{i}\left(x_i-mean(x)\right)^2=\frac{\sum\nolimits x^2}{n}-\left(\frac{\sum\nolimits x}{n}\right)^2

Note that this expression tells you the order in which to do things:
1. subtract mean(x) from each x_i to get the set of deviations
2. square each deviation to get the set of squared deviations
3. sum the squared deviations and divide by n to get a single value: the variance
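To make that order concrete, here is a short sketch (Python, same made-up data) that follows the three steps literally and then checks the answer against the formula from Step 1.

```python
# Variance computed by following the three steps literally, then checked
# against the Step 1 shortcut formula.
x = [4.0, 7.0, 1.0, 9.0, 5.0]                       # made-up data, purely for illustration
n = len(x)
mean_x = sum(x) / n

deviations = [xi - mean_x for xi in x]              # 1. subtract the mean from each point
squared_deviations = [d ** 2 for d in deviations]   # 2. square each deviation
variance = sum(squared_deviations) / n              # 3. sum and divide by n

shortcut = sum(v ** 2 for v in x) / n - (sum(x) / n) ** 2
print(variance, shortcut)   # same value (up to floating-point rounding)
```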

Finally, we can simplify the next few steps with one more definition. We will write d_i for the deviation of the point x_i from the mean:
d_i=x_i-mean(x)

This allows us to rewrite the formula for the mean in a way that shows us that the sum of deviations from the mean is always zero:
\sum\limits_{i}d_i=\sum\limits_{i}\left(x_i-mean(x)\right)=\sum\limits_{i}x_i-n\,mean(x)=\sum\limits_{i}x_i-\sum\limits_{i}x_i=0

QED
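This result is easy to confirm numerically as well; a quick check in Python with the same made-up data:

```python
# Check: the deviations from the mean sum to zero (up to floating-point rounding).
x = [4.0, 7.0, 1.0, 9.0, 5.0]      # made-up data, purely for illustration
mean_x = sum(x) / len(x)
d = [xi - mean_x for xi in x]      # d_i = x_i - mean(x)
print(sum(d))                      # essentially 0.0
```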

Comment
This variance is going to be central to everything we do. Although we found the mean along the way, we won’t really ever need it again. (And yes, most textbooks make a lot of the mean – they are just wasting your time).