Screen Link:
https://app.dataquest.io/m/305/the-mean/11/the-sample-mean-as-an-unbiased-estimator
This is from mission 305, screen 11. In the Learn section, we had an example with X = [0, 3, 6] and calculated that the mean of the sample means equals the population mean. We were told that this holds true for any other distribution of real numbers.
While it’s intuitively true, it felt a bit muddled in my head, so I decided to try to prove it mathematically. I had fun and definitely have more clarity after this. It’s a bit long and I might not have done a good job writing it down, but I still want to share it with you guys.
Before we go into any code, imagine an extreme case where every element in the population is the same number, which is inevitably the mean of the population. All the samples will then be the same and have the same mean, so of course the mean of the sample means will be the same as the population mean.
This already shows the idea in one special case, but I also worked out a more general proof based on my solution below.
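As a quick check of the extreme case described above, here is a minimal sketch. The constant population [5, 5, 5, 5] is made up for illustration, and itertools.permutations is used only as a convenient way to draw every ordered sample of size 2 without replacement.

```python
from itertools import permutations

# Hypothetical constant population: every element is the same number.
population = [5, 5, 5, 5]

# Every ordered sample of size 2, drawn without replacement (by position).
samples = list(permutations(population, 2))
sample_means = [sum(s) / len(s) for s in samples]

mean_of_sample_means = sum(sample_means) / len(sample_means)
population_mean = sum(population) / len(population)

# All sample means are identical, so their mean is the population mean.
print(mean_of_sample_means == population_mean)
```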
I took a different route in this mission: rather than listing all the combinations of samples like the answer does, I used for loops to generate the samples. I also added variables like iteration to count the number of iterations in the loops below.
My Code:
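population = [3, 7, 2]
means = []       # mean of each sample
iteration = 0    # number of samples generated
samples = []     # every ordered sample of size 2

# Pair each element with every other element (sampling without replacement).
# Note: i != n compares values, so this assumes the population has no duplicates.
for i in population:
    for n in population:
        if i != n:
            iteration += 1
            means.append((i+n)/2)
            samples.append([i, n])

sample_mean = sum(means)/len(means)
unbiased = (sum(population)/len(population)) == sample_mean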
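One caveat about the loops above: comparing values with i != n skips valid samples when the population contains repeated values, for example [3, 3, 7]. Below is a sketch of a variant that compares positions instead. The function name mean_of_sample_means and the population [3, 3, 7, 2] are hypothetical, and Fraction is used so the final equality check is exact rather than subject to floating-point rounding (it assumes integer population values).

```python
from fractions import Fraction

def mean_of_sample_means(population):
    """Mean of the means of all ordered size-2 samples without replacement."""
    means = []
    for i, a in enumerate(population):
        for j, b in enumerate(population):
            if i != j:  # compare positions, not values
                means.append(Fraction(a + b, 2))
    return sum(means) / len(means)

# Hypothetical population with a repeated value.
population = [3, 3, 7, 2]
population_mean = Fraction(sum(population), len(population))
unbiased = mean_of_sample_means(population) == population_mean
print(unbiased)
```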
I experimented by adding numbers to the population list. Here are my steps for proving that the mean of the sample means is equal to the population mean:
Dictionary:
population: The population the samples are drawn from.
samples: A list of all combinations of samples from population.
iteration: The number of iterations in the loops above. Also equals len(samples).
pop_len: Population length.
sample_size: Size of each sample. sample_size = 2 in this mission.
means: A list of sample means.
element_iter_times: The number of times each element in the population gets picked. It’s the same for every element.
Steps:
- Each of the iteration samples contains sample_size elements, so there are iteration * sample_size picks in total. Since every element in the population gets picked an equal number of times, element_iter_times equals iteration * sample_size / pop_len.
- So sum(means) is equal to sum(element_iter_times * population) / sample_size, which can also be written as sum(population) * element_iter_times / sample_size.
- Let’s plug the equivalent of element_iter_times into the equation above: sum(means) == (sum(population) * (iteration * sample_size / pop_len)) / sample_size, which equals sum(population) * iteration / pop_len.
- We already know that len(means) equals iteration.
- So sum(means) / len(means) is equal to (sum(population) * iteration / pop_len) / iteration.
Voilà! Here we go: sum(means) / len(means) == sum(population) / pop_len!
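The steps above can also be checked numerically. This is only a sketch under a few assumptions: the five-element population is made up, the samples are ordered pairs drawn without replacement via itertools.permutations, and Fraction is used to keep every equality exact.

```python
from fractions import Fraction
from itertools import permutations

population = [3, 7, 2, 10, 4]  # hypothetical values, all distinct
sample_size = 2
pop_len = len(population)

samples = list(permutations(population, sample_size))
iteration = len(samples)
means = [Fraction(sum(s), sample_size) for s in samples]

# Step 1: every element is picked the same number of times.
element_iter_times = Fraction(iteration * sample_size, pop_len)
picks = [sum(s.count(x) for s in samples) for x in population]
assert all(p == element_iter_times for p in picks)

# Steps 2 and 3: rewrite sum(means) in terms of sum(population).
assert sum(means) == sum(population) * element_iter_times / sample_size
assert sum(means) == Fraction(sum(population) * iteration, pop_len)

# Steps 4 and 5: dividing by len(means) gives the population mean.
assert len(means) == iteration
assert sum(means) / len(means) == Fraction(sum(population), pop_len)
```

If any step of the argument were wrong for this population, the corresponding assert would raise an AssertionError.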
On a side note, with sample_size = 2 you will find iteration == pop_len ** sample_size - pop_len for sampling without replacement, and iteration == pop_len ** sample_size for sampling with replacement.
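The number of samples can be counted by brute force. The sketch below assumes ordered samples of size 2 and uses itertools.permutations for sampling without replacement and itertools.product for sampling with replacement. The counts dictionary is just a hypothetical container for the results. For size 2 the counts come out as pop_len * (pop_len - 1) and pop_len ** 2 respectively.

```python
from itertools import permutations, product

sample_size = 2
counts = {}  # pop_len -> (without replacement, with replacement)

for pop_len in range(2, 8):
    population = list(range(pop_len))
    without_replacement = len(list(permutations(population, sample_size)))
    with_replacement = len(list(product(population, repeat=sample_size)))
    counts[pop_len] = (without_replacement, with_replacement)

    # Ordered pairs without replacement: pop_len choices, then pop_len - 1.
    assert without_replacement == pop_len * (pop_len - 1)
    # Ordered pairs with replacement: pop_len choices for each position.
    assert with_replacement == pop_len ** sample_size
```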
I hope this is an interesting read and helps fellow learners like me who find ’The mean of the sample means is equal to the population mean’ as confusing as it is true.