Prove 'The mean of the sample means is equal to the population mean'

Screen Link:
https://app.dataquest.io/m/305/the-mean/11/the-sample-mean-as-an-unbiased-estimator

This is from mission 305, screen 11. In the Learn section, we had an example population X = [0, 3, 6] and calculated that the mean of the sample means equals the population mean. We were told that this holds true for any other distribution of real numbers.

While it’s intuitively true, it feels a bit muddled in my head. So I decided to try to prove it mathematically. I had fun and definitely have more clarity after this. It’s a bit long, and I might not have done a good job writing it down, but I still want to share it with you guys. :relaxed:

Before we go into any code, imagine an extreme case, where every element in the population is the same number, which is inevitably the mean of the population. So all the samples will be the same, have the same mean, and of course, the mean of the sample means will be the same as the population mean. :grinning:
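The extreme case can be checked with a short sketch. The population [5, 5, 5] is a made-up example, and itertools.permutations is used here only as a convenient way to list every ordered sample of size 2 without replacement.

```python
from itertools import permutations

# Hypothetical population where every element is the same number.
population = [5, 5, 5]

# permutations() treats elements as distinct by position, not by value,
# so equal values still produce all 3 * 2 = 6 ordered samples.
sample_means = [sum(s) / len(s) for s in permutations(population, 2)]

population_mean = sum(population) / len(population)
mean_of_sample_means = sum(sample_means) / len(sample_means)

print(len(sample_means), mean_of_sample_means, population_mean)  # 6 5.0 5.0
```

Every sample mean is 5.0, so their mean is 5.0 as well, which matches the population mean.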

This special case only illustrates the claim; it does not prove it in general. So I worked out a more general argument based on my solution below.

I took a different route in this mission: rather than listing all the combinations of samples like the answer does, I used for loops to generate the samples. I also added variables like iteration to count the number of iterations in the loops below.

My Code:

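population = [3, 7, 2]
means = []
iteration = 0
samples = []

# Compare positions, not values, so that repeated values in the
# population still count as separate elements.
for i, x in enumerate(population):
    for j, y in enumerate(population):
        if i != j:
            iteration += 1
            means.append((x + y) / 2)
            samples.append([x, y])

sample_mean = sum(means) / len(means)
# Exact for this population; float rounding could break == for other inputs.
unbiased = (sum(population) / len(population)) == sample_mean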
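To try other populations, the same check can be sketched as a function. This is one possible variant, not the mission's answer: itertools.permutations stands in for the nested loops, fractions.Fraction keeps the arithmetic exact so that == is safe, and the function names and the extra populations are made up for illustration.

```python
from fractions import Fraction
from itertools import permutations


def mean_of_sample_means(population, sample_size=2):
    """Mean of the means of all ordered samples drawn without replacement."""
    sample_means = [
        Fraction(sum(sample), sample_size)
        for sample in permutations(population, sample_size)
    ]
    return sum(sample_means) / len(sample_means)


def population_mean(population):
    """Exact population mean as a Fraction."""
    return Fraction(sum(population), len(population))


# Hypothetical populations, including one with repeated values.
for pop in ([3, 7, 2], [3, 7, 2, 10], [1, 1, 4, 9, 9]):
    print(pop, mean_of_sample_means(pop) == population_mean(pop))
```

Each line should print True. Changing sample_size (up to the population length) is a quick way to see that the result does not depend on samples of size 2.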

I experimented and added numbers to the population list. Here are my steps for proving that the mean of the sample means is equal to the population mean:

Glossary:
population: The population the samples are drawn from
samples: A list of all ordered samples drawn from population without replacement
iteration: The number of times the loops above produce a sample. Also equals len(samples).
pop_len: Population length
sample_size: Size of each sample. sample_size = 2 in this mission
means: A list of sample means.
element_iter_times: The number of times each element in the population gets picked across all samples. It’s the same for every element.
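The claim that every element gets picked the same number of times can be checked by counting. This is a minimal sketch: it samples positions instead of values so that duplicates would be counted separately, and the names index_samples and picks are made up for illustration.

```python
from collections import Counter
from itertools import permutations

population = [3, 7, 2]
sample_size = 2

# Every ordered sample of positions, drawn without replacement.
index_samples = list(permutations(range(len(population)), sample_size))

# How many times each position shows up across all samples.
picks = Counter(idx for sample in index_samples for idx in sample)

for idx, count in sorted(picks.items()):
    print(population[idx], "picked", count, "times")
```

For this population each element is picked 4 times, spread over the 6 samples.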

Steps:

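  • Each sample contains sample_size elements, so there are iteration * sample_size picks in total. Since every element in the population gets picked an equal number of times:
    element_iter_times equals iteration * sample_size / pop_len

  • Each sample mean is the sum of its elements divided by sample_size, so sum(means) is equal to sum(element_iter_times * x for x in population) / sample_size, which can also be written as sum(population) * element_iter_times / sample_size.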

  • Let’s substitute the expression for element_iter_times into the equation above:
    sum(means) == (sum(population) * (iteration * sample_size / pop_len)) / sample_size, which equals sum(population) * iteration / pop_len

  • We already know that len(means) equals iteration

  • So sum(means) / len(means) is equal to (sum(population) * iteration / pop_len) / iteration.
    Voilà! There we go, sum(means) / len(means) == sum(population) / pop_len!
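The steps above can be checked numerically, one assertion per step. This is only a sanity check on one made-up population, not a replacement for the argument itself. It assumes ordered samples without replacement, generated with itertools.permutations, and uses fractions.Fraction so the comparisons are exact.

```python
from fractions import Fraction
from itertools import permutations

population = [3, 7, 2, 10]  # hypothetical population with one number added
sample_size = 2

samples = list(permutations(population, sample_size))
means = [Fraction(sum(s), sample_size) for s in samples]

iteration = len(samples)
pop_len = len(population)

# Step 1: total picks shared equally among the elements.
element_iter_times = Fraction(iteration * sample_size, pop_len)

# Step 2: sum of the sample means in terms of element_iter_times.
assert sum(means) == sum(population) * element_iter_times / sample_size

# Step 3: the same sum after substituting element_iter_times.
assert sum(means) == Fraction(sum(population) * iteration, pop_len)

# Step 4: one mean per sample.
assert len(means) == iteration

# Step 5: the mean of the sample means equals the population mean.
assert sum(means) / len(means) == Fraction(sum(population), pop_len)

print("all steps hold")
```

If any step failed for a population, the corresponding assert would raise an AssertionError and point to the step that breaks.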

On a side note, with sample_size = 2 you will find iteration == pop_len ** 2 - pop_len (that is, pop_len * (pop_len - 1)) for sampling without replacement, and iteration == pop_len ** sample_size for sampling with replacement.
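The sample counts can be checked by brute force. The sketch below assumes ordered samples of size 2: the count without replacement should be pop_len ** 2 - pop_len, and the count with replacement should be pop_len ** sample_size. It uses itertools.permutations and itertools.product as stand-ins for the nested loops, with and without the inequality check.

```python
from itertools import permutations, product

sample_size = 2

for pop_len in range(2, 7):
    population = range(pop_len)  # stand-in population; only its length matters

    without_replacement = len(list(permutations(population, sample_size)))
    with_replacement = len(list(product(population, repeat=sample_size)))

    # Ordered pairs without repeats: pop_len * (pop_len - 1).
    assert without_replacement == pop_len ** sample_size - pop_len
    # Ordered pairs with repeats allowed: pop_len ** sample_size.
    assert with_replacement == pop_len ** sample_size

    print(pop_len, without_replacement, with_replacement)
```

The without-replacement formula is specific to sample_size = 2. For larger samples the count is pop_len * (pop_len - 1) * ... with sample_size factors.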

I hope this is an interesting read and helps fellow learners like me who find ‘The mean of the sample means is equal to the population mean’ as confusing but true as it sounds. :joy:

Really great share, Thank you :smile:

Glad you like it! :relaxed:
