Hi @sharathnandalike. Thanks for the clarification. You raise a good point…
for a manageable, smaller dataset, why sample, calculate the mean, and plot the results just to compare the sample mean with the population mean, when that is a cumbersome task?
The reason we do it in this course with the wnba.csv dataset is to develop intuition for two ideas: (1) larger sample sizes generally yield more accurate estimates of the population, and (2) stratified random sampling can yield even more accurate results than simple random sampling. In most of the plots we can draw the population mean as a blue line, which lets us compare every sampling result against the true value computed from the entire population. We do this for teaching purposes, to develop intuition around sampling.
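If it helps to see those two ideas outside of the plots, here is a minimal sketch. It does not use wnba.csv: the population below is synthetic, and the group names, sizes, and values are assumptions made up for illustration. It uses only Python's standard library and measures how far sample means land from the population mean, on average, for each approach.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Synthetic population (an assumption, not the course dataset):
# three groups of different sizes with different typical values.
population = (
    [("guard", random.gauss(200, 40)) for _ in range(600)]
    + [("forward", random.gauss(300, 40)) for _ in range(300)]
    + [("center", random.gauss(400, 40)) for _ in range(100)]
)
values = [value for _, value in population]
population_mean = statistics.mean(values)

# Group the values by stratum once, for stratified sampling.
by_stratum = {}
for label, value in population:
    by_stratum.setdefault(label, []).append(value)


def simple_sample_mean(n):
    """Mean of a simple random sample of size n (without replacement)."""
    return statistics.mean(random.sample(values, n))


def stratified_sample_mean(n):
    """Mean of a proportionally allocated stratified sample of about n."""
    picked = []
    for stratum_values in by_stratum.values():
        k = round(n * len(stratum_values) / len(values))
        picked.extend(random.sample(stratum_values, k))
    return statistics.mean(picked)


def mean_abs_error(sampler, n, trials=500):
    """Average distance between the sample mean and the population mean."""
    errors = [abs(sampler(n) - population_mean) for _ in range(trials)]
    return statistics.mean(errors)


err_simple_small = mean_abs_error(simple_sample_mean, 10)
err_simple_large = mean_abs_error(simple_sample_mean, 100)
err_stratified_large = mean_abs_error(stratified_sample_mean, 100)

print(f"simple, n=10:      {err_simple_small:.2f}")
print(f"simple, n=100:     {err_simple_large:.2f}")
print(f"stratified, n=100: {err_stratified_large:.2f}")
```

With this made-up population, the error should shrink as the sample grows, and shrink further with stratification, because the groups differ a lot from one another. When the strata are similar to each other, the gain from stratifying is much smaller.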
This becomes more relevant when you think about an analyst who works for a company with 50,000 employees around the globe. This analyst would want to collect as many responses as possible, but the number they can collect (or will receive from employees) is limited to what is practical and cost-effective to administer. Given this, they would likely also want to use stratified random sampling when they set up their survey design.
To your next question,
What is the standard total population size for sampling?
There is no standard size for sampling from a population. It all depends on the application. The number of samples required for an academic study might be greater than for a business application, because in business it may be more important to be efficient and cost-effective, whereas in academia precision/accuracy may be the priority.
There is always a balance between how many samples can be collected and the resources available to collect and analyze them. Sampling could be “expensive” in terms of time to collect or time to analyze, or each sample may cost x amount of money to analyze in a laboratory.
You raise a really good question here. Sampling design is often a very important part of the process and a topic that deserves a lot of attention up-front. For example, in my personal experience in graduate school, sampling design was a 12-week course required for all graduate students when they started the program. There are statistical methods for determining sample sizes given a required confidence level and margin of error.
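To give a flavor of what such a method looks like, here is a rough sketch of one common textbook formula for estimating a mean: n = (z * sigma / E)^2, with an optional finite population correction. The function name and the example numbers are hypothetical. It also assumes simple random sampling, a normal approximation, and that you already have a guess for the population standard deviation (for example, from a pilot study).

```python
import math
from statistics import NormalDist


def required_sample_size(sigma, margin, confidence=0.95, population_size=None):
    """Approximate sample size needed to estimate a mean.

    sigma: assumed population standard deviation (a guess or pilot estimate)
    margin: desired margin of error, in the same units as sigma
    confidence: desired confidence level, e.g. 0.95
    population_size: if given, apply the finite population correction
    """
    # Two-sided critical value from the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n0 = (z * sigma / margin) ** 2
    if population_size is not None:
        # Finite population correction: fewer samples are needed
        # when the population itself is small.
        n0 = n0 / (1 + (n0 - 1) / population_size)
    return math.ceil(n0)


# Hypothetical example: sigma of 15, margin of error of 2, 95% confidence.
n_unbounded = required_sample_size(sigma=15, margin=2)
n_company = required_sample_size(sigma=15, margin=2, population_size=50_000)
print(n_unbounded, n_company)
```

Notice that the population size barely changes the answer for a company of 50,000. The required precision and the variability of the data matter much more, which is one reason there is no single standard sample size.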
A Data Scientist here at Dataquest is working on an audit of our statistics curriculum. I’ll share this post with them because this topic is worth considering for future missions/courses. Thanks for your questions!