Sampling in R

This is a slightly off-beat query. After going through some statistics, one question comes to mind regarding descriptive statistics. For example, when we can calculate the population mean anyway and get the right result, why would we sample 100 times, sometimes get an incorrect result, and then also check its spread using plots?
Why go through the cumbersome task of sampling when we already have the population?

Sampling makes sense if we are not able to work with the total population. But in ‘wnba.csv’ we have the population data, and we have already calculated the mean (PTS = points scored) for a season.

My second question is about Variables in Statistics (next mission, screen 2): a comparison table mentions that we may use words for quantitative variables. How?

Hi @sharathnandalike.

Regarding your first question, in the real world it is often difficult or impossible to work with a true population dataset. This depends on the application or field of study, of course. We use the wnba.csv dataset for teaching purposes because we have the whole population: we can draw samples of different sizes from this smaller dataset and compare the results to the known population mean, for example.

Regarding your second question, as stated on that screen:

Height, for example, can be described using real numbers, like in our dataset, but it can also be described using labels like “tall” or “short.”

One way to think of this is that you can use quantitative information to “bin” the data, using the cut() function for example, and then use words to describe each bin.
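As a rough illustration of that binning idea, here is a minimal sketch in base R. The heights and the cut-off values are made up for the example (they are not taken from wnba.csv), so treat the labels as one possible choice rather than a standard.

```r
# Hypothetical heights in centimetres (illustrative values, not from wnba.csv)
heights_cm <- c(165, 172, 180, 185, 191, 198, 203)

# Assumed cut-offs: up to 175 cm is "short", above 175 and up to 190 is
# "medium", above 190 is "tall". cut() uses right-closed intervals by default.
height_label <- cut(
  heights_cm,
  breaks = c(-Inf, 175, 190, Inf),
  labels = c("short", "medium", "tall")
)

# Show each numeric height next to its word label, then count each label
print(data.frame(heights_cm, height_label))
print(table(height_label))
```

The same underlying measurement is now described with words instead of numbers, which is all the comparison table is pointing at.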

Hi Casey,

The first query was not answered completely. I mean, for a manageable, smaller dataset, why sample, calculate the mean, and plot the sample means against the population mean? That is a cumbersome task.

Sampling is fine for a big dataset, I understand. What is the standard total population size for sampling?

Hi @sharathnandalike. Thanks for the clarification. You raise a good point…

for a manageable, smaller dataset, why sample, calculate the mean, and plot the sample means against the population mean? That is a cumbersome task.

The reason we do it in this course with the wnba.csv dataset is to develop intuition around two ideas: (1) larger sample sizes generally yield more accurate estimates of the population, and (2) stratified random sampling can yield even more accurate results than simple random sampling. In most of the plots we are able to draw the population mean as a blue line, which allows us to compare every sampling result to the true value computed from the entire population. We do this purely for teaching purposes, to build intuition around the topic of sampling.
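To make the sample-size point concrete, here is a small sketch in base R. It uses a synthetic population (random numbers standing in for season point totals, not the real wnba.csv values), and the population size of 143, the sample sizes, and the seed are all arbitrary assumptions. It repeats the "sample 100 times" exercise at several sample sizes and reports how far the sample means land from the population mean on average.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Synthetic stand-in for a population of season point totals
# (NOT the real wnba.csv data; size and range are assumptions)
population <- round(runif(143, min = 2, max = 600))
population_mean <- mean(population)

# Draw `n_repeats` simple random samples of size `n` (without replacement)
# and return the mean of each sample
sample_means <- function(n, n_repeats = 100) {
  replicate(n_repeats, mean(sample(population, size = n)))
}

sample_sizes <- c(10, 40, 100)

# Average absolute distance between the sample means and the population mean
mean_abs_error <- sapply(sample_sizes, function(n) {
  mean(abs(sample_means(n) - population_mean))
})

print(data.frame(sample_size = sample_sizes, mean_abs_error))
```

Because the population mean is known here, the error of each sample can be measured directly. With real data that is usually not possible, which is why the exercise is done on a dataset where it is.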

This becomes more relevant when you think of it in the context of an analyst who works for a company with 50,000 employees around the globe. This analyst would want to collect as many responses as possible, but the number they are able to collect (or will receive from employees) is limited to what is practical and cost-effective to administer. Given this, they would likely also want to employ stratified random sampling when they set up their survey design.
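Here is a rough sketch of why stratification can help in a situation like that. Everything in it is invented for illustration: the three regions, their sizes, the satisfaction scores, and the budget of 500 responses. It compares a simple random sample with a proportional stratified sample (each region sampled in proportion to its size) over repeated draws.

```r
set.seed(42)  # arbitrary seed for reproducibility

# Hypothetical company: 50,000 employees in three regions whose typical
# survey scores differ (all numbers invented for illustration)
region <- rep(c("Americas", "Europe", "Asia"), times = c(25000, 15000, 10000))
region_mean <- c(Americas = 60, Europe = 70, Asia = 80)
score <- rnorm(length(region), mean = region_mean[region], sd = 5)
true_mean <- mean(score)

n <- 500  # assumed number of responses the analyst can afford

# Row indices of the employees in each region (the strata)
strata <- split(seq_along(score), region)

# Simple random sample: pick n employees regardless of region
srs_mean <- function() mean(sample(score, size = n))

# Proportional stratified sample: sample each region in proportion to its size
stratified_mean <- function() {
  idx <- unlist(lapply(strata, function(i) {
    sample(i, size = round(n * length(i) / length(score)))
  }))
  mean(score[idx])
}

# Repeat each design 200 times and record the absolute error of the estimate
srs_error <- replicate(200, abs(srs_mean() - true_mean))
strat_error <- replicate(200, abs(stratified_mean() - true_mean))

print(c(simple = mean(srs_error), stratified = mean(strat_error)))
```

The gain comes from the regions differing from each other: stratifying removes the between-region variation from the sampling error. If the strata had similar means, the two designs would perform about the same.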

To your next question,

What is the standard total population size for sampling?

There is no standard size for sampling from a population. It all depends on the application. The number of samples required for an academic study might be greater than for a business application, because in business it may be more important to be efficient and cost-effective, whereas in academia precision/accuracy may be the priority.

There is always a balance between how many samples can be collected versus the resources available to collect and analyze them. Sampling could be “expensive” in terms of time to collect, time to analyze, or perhaps each sample may cost x amount of money to analyze in a laboratory.

You raise a really good question here. Sampling design is often a very important part of the process and a topic that deserves a lot of attention up front. For example, in my personal experience in graduate school, sampling design was a 12-week course required for all graduate students when they started the program. There are statistical methods for determining the sample size needed to achieve a confidence interval of a required width at a given confidence level.
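One common textbook version of such a method, shown here only as a sketch, is the sample size needed to estimate a mean within a chosen margin of error. It assumes the population standard deviation is roughly known (from a pilot study, say) and that the normal approximation is reasonable. The function name `required_n` and the numbers in the example calls are hypothetical.

```r
# Sample size needed to estimate a mean to within `margin` at confidence
# level `conf`, assuming the population standard deviation `sigma` is
# roughly known. `N` is the population size; leave it as Inf to ignore it.
required_n <- function(sigma, margin, conf = 0.95, N = Inf) {
  z  <- qnorm(1 - (1 - conf) / 2)   # two-sided critical value
  n0 <- (z * sigma / margin)^2      # sample size for a very large population
  n  <- n0 / (1 + (n0 - 1) / N)     # finite population correction
  ceiling(n)
}

# Hypothetical inputs: sigma = 15, margin of error = 2, 95% confidence
print(required_n(sigma = 15, margin = 2))           # very large population
print(required_n(sigma = 15, margin = 2, N = 143))  # small population of 143
```

The second call shows that the required sample shrinks when the population itself is small. The formula depends on the precision you need and the variability of the data rather than on any fixed population-size threshold.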

A Data Scientist here at Dataquest is working on an audit of our statistics curriculum. I’ll share this post with them because this topic is worth considering for future missions/courses. Thanks for your questions!

Thanks a lot for your reply & appreciation, Casey.

One more follow-up question: at what population size should we switch to sampling?