Stratified sampling, Screen 4

How do we select the bin intervals in the general case to get good results? Here, they are <13, 13-22, and >22. If we had selected <10, 10-20, and >20, would it work properly?

Hi @sharathnandalike. It is up to the analyst to determine the intervals/bins that are appropriate to the question being asked. For this particular problem, the bins were selected because the data ranges from a minimum of 2 Games_Played to a maximum of 32. Splitting the data into three bins (strata) means that each bin will span approximately 10 games.
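To make the arithmetic concrete, here is a minimal sketch in plain Python (used only for illustration; the helper name `equal_width_edges` is made up for this example). It assumes the minimum of 2 and maximum of 32 quoted above and splits that range into three equal-width bins:

```python
def equal_width_edges(minimum, maximum, n_bins):
    """Return n_bins + 1 evenly spaced edges covering [minimum, maximum]."""
    width = (maximum - minimum) / n_bins
    return [minimum + i * width for i in range(n_bins + 1)]

# Range of Games_Played quoted above: 2 to 32, split into 3 bins.
edges = equal_width_edges(2, 32, 3)
print(edges)  # [2.0, 12.0, 22.0, 32.0]
```

Each bin spans 10 games, with edges close to the cutoffs of 13 and 22 used on the screen.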

Using the bins <10, 10-20, and >20 will give you a different answer because the <10 bin will span only 8 games (2 to 10), whereas the >20 bin will span approximately 12 games (20 to 32). The code will work, but the results will be different. This approach could be valid if the analyst has a sound reason for structuring the bins this way and communicates it to the reader.
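The difference in bin spans can be checked with a small sketch in plain Python (for illustration only; `bin_spans` is a hypothetical helper, and it treats the cutoffs as simple boundaries on the 2 to 32 range rather than handling the strict < and > signs):

```python
def bin_spans(minimum, maximum, cutoffs):
    """Width of each bin when [minimum, maximum] is split at the given cutoffs."""
    edges = [minimum, *cutoffs, maximum]
    return [upper - lower for lower, upper in zip(edges, edges[1:])]

# Proposed bins: <10, 10-20, >20
print(bin_spans(2, 32, [10, 20]))  # [8, 10, 12]

# Bins used on the screen: <13, 13-22, >22
print(bin_spans(2, 32, [13, 22]))  # [11, 9, 10]
```

The proposed cutoffs give noticeably uneven strata (8, 10, and 12 games), which is why the sampled result changes.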

Thanks for asking and let me know if you have any other questions.

Hi Casey,

What is the normal technique for choosing the number of bins when the values are very large, to get good results? For example, points scored (PTS) has large values, in the thousands. An interval of 10 will create a lot of bins, and this will be difficult to analyse.

Hi @sharathnandalike. Good question. There is no one-size-fits-all answer here. The normal technique is that the data analyst determines how to present the data. In his book R for Data Science, Hadley Wickham says in the chapter Visualizing Distributions that:

You should always explore a variety of binwidths when working with histograms, as different binwidths can reveal different patterns.

Instead of binwidths, let me speak for a moment about a related topic: the total number of bins to include in a histogram. The default number of bins created in a ggplot2 histogram is 30. The wnba.csv dataset has fewer than 200 data points, so 30 bins might spread the data too thinly for a dataset of that size. Even the official tidyverse documentation says that 30 bins is not a good default:

By default, the underlying computation (stat_bin()) uses 30 bins; this is not a good default, but the idea is to get you experimenting with different bin widths. You may need to look at a few to uncover the full story behind your data.

In most cases, it is acceptable for you as the analyst to make a determination about the binwidth, or the number of bins, best suited to the data that you are working with. There are, however, statistical approaches for determining the number of bins. This post even provides code to implement one such rule, known as the Freedman-Diaconis rule.
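As a rough illustration of the Freedman-Diaconis rule, the bin width is 2 × IQR / n^(1/3), where IQR is the interquartile range and n is the number of observations. The sketch below is plain Python, not the code from the linked post; the function name, the quartile method, and the sample values are all assumptions made for this example:

```python
import math
import statistics

def freedman_diaconis(values):
    """Return (bin_width, n_bins) using the Freedman-Diaconis rule."""
    n = len(values)
    # Quartiles by linear interpolation; other quartile definitions
    # will give slightly different widths.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; the rule gives no usable bin width")
    width = 2 * iqr / n ** (1 / 3)
    n_bins = math.ceil((max(values) - min(values)) / width)
    return width, n_bins

# Hypothetical values for illustration only, not taken from wnba.csv.
points = [0, 10, 20, 30, 40, 50, 60, 100]
width, n_bins = freedman_diaconis(points)
print(width, n_bins)  # width is about 35, giving 3 bins
```

Because the width scales with the spread of the data rather than being a fixed number like 10, a column such as PTS with values in the thousands gets correspondingly wider bins instead of hundreds of narrow ones.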

I hope this helps. Let me know if you have any further questions.
