Stratified Sampling

Screen Link:
I want to know how I can improve this code. Any tips? Also, I chose sample sizes of 7, 7, and 1, but I don’t like the results.

My Code:

min_in = wnba['MIN'].value_counts(bins = 3, normalize = True)* 100

under_347 = wnba[wnba["MIN"] <= 347.333]    
btw_347_682 = wnba[(wnba["MIN"] > 347.333) & (wnba["MIN"] <= 682.667)]
above_682 = wnba[wnba["MIN"] > 682.667]

proptions = []
for i in range(100):
    sample_under = under_347["PTS"].sample(7,random_state=i)
    sample_btw = btw_347_682["PTS"].sample(7,random_state=i)
    sample_above = above_682["PTS"].sample(1,random_state=i)
    final_sample = pd.concat([sample_under,sample_btw,sample_above])


What I expected to happen:

What actually happened: 


Hi @hshf1992

Please provide a mission link and tag your topic per these guidelines so that we can better assist you. In addition, please fill in what you expected and the result of the code above. Thanks!

Hi @hshf1992

Can you please add a link to the mission the next time you post a question (see the guidelines)? This makes things easier for everybody. Thank you.

I assume you are referring to Choosing the right strata.

Choosing the number of samples per bin: You want to do this proportionally, meaning that each stratum's share of the total sample size matches that group's (bin's) share of the total number of observations in your original data.

relative_sizes = wnba['MIN'].value_counts(bins = 3, normalize = True).sort_index()

(10.994, 347.333]     0.335664
(347.333, 682.667]    0.349650
(682.667, 1018.0]     0.31468

This tells you that each group accounts for roughly 1/3 of the observations in the data. Consequently, I would start by drawing equal-sized samples from each stratum. Say you want a sample of 15 observations in total: that means a sample size of 5 for each stratum.
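To make the arithmetic explicit, here is a small sketch of proportional allocation using the proportions printed above. The total of 15 is just the example figure from this post, and the dictionary keys are only labels copied from the output.

```python
# Bin proportions as reported by value_counts(normalize=True) above.
relative_sizes = {
    "(10.994, 347.333]": 0.335664,
    "(347.333, 682.667]": 0.349650,
    "(682.667, 1018.0]": 0.31468,
}

total_sample_size = 15  # assumed total, matching the example in this post

# Proportional allocation: each stratum gets its share of the total sample,
# rounded to the nearest whole observation.
sample_sizes = {
    stratum: round(share * total_sample_size)
    for stratum, share in relative_sizes.items()
}
print(sample_sizes)
```

With these proportions every stratum rounds to 5. For other proportions the rounded sizes may not add up exactly to the total, so it is worth checking the sum and adjusting one stratum by hand if needed.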

For the code: if you approach this problem as exemplified in your/the DQ solution, then I don’t see much room for improvement. That being said, you could try a different approach. Instead of creating 3 separate dataframes, one for each bin, you could create a variable that indicates the bin/group membership of every observation using pd.cut() (see the documentation).

wnba['Min Cat'] = pd.cut(wnba['MIN'], bins=3)

Then you could iterate over all possible groups, subset the original dataframe accordingly and draw your samples.

wnba_selection = wnba[wnba['Min Cat'] == group]
sample = wnba_selection['PTS'].sample(prop, random_state=i)

Finally, you just need to tie everything together and calculate the means, similar to what you did in your code.
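A rough sketch of how these pieces could fit together is below. It is not the DQ solution. Since the real data is not included in this post, the wnba dataframe here is a randomly generated stand-in, and the names n_per_stratum, parts, and sample_means are my own choices for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the wnba dataframe (the real data is not shown here).
rng = np.random.default_rng(0)
wnba = pd.DataFrame({
    "MIN": rng.integers(11, 1019, size=150),
    "PTS": rng.integers(0, 600, size=150),
})

# Label every row with its bin instead of building three separate dataframes.
wnba["Min Cat"] = pd.cut(wnba["MIN"], bins=3)

n_per_stratum = 5  # equal sizes, assuming each bin holds roughly 1/3 of the rows
sample_means = []
for i in range(100):
    parts = []
    # Draw the same number of PTS values from each bin.
    for group in wnba["Min Cat"].cat.categories:
        wnba_selection = wnba[wnba["Min Cat"] == group]
        parts.append(wnba_selection["PTS"].sample(n_per_stratum, random_state=i))
    final_sample = pd.concat(parts)
    sample_means.append(final_sample.mean())
```

After the loop, sample_means holds 100 stratified sample means that you can plot or compare against wnba["PTS"].mean(). If the bins were not roughly equal in size, you would replace n_per_stratum with a per-bin size computed from the proportions.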

If you want to play around, ask if you need help. If you are interested in the full code, I can provide this as well.


I will do that next time, thanks.