283-9: Choosing the Right Strata

Screen Link: https://app.dataquest.io/m/283/sampling/9/choosing-the-right-strata

My Code:

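import pandas as pd
import matplotlib.pyplot as plt

# wnba is the DataFrame already loaded in the exercise
print(wnba['MIN'].value_counts(bins = 3, normalize = True)*100)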

bin1 = wnba[wnba["MIN"]<=347]
bin2 = wnba[(wnba["MIN"]>347) & (wnba["MIN"]<=682)]
bin3 = wnba[wnba["MIN"]>682]

print("bin1: ",len(bin1)) #4.2
print("bin2: ",len(bin2)) #4.0
print("bin3: ",len(bin3)) #3.7

stratum = [(bin1, 4),(bin2, 4), (bin3,4)]


means = []

for s in range(100):
    sample1 = bin1["PTS"].sample(4, random_state=s)
    sample2 = bin2["PTS"].sample(4, random_state=s)
    sample3 = bin3["PTS"].sample(4, random_state=s)
    result = pd.concat([sample1, sample2, sample3])
    means.append(result.mean())
        
print(len(means))

plt.scatter(x = range(1,101), y=means)
plt.axhline(wnba.PTS.mean())
plt.show()

My plot doesn't seem to match the plot given by DQ. I'm guessing I made a mistake while sampling (the loop itself should be OK).
As I understand it, all three bins hold roughly the same proportion of players. Since the sample size is supposed to be 12, I figured I had to take 4 observations from each bin.
Is anyone able to help me out with this?

Much appreciated
many thanks in advance
Marina
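
The reasoning above (a sample of 12, three strata of roughly equal proportion, so 4 from each) can be written as a small helper. This is only a sketch: proportional_allocation is a hypothetical helper name, the stratum sizes in the example are made up rather than taken from the dataset, and largest-remainder rounding is just one possible way to deal with fractional counts.

```python
def proportional_allocation(stratum_sizes, sample_size):
    """Split sample_size across strata in proportion to their sizes."""
    total = sum(stratum_sizes)
    # Exact (possibly fractional) share of the sample for each stratum.
    exact = [sample_size * size / total for size in stratum_sizes]
    # Start from the rounded-down counts.
    counts = [int(share) for share in exact]
    leftover = sample_size - sum(counts)
    # Give the remaining slots to the strata with the largest fractional parts.
    by_fraction = sorted(
        range(len(exact)),
        key=lambda idx: exact[idx] - counts[idx],
        reverse=True,
    )
    for idx in by_fraction[:leftover]:
        counts[idx] += 1
    return counts


# Hypothetical stratum sizes (assumption, not read from the wnba data).
allocation = proportional_allocation([48, 50, 45], 12)
print(allocation)  # [4, 4, 4]
```

With three strata of similar size the helper returns 4 per stratum, which agrees with taking 4 observations from each bin.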

Hi Marina,

I don’t have the code the content author used to generate those plots. However, here is some code that closely reproduces them:

Stratified Sampling (Games Played), Sample = 10:

under_12 = wnba[wnba['Games Played'] <= 12]
btw_13_22 = wnba[(wnba['Games Played'] > 12) & (wnba['Games Played'] <= 22)]
over_23 = wnba[wnba['Games Played'] > 22]

proportional_sampling_means = []

for i in range(100):
    sample_under_12 = under_12['PTS'].sample(1, random_state = i)
    sample_btw_13_22 = btw_13_22['PTS'].sample(2, random_state = i)
    sample_over_23 = over_23['PTS'].sample(7, random_state = i)
    
    final_sample = pd.concat([sample_under_12, sample_btw_13_22, sample_over_23])
    proportional_sampling_means.append(final_sample.mean())
    
plt.scatter(range(1,101), proportional_sampling_means)
plt.axhline(wnba['PTS'].mean())
plt.axis([-5, 105, 100, 350])

Simple Random Sampling, Sample = 10:

sampling_means = []

for i in range(100):
    final_sample = wnba['PTS'].sample(10, random_state = i)
    sampling_means.append(final_sample.mean())
    
plt.scatter(range(1,101), sampling_means)
plt.axhline(wnba['PTS'].mean())

Stratified Sampling (Minutes Played), Sample = 12:

# Strata based on minutes played
low_min = wnba[wnba['MIN'] <= 350]
mid_min = wnba[(wnba['MIN'] > 350) & (wnba['MIN'] <= 700)]
high_min = wnba[wnba['MIN'] > 700]

proportional_sampling_means = []

for i in range(100):
    # 4 players from each stratum, 12 in total
    sample_low = low_min['PTS'].sample(4, random_state = i)
    sample_mid = mid_min['PTS'].sample(4, random_state = i)
    sample_high = high_min['PTS'].sample(4, random_state = i)
    
    final_sample = pd.concat([sample_low, sample_mid, sample_high])
    proportional_sampling_means.append(final_sample.mean())
    
plt.scatter(range(1,101), proportional_sampling_means)
plt.axhline(wnba['PTS'].mean())
plt.axis([-5, 105, 100, 350])

Simple Random Sampling, Sample = 12:

sampling_means = []

for i in range(100):
    final_sample = wnba['PTS'].sample(12, random_state = i)
    sampling_means.append(final_sample.mean())
    
plt.scatter(range(1,101), sampling_means)
plt.axhline(wnba['PTS'].mean())
plt.axis([-5, 105, 100, 350])

Stratified Sampling (Games Played), Sample = 12:

under_12 = wnba[wnba['Games Played'] <= 12]
btw_13_22 = wnba[(wnba['Games Played'] > 12) & (wnba['Games Played'] <= 22)]
over_23 = wnba[wnba['Games Played'] > 22]

proportional_sampling_means = []

for i in range(100):
    sample_under_12 = under_12['PTS'].sample(1, random_state = i)
    sample_btw_13_22 = btw_13_22['PTS'].sample(2, random_state = i)
    sample_over_23 = over_23['PTS'].sample(9, random_state = i)
    
    final_sample = pd.concat([sample_under_12, sample_btw_13_22, sample_over_23])
    proportional_sampling_means.append(final_sample.mean())
    
plt.scatter(range(1,101), proportional_sampling_means)
plt.axhline(wnba['PTS'].mean())
plt.axis([-5, 105, 100, 350])

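Besides comparing the scatter plots by eye, one can also measure how far the 100 sample means spread around the population mean. The sketch below is only an illustration under stated assumptions: it uses synthetic data rather than the WNBA dataset, with a made-up MIN column and a PTS column that is assumed to grow with MIN. The names players, low, mid and high are hypothetical, and the seeds are reused across strata in the same way as in the snippets above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the dataset: 143 made-up players (assumption).
rng = np.random.default_rng(0)
minutes = rng.permutation(np.linspace(10, 1000, 143))
# Assumed relationship: points grow with minutes, plus some noise.
points = 0.4 * minutes + rng.normal(0, 20, size=143)
players = pd.DataFrame({"MIN": minutes, "PTS": points})

# Three strata by minutes played, same cut points as in the thread.
low = players[players["MIN"] <= 350]
mid = players[(players["MIN"] > 350) & (players["MIN"] <= 700)]
high = players[players["MIN"] > 700]

stratified_means = []
srs_means = []
for i in range(100):
    # Stratified sample: 4 players per stratum, 12 in total.
    stratified = pd.concat([
        low["PTS"].sample(4, random_state=i),
        mid["PTS"].sample(4, random_state=i),
        high["PTS"].sample(4, random_state=i),
    ])
    stratified_means.append(stratified.mean())
    # Simple random sample of the same total size.
    srs_means.append(players["PTS"].sample(12, random_state=i).mean())

# Standard deviation of the sample means: smaller means tighter around the mean.
stratified_spread = float(np.std(stratified_means))
srs_spread = float(np.std(srs_means))
print("stratified spread:", round(stratified_spread, 1))
print("simple random spread:", round(srs_spread, 1))
```

On data where points depend strongly on minutes played, the stratified means should show the smaller spread. How large the gap is on the real dataset depends on how strongly PTS and MIN are actually related.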
Hope this helps :)

Best,
Sahil

Thanks Sahil! That helps a lot :)

I am not sure this calculates the mean of ‘PTS’. The final_sample is still a dataframe, so all the numerical variables would be averaged. I assume this is just a typo, since otherwise the plot wouldn’t work. Or am I missing something here?
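
On the question above: each sample in the snippets is drawn from wnba['PTS'], which is a Series, so pd.concat should also return a Series and .mean() should return a single number. A minimal sketch with made-up values (the small DataFrame here is hypothetical, not the WNBA data):

```python
import pandas as pd

# Tiny made-up frame standing in for the dataset (assumption).
df = pd.DataFrame({"PTS": [10, 20, 30, 40], "MIN": [100, 200, 300, 400]})

# Selecting a single column gives a Series, and so do its slices.
part_a = df["PTS"].iloc[:2]
part_b = df["PTS"].iloc[2:]

# Concatenating Series along the default axis gives a Series again.
combined = pd.concat([part_a, part_b])
print(type(combined).__name__)  # Series
print(combined.mean())          # 25.0

# For contrast: the mean of a whole DataFrame is one value per column.
frame_mean = df.mean()
print(frame_mean)
```

The averaging over every numerical column would only happen if whole rows were sampled, for example with wnba.sample(...) instead of wnba['PTS'].sample(...).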