# 283-9: Choosing the Right Strata

My Code:

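```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumes `wnba` is already loaded as a DataFrame.
# Share of players (in %) in each of three equal-width MIN bins.
print(wnba['MIN'].value_counts(bins = 3, normalize = True)*100)

# Split the players into three strata by minutes played.
bin1 = wnba[wnba["MIN"]<=347]
bin2 = wnba[(wnba["MIN"]>347) & (wnba["MIN"]<=682)]
bin3 = wnba[wnba["MIN"]>682]

print("bin1: ",len(bin1)) #4.2
print("bin2: ",len(bin2)) #4.0
print("bin3: ",len(bin3)) #3.7

stratum = [(bin1, 4),(bin2, 4), (bin3,4)]

means = []

# Draw 100 stratified samples of 12 (4 per stratum) and store each sample mean.
for s in range(100):
    sample1 = bin1["PTS"].sample(4, random_state=s)
    sample2 = bin2["PTS"].sample(4, random_state=s)
    sample3 = bin3["PTS"].sample(4, random_state=s)
    result = pd.concat([sample1, sample2, sample3])
    means.append(result.mean())

print(len(means))

# Sample means against the population mean of PTS.
plt.scatter(x = range(1,101), y=means)
plt.axhline(wnba.PTS.mean())
plt.show()
```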

My plot doesn't seem to match the plot given by DQ. I'm guessing I made a mistake while sampling (the loop itself should be OK).
As I understand it, all three bins hold about the same proportion of players. Since the sample size is supposed to be 12, I figured I had to take 4 samples from each bin.
Is anyone able to help me out with this?

Much appreciated
many thanks in advance
Marina
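
The 4/4/4 split can be sanity-checked with a little arithmetic. The sketch below is only an illustration: `proportional_allocation` is a made-up helper, the stratum sizes are hypothetical rather than the real counts from `wnba`, and largest-remainder rounding is just one reasonable way to turn fractional shares into whole numbers.

```python
def proportional_allocation(stratum_sizes, sample_size):
    """Split sample_size across strata in proportion to their sizes.

    Hypothetical helper: uses largest-remainder rounding so the
    per-stratum counts always add up to sample_size.
    """
    total = sum(stratum_sizes.values())
    # Ideal (fractional) share of the sample for each stratum.
    quotas = {name: sample_size * size / total
              for name, size in stratum_sizes.items()}
    # Start from the integer part of each quota.
    counts = {name: int(quota) for name, quota in quotas.items()}
    # Give the leftover slots to the strata with the largest fractional parts.
    leftover = sample_size - sum(counts.values())
    by_remainder = sorted(quotas,
                          key=lambda name: quotas[name] - counts[name],
                          reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


# Hypothetical stratum sizes, for illustration only (not the real WNBA counts).
example_sizes = {"low": 50, "mid": 48, "high": 45}
allocation = proportional_allocation(example_sizes, 12)
print(allocation)
```

With strata this close in size, the fractional shares are roughly 4.2, 4.0 and 3.8, so rounding to 4 per stratum looks reasonable. With the real data, the sizes would come from `len(bin1)`, `len(bin2)` and `len(bin3)`.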

Hi Marina,

I don’t have the code the content author used to generate those plots. However, here is some code that closely resembles it:

Stratified Sampling (Games Played), Sample = 10:
```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumes `wnba` is already loaded as a DataFrame.
# Three strata by number of games played.
under_12 = wnba[wnba['Games Played'] <= 12]
btw_13_22 = wnba[(wnba['Games Played'] > 12) & (wnba['Games Played'] <= 22)]
over_23 = wnba[wnba['Games Played'] > 22]

proportional_sampling_means = []

# 100 stratified samples of 10 players each (1 + 2 + 7).
for i in range(100):
    sample_under_12 = under_12['PTS'].sample(1, random_state = i)
    sample_btw_13_22 = btw_13_22['PTS'].sample(2, random_state = i)
    sample_over_23 = over_23['PTS'].sample(7, random_state = i)

    final_sample = pd.concat([sample_under_12, sample_btw_13_22, sample_over_23])
    proportional_sampling_means.append(final_sample.mean())

plt.scatter(range(1,101), proportional_sampling_means)
plt.axhline(wnba['PTS'].mean())
plt.axis([-5, 105, 100, 350])
```

Simple Random Sampling, Sample = 10:
```python
sampling_means = []

for i in range(100):
    final_sample = wnba['PTS'].sample(10, random_state = i)
    sampling_means.append(final_sample.mean())

plt.scatter(range(1,101), sampling_means)
plt.axhline(wnba['PTS'].mean())
```

Stratified Sampling (Minutes Played), Sample = 12:
```python
# Variable names are carried over from the Games Played version;
# here the three strata are defined by minutes played (MIN).
under_12 = wnba[wnba['MIN'] <= 350]
btw_13_22 = wnba[(wnba['MIN'] > 350) & (wnba['MIN'] <= 700)]
over_23 = wnba[wnba['MIN'] > 700]

proportional_sampling_means = []

# 100 stratified samples of 12 players each (4 per stratum).
for i in range(100):
    sample_under_12 = under_12['PTS'].sample(4, random_state = i)
    sample_btw_13_22 = btw_13_22['PTS'].sample(4, random_state = i)
    sample_over_23 = over_23['PTS'].sample(4, random_state = i)

    final_sample = pd.concat([sample_under_12, sample_btw_13_22, sample_over_23])
    proportional_sampling_means.append(final_sample.mean())

plt.scatter(range(1,101), proportional_sampling_means)
plt.axhline(wnba['PTS'].mean())
plt.axis([-5, 105, 100, 350])
```

Simple Random Sampling, Sample = 12:
```python
sampling_means = []

for i in range(100):
    final_sample = wnba['PTS'].sample(12, random_state = i)
    sampling_means.append(final_sample.mean())

plt.scatter(range(1,101), sampling_means)
plt.axhline(wnba['PTS'].mean())
plt.axis([-5, 105, 100, 350])
```

Stratified Sampling (Games Played), Sample = 12:
```python
under_12 = wnba[wnba['Games Played'] <= 12]
btw_13_22 = wnba[(wnba['Games Played'] > 12) & (wnba['Games Played'] <= 22)]
over_23 = wnba[wnba['Games Played'] > 22]

proportional_sampling_means = []

for i in range(100):
    sample_under_12 = under_12['PTS'].sample(1, random_state = i)
    sample_btw_13_22 = btw_13_22['PTS'].sample(2, random_state = i)
    sample_over_23 = over_23['PTS'].sample(9, random_state = i)

    final_sample = pd.concat([sample_under_12, sample_btw_13_22, sample_over_23])
    proportional_sampling_means.append(final_sample.mean())

plt.scatter(range(1,101), proportional_sampling_means)
plt.axhline(wnba['PTS'].mean())
plt.axis([-5, 105, 100, 350])
```

Hope this helps.

Best,
Sahil

Thanks Sahil! That helps a lot.

I am not sure this is calculating the mean for ‘PTS’ here. The `final_sample` is still a dataframe, so all the numerical variables will be averaged. I’m sure this is just a typo, or the plot wouldn’t work. Or am I missing something here?
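
One way to check this is to run the same pattern on a tiny made-up DataFrame. As far as I can tell, because `['PTS']` is selected before `.sample()`, each piece is a Series rather than a DataFrame, so `pd.concat` should return a Series and `.mean()` should be a single number for PTS only. A minimal sketch (the `toy` frame and its values are invented purely for illustration):

```python
import pandas as pd

# Tiny made-up table standing in for the wnba DataFrame (illustrative values only).
toy = pd.DataFrame({
    "PTS": [10, 20, 30, 40, 50, 60],
    "MIN": [100, 200, 300, 400, 500, 600],
})

# Selecting one column first gives a Series, and sampling a Series returns a Series.
part_a = toy["PTS"].sample(2, random_state=0)
part_b = toy["PTS"].sample(1, random_state=1)

# Concatenating Series along the default axis yields another Series.
final_sample = pd.concat([part_a, part_b])
sample_mean = final_sample.mean()

print(type(final_sample).__name__)  # container type of the combined sample
print(sample_mean)                  # one number: the mean of the sampled PTS values
```

If the column were selected after sampling instead (for example `toy.sample(2)` with no `['PTS']`), the pieces would be DataFrames and `.mean()` would then average every numeric column.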