Wrong plot at "Choosing the right strata-Sampling (screen 9)" and duplicate data

Screen Link: https://app.dataquest.io/m/283/sampling/9/choosing-the-right-strata
Hello. I’m not getting the plot I expected.

My Code:

strat1=wnba[wnba['MIN']<=347]
strat2=wnba[(wnba['MIN']>348) & (wnba['MIN']<=682)]
strat3=wnba[wnba['MIN']>682]
proportional_sampling_means=[]
for i in range(100):
    sample1=strat1.sample(4,random_state=i)
    sample2=strat2.sample(4,random_state=i)
    sample3=strat3.sample(4,random_state=i)
    
    final_sample=pd.concat([sample1,sample2,sample3])
    proportional_sampling_means.append(final_sample['PTS'].mean())
plt.scatter(range(1,101),proportional_sampling_means)
plt.axhline(wnba['PTS'].mean())

What I expected to happen:
[Image: Screenshot-2019-10-17-at-2-41-51-PM, the expected plot]

What actually happened:
[Image: "reality", the plot my code produced]

I found this post, "283-9: Choosing the Right Strata", in which Sahil provided the following code that produces the expected plot:

under_12 = wnba[wnba['MIN'] <= 350]
btw_13_22 = wnba[(wnba['MIN'] > 350) & (wnba['MIN'] <= 700)]
over_23 = wnba[wnba['MIN'] > 700]

proportional_sampling_means = []

for i in range(100):
    sample_under_12 = under_12['PTS'].sample(4, random_state = i)
    sample_btw_13_22 = btw_13_22['PTS'].sample(4, random_state = i)
    sample_over_23 = over_23['PTS'].sample(4, random_state = i)
    
    final_sample = pd.concat([sample_under_12, sample_btw_13_22, sample_over_23])
    proportional_sampling_means.append(final_sample.mean())
    
plt.scatter(range(1,101), proportional_sampling_means)
plt.axhline(wnba['PTS'].mean())
plt.axis([-5, 105, 100, 350])

Before looking at Sahil’s code, I tried increasing the sample size because I thought that would lower the variation, but it didn’t change how the points scatter. Can someone clarify what’s wrong with my train of thought here as well? Answer: I discovered that I simply hadn’t increased it enough. Which brings me to my next question:

Side question: With this sampling method, there is a chance of getting the same data over and over again, right? Isn’t this bad for the analysis? Also, I suppose that increasing the sample size will increase the chance of getting duplicate data. Is that correct?

Every input is greatly appreciated. Thx =)
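
On the sample-size and duplicate-data questions, here is a rough sketch rather than a definitive answer. It uses a made-up stratum (the `stratum` DataFrame, its `player_id` column, and the helper `spread_of_sample_means` are all hypothetical, not part of the course data). It suggests three things: the spread of the sample means shrinks once the sample is large enough; `sample` does not repeat a row within one sample, since `replace=False` is the default; and the same players do reappear across the 100 samples, because every draw starts from the full stratum again.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for one stratum of wnba: 60 players with made-up points.
rng = np.random.default_rng(0)
stratum = pd.DataFrame({"player_id": range(60), "PTS": rng.integers(0, 600, size=60)})

def spread_of_sample_means(df, n, repeats=100):
    """Standard deviation of the means of `repeats` samples of size n."""
    means = [df["PTS"].sample(n, random_state=i).mean() for i in range(repeats)]
    return float(np.std(means))

# 1) Sample size: larger samples give means that vary less.
spread_small = spread_of_sample_means(stratum, 4)
spread_large = spread_of_sample_means(stratum, 40)

# 2) Within one sample: replace=False is the default, so no row repeats.
one_sample = stratum.sample(4, random_state=0)
no_repeats_within_sample = one_sample["player_id"].is_unique

# 3) Across samples: each draw starts from the full stratum, so players reappear.
seen = []
for i in range(100):
    seen.extend(stratum.sample(4, random_state=i)["player_id"])
players_drawn_more_than_once = int(pd.Series(seen).value_counts().gt(1).sum())

# 4) Asking for more rows than the stratum has fails unless replace=True.
try:
    stratum.sample(61, random_state=0)
    oversample_failed = False
except ValueError:
    oversample_failed = True

print(spread_small, spread_large)
print(no_repeats_within_sample, players_drawn_more_than_once, oversample_failed)
```

Seeing the same player in several of the 100 samples is expected here: each sample is a separate draw from the same population, and the scatter plot is meant to show how the means of such draws vary.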

Hi, I just thought I’d make a quick comment on your graph. One reason your graph doesn’t match is that the x- and y-axis limits differ. Try using this instead:

 plt.xlim(-5, 105)
 plt.ylim(100, 350)

As for the other questions, I’m not sure. I am curious, though: why do your cutoff points differ from Sahil’s (he used 350, 700, and >700)?

Jeeeeeeeeeeeesus, that is correct. My plot was just zoomed in…
About the cutoff points: I used the literal boundary value for each stratum, and he rounded them up.

Not marking as solution just yet because of the other questions. Thanks for showing me what was wrong =)))

Actually, on another look, the plots still seem to differ. Sahil’s plot has its lowest point at x=20, while mine has its lowest point at x=60. :thinking:
[Image: doubt]
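
One possible explanation for the remaining difference, sketched on made-up data (the `players` DataFrame below is hypothetical, not the real wnba dataset): the two snippets do not put the same rows in each stratum. With `MIN <= 347` followed by `MIN > 348`, a player with exactly 348 minutes lands in no stratum at all, and the 347/682 cutoffs give strata of different sizes than 350/700. Since `random_state=i` picks row positions within a stratum, a stratum of a different size can yield different players for the same seed. By contrast, sampling whole rows versus sampling only the `PTS` column appears to make no difference when the stratum and seed are the same.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for wnba: minutes 300, 302, ..., 398 with made-up points.
players = pd.DataFrame({"MIN": np.arange(300, 400, 2)})
players["PTS"] = players["MIN"] // 2

# Cutoffs from the first snippet (only the two lower strata are shown).
strat1 = players[players["MIN"] <= 347]
strat2 = players[(players["MIN"] > 348) & (players["MIN"] <= 682)]

# Cutoffs from Sahil's snippet.
under_12 = players[players["MIN"] <= 350]
btw_13_22 = players[(players["MIN"] > 350) & (players["MIN"] <= 700)]

# The player with exactly 348 minutes falls into neither strat1 nor strat2.
dropped = len(players) - (len(strat1) + len(strat2))
covered = len(under_12) + len(btw_13_22) == len(players)

# Same stratum and seed: sampling rows then taking PTS matches sampling PTS directly.
same_rows = under_12.sample(4, random_state=0)["PTS"].equals(
    under_12["PTS"].sample(4, random_state=0)
)

# Different stratum sizes: the same seed may select different players.
print(len(strat1), strat1.sample(4, random_state=0).index.tolist())
print(len(under_12), under_12.sample(4, random_state=0).index.tolist())
print(dropped, covered, same_rows)
```

If this is what is happening, using identical cutoffs in both versions should make the two plots agree point for point.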

Hi,

Just to say I’m having a similar issue: my plot doesn’t show the same data as Sahil’s either.

[Image: plot produced by my code]

My code is similar to yours as far as I can see, but the results are different again!