Screen Link:
My Code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
strata_1 = wnba[wnba['MIN']<=347]
strata_2 = wnba[(wnba['MIN']>347) & (wnba['MIN']<=683)]
strata_3 = wnba[wnba['MIN']>683]
sample_means = []
for i in range(100):
    sample_1 = strata_1['PTS'].sample(4,random_state=i)
    sample_2 = strata_2['PTS'].sample(4,random_state=i)
    sample_3 = strata_3['PTS'].sample(4,random_state=i)
    final_sample = pd.concat([sample_1,sample_2,sample_3])
    sample_means.append(final_sample.mean())
# Outside loop
plt.scatter(x=np.arange(1,101),y=sample_means)
plt.axhline(y=wnba['PTS'].mean())
plt.title('Minutes Played')
print(sample_means)
What I expected to happen:
I expected the scatterplot for the stratified samples above to show less variability.
What actually happened:
The correlations are:
wnba['PTS'] and wnba['Games Played'] = 0.579
wnba['PTS'] and wnba['MIN'] = 0.911
Since PTS correlates much more strongly with MIN than with Games Played, the scatterplot for the stratified sampling above should show less variability. Instead, it shows more variability than the scatterplot for the sampling stratified by Games Played, even though Games Played has the weaker correlation with PTS.
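
One way to compare the variability without judging it by eye might be to compute the standard deviation of the 100 sample means. The sketch below is only an illustration: `demo` is a synthetic stand-in I made up (150 rows, PTS roughly proportional to MIN plus noise), not the real wnba data, and the comparison against a simple random sample of the same size is my own addition. The cut points 347 and 683 and the 4-per-stratum sampling are the same as in my code.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the wnba DataFrame (assumption: NOT the real data).
rng = np.random.default_rng(0)
minutes = rng.uniform(0, 1000, size=150)
points = 0.4 * minutes + rng.normal(0, 40, size=150)
demo = pd.DataFrame({'MIN': minutes, 'PTS': points})

# Same cut points as in the original code.
strata = [
    demo[demo['MIN'] <= 347],
    demo[(demo['MIN'] > 347) & (demo['MIN'] <= 683)],
    demo[demo['MIN'] > 683],
]

stratified_means = []
srs_means = []
for i in range(100):
    # Four rows from each stratum, then the mean of the 12 pooled values.
    parts = [s['PTS'].sample(4, random_state=i) for s in strata]
    stratified_means.append(pd.concat(parts).mean())
    # Simple random sample of the same total size (12) for comparison.
    srs_means.append(demo['PTS'].sample(12, random_state=i).mean())

# The standard deviation of the sample means is one number per scheme.
stratified_spread = np.std(stratified_means)
srs_spread = np.std(srs_means)
print('stratified by MIN:', stratified_spread)
print('simple random    :', srs_spread)
```

Running the same loop on the real data, once per stratification variable, would give one number per scatterplot to compare.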