Cluster Sampling: It's better to turn a list into a data frame than use df1.append(df2) 283-10

Screen Link:

My Code:

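import pandas as pd

# wnba is assumed to be the data frame already loaded earlier in the mission
# Randomly pick 4 teams; each team is one cluster
teams = pd.Series(wnba['Team'].unique()).sample(4, random_state=0)
clusters = []

# Collect each sampled team's rows in a list, then concatenate once at the end
for team in teams:
    cluster = wnba[wnba['Team'] == team]
    clusters.append(cluster)

data = pd.concat(clusters, ignore_index=True)

# Sampling error = population mean - sample mean
sampling_error_height = wnba['Height'].mean() - data['Height'].mean()
sampling_error_age = wnba['Age'].mean() - data['Age'].mean()
sampling_error_BMI = wnba['BMI'].mean() - data['BMI'].mean()
sampling_error_points = wnba['PTS'].mean() - data['PTS'].mean()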

Growing a list is less resource intensive than growing a data frame.

Isn’t it better to grow a list and transform it into a data frame than to grow a data frame via df1.append(df2)?

Yes, that approach would be more efficient in this case. I can’t confirm which one would be more memory intensive (if either), but yours would definitely be faster.
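
To make the comparison concrete, here is a minimal sketch of both approaches on a made-up frame. The names `toy`, `picked`, `from_list` and `grown` are assumptions for illustration, not from the mission. Since DataFrame.append was removed in pandas 2.0, a repeated pd.concat stands in for it here; like append, it builds a whole new data frame on every iteration.

```python
import pandas as pd

# Hypothetical toy data standing in for the wnba data frame
toy = pd.DataFrame({
    "Team": ["A", "A", "B", "B", "C", "C"],
    "Height": [180, 185, 190, 175, 183, 188],
})
picked = ["A", "C"]  # hypothetical sampled clusters

# Approach 1: grow a plain list, then concatenate once at the end
parts = [toy[toy["Team"] == team] for team in picked]
from_list = pd.concat(parts, ignore_index=True)

# Approach 2: grow the data frame itself, copying it on every iteration
grown = None
for team in picked:
    piece = toy[toy["Team"] == team]
    grown = piece if grown is None else pd.concat([grown, piece], ignore_index=True)

# Both approaches give the same rows; only the amount of copying differs
print(from_list.equals(grown))
```

With 4 teams the difference is probably negligible; it matters more when the loop runs many times, because approach 2 re-copies all the rows gathered so far on each pass.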


Yes, df.append will get increasingly slow, since every call copies the whole data frame.
You can avoid lists and just df.query the 4 teams you want.
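
A rough sketch of what that could look like, again on a made-up frame (`toy`, `picked`, `via_isin` and `via_query` are hypothetical names). It shows both a boolean mask with isin and the equivalent df.query call, where `@picked` refers to a local Python variable.

```python
import pandas as pd

# Hypothetical toy data standing in for the wnba data frame
toy = pd.DataFrame({
    "Team": ["A", "A", "B", "B", "C", "C", "D"],
    "Height": [180, 185, 190, 175, 183, 188, 192],
})

# Sample 2 of the 4 teams, the same way the original code samples 4
teams = pd.Series(toy["Team"].unique()).sample(2, random_state=0)
picked = teams.tolist()

# Option 1: boolean mask, no loop and no list of clusters
via_isin = toy[toy["Team"].isin(picked)]

# Option 2: the same filter written with query
via_query = toy.query("Team in @picked")

print(via_isin.equals(via_query))
```

Note that both options keep the rows in their original order and keep the original index, unlike the loop version, which groups rows team by team. That does not change the means, so the sampling errors should come out the same.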
