Hello Everyone!
I am using the explanations from the Significance Testing section for my uni project. I was given data on adult students, and I am testing whether older women with kids academically outperform younger men with no kids. However, I am not sure whether I am using the correct type of test.
I split the data into the two groups like this (file is the whole dataframe):
elder_female_with_kids = file[(file["Age_by_birth_year"]>40) & (file["Kids"]>0) & (file["Gender"]=="Woman")]
elder_female_with_kids_average_grade = elder_female_with_kids["Average_grade"].mean()
younger_male_no_kids = file[(file["Age_by_birth_year"] <= 40) & (file["Kids"] == 0) & (file["Gender"] == "Man")]
younger_male_no_kids_average_grade = younger_male_no_kids["Average_grade"].mean()
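# Observed difference, taken as females minus males so it matches the direction of the simulated differences
mean_difference = elder_female_with_kids_average_grade - younger_male_no_kids_average_grade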
Then I followed the techniques suggested in the section to find the p-value.
import numpy as np
mean_differences = []
group_a_values = elder_female_with_kids["Average_grade"].tolist()
group_b_values = younger_male_no_kids["Average_grade"].tolist()
all_grades = group_a_values + group_b_values
for i in range(1000):
    # Randomly reassign every grade to one of the two groups
    males = []
    females = []
    for grade in all_grades:
        random_value = np.random.rand()
        if random_value >= 0.5:
            males.append(grade)
        else:
            females.append(grade)
    iteration_mean_difference = np.mean(females) - np.mean(males)
    mean_differences.append(iteration_mean_difference)
sampling_distribution = {}
# The loop variable has its own name so it does not overwrite the observed mean_difference
for simulated_difference in mean_differences:
    sampling_distribution[simulated_difference] = sampling_distribution.get(simulated_difference, 0) + 1
frequencies = []
for key in sampling_distribution.keys():
    # Collect the counts (not the keys) of simulated differences at least as large as the observed one
    if key >= mean_difference:
        frequencies.append(sampling_distribution[key])
sum_freq = np.sum(frequencies)
p_value = sum_freq / 1000
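For comparison, the same idea can be sketched more compactly. This is only a minimal sketch, not the method from the section: it shuffles the pooled grades and keeps the original group sizes instead of assigning each grade by a coin flip. The function name permutation_p_value, its parameters, and the toy grades are all made up for illustration.

```python
import numpy as np

def permutation_p_value(group_a, group_b, n_iter=1000, seed=0):
    """One-sided permutation test on mean(group_a) - mean(group_b).

    Returns the observed difference and the share of shuffles whose
    difference is at least as large as the observed one.
    """
    rng = np.random.default_rng(seed)
    group_a = np.asarray(group_a, dtype=float)
    group_b = np.asarray(group_b, dtype=float)
    observed = group_a.mean() - group_b.mean()
    pooled = np.concatenate([group_a, group_b])
    n_a = len(group_a)
    count = 0
    for _ in range(n_iter):
        # Shuffle the pooled grades and split them at the original group size
        shuffled = rng.permutation(pooled)
        simulated = shuffled[:n_a].mean() - shuffled[n_a:].mean()
        if simulated >= observed:
            count += 1
    return observed, count / n_iter

# Toy grades (invented, not from the project data)
toy_a = [90, 92, 95, 91, 94]
toy_b = [60, 62, 65, 61, 63]
observed, p_value = permutation_p_value(toy_a, toy_b)
print(observed, p_value)
```

With the real data, group_a_values and group_b_values from above would be passed in place of the toy lists.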
As a result, I got a high p-value, so I concluded that the initial difference was due to chance. However, I don’t understand whether this kind of test can be used on data aggregated by several columns. Or is it only appropriate for checking the dependency between TWO parameters (like Gender and Average Grade)?
Looking forward to your comments!