Significance Tests Applied to Data Aggregated by Multiple Columns

Hello Everyone!

I am using the explanations from the Significance Testing section for my uni project. I was given data on adult students, and I am testing whether older women with kids academically outperform younger men with no kids. However, I am not sure whether I am using the correct type of test.

I aggregated the data like this (file is the whole DataFrame):

elder_female_with_kids = file[(file["Age_by_birth_year"]>40) & (file["Kids"]>0) & (file["Gender"]=="Woman")]
elder_female_with_kids_average_grade = elder_female_with_kids["Average_grade"].mean()

younger_male_no_kids = file[(file["Kids"]==0) & (file["Age_by_birth_year"] <= 40) & (file["Gender"]=="Man")]
younger_male_no_kids_average_grade = younger_male_no_kids["Average_grade"].mean()

mean_difference = younger_male_no_kids_average_grade-elder_female_with_kids_average_grade

Then I followed the techniques suggested in the section to find the p-value.

import numpy as np

mean_differences = []
group_a_values = elder_female_with_kids["Average_grade"].tolist()
group_b_values = younger_male_no_kids["Average_grade"].tolist()
all_grades = group_a_values + group_b_values

for i in range(1000):
    males = []
    females = []
    for grade in all_grades:
        random_value = np.random.rand()
        if(random_value >= 0.5):
            males.append(grade)
        else:
            females.append(grade)
    iteration_mean_difference = np.mean(females) - np.mean(males)
    mean_differences.append(iteration_mean_difference)

sampling_distribution = {}

for mean_difference in mean_differences:
    if(sampling_distribution.get(mean_difference, False)):
        val = sampling_distribution.get(mean_difference)
        val = val+1
        sampling_distribution[mean_difference] = val
    else:
        sampling_distribution[mean_difference] = 1
frequencies = []

for key in sampling_distribution.keys():
    if key >= mean_difference:
        frequencies.append(key)

sum_freq = np.sum(frequencies)
p_value = sum_freq/1000

As a result, I got a high p-value, so I concluded that the initial difference was due to chance. However, I don’t understand whether this kind of test can be used on data aggregated by several columns. Or is it appropriate only for examining the dependency between TWO parameters (like Gender and Average Grade)?

Looking forward to your comments!

The code has several problems.

You computed mean_difference = younger_male_no_kids_average_grade-elder_female_with_kids_average_grade, so why did you reverse the order of the subtraction later, in the permutation loop (np.mean(females) - np.mean(males))?

This chunk is needlessly verbose:

for mean_difference in mean_differences:
    if(sampling_distribution.get(mean_difference, False)):
        val = sampling_distribution.get(mean_difference)
        val = val+1
        sampling_distribution[mean_difference] = val
    else:
        sampling_distribution[mean_difference] = 1

You can replace it with collections.Counter(mean_differences).
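As a minimal sketch of that replacement (the values below are made up for illustration and are not from the original data), Counter builds the same value-to-count mapping in one call:

```python
from collections import Counter

# Hypothetical permuted mean differences, chosen only for illustration.
mean_differences = [0.5, -0.25, 0.5, 0.0, -0.25, 0.5]

# Counter tallies how many times each distinct value occurs,
# which is what the manual if/else loop was doing.
sampling_distribution = Counter(mean_differences)

print(sampling_distribution[0.5])    # number of times 0.5 occurred
print(sampling_distribution[-0.25])  # number of times -0.25 occurred
```

A missing key returns 0 instead of raising KeyError, so no .get(..., False) check is needed.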

In for mean_difference in mean_differences:, the iteration variable mean_difference has the same name as the variable that previously stored younger_male_no_kids_average_grade-elder_female_with_kids_average_grade. That value gets overwritten by the iteration variable, so after the loop it holds the last value in mean_differences.
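The overwrite is easy to reproduce in isolation. In this small sketch the numbers and the names observed_mean_difference and permuted_difference are hypothetical, picked only to show the effect and one way to avoid it:

```python
# Stand-in for the observed difference between the two group means.
mean_difference = 10.0 - 7.5

# Hypothetical permuted differences.
mean_differences = [0.1, -0.3, 0.2]

# Reusing the name as the loop variable rebinds it on every iteration.
for mean_difference in mean_differences:
    pass

# The observed value (2.5) is gone; the name now holds the last list element.
overwritten = mean_difference

# Giving the observed value and the loop variable distinct names avoids this.
observed_mean_difference = 10.0 - 7.5
for permuted_difference in mean_differences:
    pass

print(overwritten, observed_mean_difference)
```

A Python for loop does not create a new scope, so the loop variable lives on in the enclosing namespace after the loop ends.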

Why did you create sampling_distribution but never use its values? I suppose you want to count how often each key that is at least as extreme as your observed difference appeared, but you are summing the keys themselves rather than the number of times they appear. That is probably why the p-values are huge.
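To make the difference concrete, here is a sketch with made-up numbers (observed_mean_difference and the list contents are assumptions, not the poster's data). It contrasts summing the keys with summing their counts for a one-sided test:

```python
from collections import Counter

# Hypothetical observed difference and permuted differences.
observed_mean_difference = 0.5
mean_differences = [0.75, -0.25, 0.5, 0.0, 1.0, 0.25, -0.5, 0.5]

sampling_distribution = Counter(mean_differences)

# What the original loop effectively does: add up the keys themselves.
buggy_sum = sum(
    diff for diff in sampling_distribution
    if diff >= observed_mean_difference
)

# What was intended: add up how often those keys occurred.
extreme_count = sum(
    count for diff, count in sampling_distribution.items()
    if diff >= observed_mean_difference
)

# One-sided p-value: share of permuted differences at least as large
# as the observed one.
p_value = extreme_count / len(mean_differences)

print(buggy_sum, extreme_count, p_value)
```

The dictionary is not strictly needed for this: sum(diff >= observed_mean_difference for diff in mean_differences) gives the same count directly.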

Here are some details on permutation tests: https://www.sciencedirect.com/topics/mathematics/permutation-test

I see you are only testing a single categorical variable (gender) in this example. It looks like permutation tests work for multiple variables too, but I am not sure how that is implemented.
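One way to look at it, offered as a sketch rather than a definitive answer: once the rows are filtered, the test only sees two lists of grades, so the shuffling works the same however many columns defined the groups. What it cannot tell you is which of those columns is responsible for a difference. The example below uses only the standard library and invented records. The function name permutation_test and the data are assumptions, and it shuffles the pooled grades while keeping the original group sizes, which differs from the coin-flip assignment in the original post.

```python
import random
from statistics import mean

def permutation_test(group_a, group_b, n_iter=1000, seed=0):
    """One-sided permutation test for mean(group_a) - mean(group_b)."""
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)  # reassign grades to the two labels at random
        # Same subtraction order as for the observed difference.
        permuted = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if permuted >= observed:
            extreme += 1
    return observed, extreme / n_iter

# Invented records; the column names follow the original post.
students = [
    {"Age_by_birth_year": 45, "Kids": 2, "Gender": "Woman", "Average_grade": 8.5},
    {"Age_by_birth_year": 52, "Kids": 1, "Gender": "Woman", "Average_grade": 9.0},
    {"Age_by_birth_year": 41, "Kids": 3, "Gender": "Woman", "Average_grade": 7.5},
    {"Age_by_birth_year": 48, "Kids": 1, "Gender": "Woman", "Average_grade": 8.0},
    {"Age_by_birth_year": 25, "Kids": 0, "Gender": "Man", "Average_grade": 7.0},
    {"Age_by_birth_year": 30, "Kids": 0, "Gender": "Man", "Average_grade": 8.0},
    {"Age_by_birth_year": 22, "Kids": 0, "Gender": "Man", "Average_grade": 6.5},
    {"Age_by_birth_year": 38, "Kids": 0, "Gender": "Man", "Average_grade": 7.5},
    {"Age_by_birth_year": 35, "Kids": 1, "Gender": "Woman", "Average_grade": 9.5},
    {"Age_by_birth_year": 50, "Kids": 0, "Gender": "Man", "Average_grade": 6.0},
]

# Groups defined by three columns at once; rows matching neither are ignored.
elder_female_with_kids = [
    s["Average_grade"] for s in students
    if s["Age_by_birth_year"] > 40 and s["Kids"] > 0 and s["Gender"] == "Woman"
]
younger_male_no_kids = [
    s["Average_grade"] for s in students
    if s["Age_by_birth_year"] <= 40 and s["Kids"] == 0 and s["Gender"] == "Man"
]

observed, p_value = permutation_test(elder_female_with_kids, younger_male_no_kids)
print(observed, p_value)
```

With groups this small the p-value is only illustrative. SciPy also provides scipy.stats.permutation_test, which may be preferable to a hand-written loop on real data.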


Hi, Thanks a lot!

Yes, I mixed up the variables. I have corrected that now and got a very reasonable p-value of 11%. The material on permutation tests is helpful too! Thanks