Multi category chi-squared tests

Screen Link:

My Code:

race_education_table <- table(income$race, income$education)
chisq.test(race_education_table)

What I expected to happen:
Statistical test output.

What actually happened:

Output:
Warning message in chisq.test(race_education_table):
“Chi-squared approximation may be incorrect”

Hey @aliabinti.abdulaziz,

A few things could produce that warning, but I suspect the main cause is that some combinations of the race and education variables have too few observations. The chi-squared test relies on an approximation that gets unreliable when the expected counts in some cells of the table are small, so R is warning you that the result may not be accurate enough to say whether a statistical difference is really there or not.
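If you want to check this yourself, here is a minimal sketch of how to look at the expected counts behind the test. It uses a small made-up data frame (`income_toy`, with invented values) rather than the real `income` data, and the cutoff of 5 is the usual rule of thumb for when the warning shows up.

```r
# Toy data standing in for the `income` data frame (values are made up).
income_toy <- data.frame(
  race = c(rep("White", 40), rep("Black", 6), rep("Other", 2)),
  education = rep(c("HS-grad", "Bachelors"), 24)
)
toy_table <- table(income_toy$race, income_toy$education)

# suppressWarnings() only hides the warning so we can inspect the result object.
toy_test <- suppressWarnings(chisq.test(toy_table))

# Expected counts under independence; small values here are what trigger the warning.
toy_test$expected

# Number of cells whose expected count is below 5.
low_cells <- sum(toy_test$expected < 5)
low_cells
```

Running the same two steps on `race_education_table` should show which race/education combinations are too sparse.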

This is actually pretty common in certain fields of research, particularly sociological research, where we have to contend with groups that make up a small minority of the sample. Sex/gender and race are typical examples, since many of the categories won’t be split anywhere near as evenly as you might expect. It’s kind of the reason why so much sociological research clumps a bunch of groups together and compares them against some majority group.

Now, this isn’t a bad thing altogether, and it’s still kinda valid to keep that degree of splitting. It’s just that our confidence in saying a statistical difference is actually there is not as great as when the groups are more evenly split. So if you really want to avoid that warning message, you’ll need to compromise: instead of breaking the race variable down into every possible group, create just 2 or 3 major groupings. An example would be something like “Caucasian vs. Non-Caucasian”.
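As a rough sketch of that kind of regrouping, the snippet below collapses a race column into two groups and reruns the test. The data frame, the category labels, and the `race_group` column are all made up for illustration, so swap in whatever levels your `income$race` column actually has.

```r
# Toy data (values and labels are made up).
income_toy <- data.frame(
  race = c(rep("White", 40), rep("Black", 12), rep("Asian", 4), rep("Other", 4)),
  education = rep(c("HS-grad", "Bachelors"), 30)
)

# Full breakdown: the small groups give expected counts below 5.
full_table <- table(income_toy$race, income_toy$education)
full_expected <- suppressWarnings(chisq.test(full_table))$expected

# Collapse race into two major groupings (hypothetical column name).
income_toy$race_group <- ifelse(income_toy$race == "White", "White", "Non-White")

# Rerun the test on the coarser table.
collapsed_table <- table(income_toy$race_group, income_toy$education)
collapsed_test <- chisq.test(collapsed_table)
collapsed_test$expected
```

With fewer, larger groups every expected count clears 5 in this toy example, which is the trade-off described above: you lose granularity but the approximation holds up.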

While this would be a good move statistically, research-wise it sucks because you’ll probably want to be more granular with your data analysis. To do that, though, you’ll need more data, which really isn’t a possibility here. This is sort of the prime example of why population-level research is such a pain to do.

  • Mike