Multi-category chisquare and degrees of freedom

TLDR: Should we be taking degrees of freedom into account when performing multi-category chi-squared tests?

Degrees of freedom were originally introduced at https://app.dataquest.io/m/99/chi-squared-tests/8/degrees-of-freedom

When we first used chisquare(), we were working with one-dimensional data, such as on
https://app.dataquest.io/m/99/chi-squared-tests/10/using-scipy, so the
scipy.stats.chisquare() default degrees of freedom of k - 1 - ddof (with ddof defaulting to 0)
made sense (k being the number of observed frequencies).

From what I can tell (https://en.wikipedia.org/wiki/Chi-squared_test#Example_chi-squared_test_for_categorical_data),
when using multi-category chi-squared tests we need to adjust the degrees of
freedom to be (num_rows - 1)(num_cols - 1).

When we do the examples on https://app.dataquest.io/m/100/multi-category-chi-squared-tests/4/finding-statistical-significance,
we treat the observed values as if they have 3 degrees of freedom. Since we are
using a 2x2 crosstab, I would have expected us to adjust
the degrees of freedom down to (2 rows - 1)(2 cols - 1) = 1*1 = 1 by
setting the ddof parameter of scipy.stats.chisquare() to 2.
Given that we have 4 pieces of data, k - 1 - ddof = 4 - 1 - 2 would give us the
correct 1 degree of freedom.
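The arithmetic above can be sketched in a few lines of plain Python. The helper names goodness_of_fit_dof and contingency_dof are hypothetical and exist only for this illustration; they are not part of scipy.

```python
def goodness_of_fit_dof(k, ddof=0):
    """Degrees of freedom used by a one-dimensional chi-squared test: k - 1 - ddof."""
    return k - 1 - ddof


def contingency_dof(num_rows, num_cols):
    """Degrees of freedom for a test of independence on a rows x cols table."""
    return (num_rows - 1) * (num_cols - 1)


num_rows, num_cols = 2, 2
k = num_rows * num_cols  # 4 cells once the 2x2 table is flattened

default_dof = goodness_of_fit_dof(k)               # what ddof=0 implies
target_dof = contingency_dof(num_rows, num_cols)   # what the table calls for
required_ddof = k - 1 - target_dof                 # ddof needed to bridge the gap

print(default_dof, target_dof, required_ddof)  # prints: 3 1 2
```

For a general table, k - 1 - (num_rows - 1)(num_cols - 1) simplifies to num_rows + num_cols - 2, which is 2 for the 2x2 case.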

I am trying to understand the impact of degrees-of-freedom and why we didn’t need to take it into account when doing the missions or the Jeopardy guided project.

Using the following contrived example, the choice of ddof makes a considerable difference to the p-value:

b    one  two  All
a                 
bar    7    5   12
foo   13    6   19
All   20   11   31
from scipy import stats

# Observed

o_one_bar = 7
o_two_bar = 5
o_one_foo = 13
o_two_foo = 6

observed = (o_one_bar, o_two_bar, o_one_foo, o_two_foo)

print("Observed:", observed)

# Totals
t_all = sum(observed)
t_one = o_one_bar + o_one_foo
t_two = o_two_bar + o_two_foo
t_bar = o_one_bar + o_two_bar
t_foo = o_one_foo + o_two_foo

# Expected counts under independence: column total * row total / grand total
e_one_bar = t_one*t_bar/t_all
e_two_bar = t_two*t_bar/t_all
e_one_foo = t_one*t_foo/t_all
e_two_foo = t_two*t_foo/t_all

expected = (e_one_bar, e_two_bar, e_one_foo, e_two_foo)

print("Expected:", expected)

Observed: (7, 5, 13, 6)
Expected: (7.741935483870968, 4.258064516129032, 12.258064516129032, 6.741935483870968)

print(stats.chisquare(observed, expected))

Power_divergenceResult(statistic=0.32693381180223313, pvalue=0.9548858632175412)

print(stats.chisquare(observed, expected, ddof=2))

Power_divergenceResult(statistic=0.32693381180223313, pvalue=0.5674701732069024)
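One way to cross-check the ddof=2 result is to hand the same 2x2 table to scipy.stats.chi2_contingency, which derives the expected counts and the degrees of freedom from the table's shape. This is a minimal sketch, assuming NumPy is available; correction=False is passed because chi2_contingency applies Yates' continuity correction to 2x2 tables by default, which would change the statistic and make the comparison unfair.

```python
import numpy as np
from scipy import stats

# Rows are bar/foo, columns are one/two, matching the crosstab above.
table = np.array([[7, 5],
                  [13, 6]])

# Test of independence; dof comes from the table shape: (rows - 1) * (cols - 1).
statistic, pvalue, dof, expected = stats.chi2_contingency(table, correction=False)

# The same test expressed through chisquare() on the flattened table,
# with ddof=2 so that k - 1 - ddof = 4 - 1 - 2 = 1.
manual = stats.chisquare(table.ravel(), expected.ravel(), ddof=2)

print("dof:", dof)
print("chi2_contingency:", statistic, pvalue)
print("chisquare(ddof=2):", manual.statistic, manual.pvalue)
```

If the two calls agree, that suggests ddof=2 is the adjustment that makes chisquare() behave like a test of independence on a 2x2 table.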
