Correlations do not match

Screen Link:
https://app.dataquest.io/m/217/guided-project%3A-analyzing-nyc-high-school-data/1/introduction

My Code:

correlations = combined.corr()
correlations = correlations['sat_score']
correlations

What I expected to happen:

SAT Critical Reading Avg. Score         0.986820
SAT Math Avg. Score                     0.972643
SAT Writing Avg. Score                  0.987771
sat_score                               1.000000
AP Test Takers                          0.523140

What actually happened:

SAT Critical Reading Avg. Score    0.472399
SAT Math Avg. Score                0.465612
SAT Writing Avg. Score             0.472854
sat_score                          1.000000
AP Test Takers                     0.254925

Refer to the solution notebook provided by DQ here.

Refer to my notebook here.

I compared my steps against those in the solution, and they seem to match. However, my correlation values don’t match the expected ones and are roughly half as large.

Please suggest what can be done.

Thank You.

Thank you for asking a clear question. It makes the problem much easier to understand and investigate. Had the question been poorly asked, I don’t think a proper answer would have been achievable.

Take a look around cell run number 8:

data['sat_results']['sat_score'] = data['sat_results'][cols].sum(axis=1)

Here, you’re summing columns in which there are missing values. In fact, for some rows, all the values are missing.

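With its default parameters (skipna=True, min_count=0), sum skips nulls, so a row in which every value is null sums to 0 rather than NaN. Those zeros are then treated as real scores, which would explain the lower correlations.

If you include min_count=3, any row with fewer than three non-null values sums to NaN instead, and you should get the same results as the solution.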
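To see the difference on a small scale, here is a minimal sketch. The `scores` frame and its values are made up for illustration; it only stands in for data['sat_results'][cols] from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data['sat_results'][cols]:
# one complete row, one partly missing row, one fully missing row.
scores = pd.DataFrame({
    "SAT Critical Reading Avg. Score": [400.0, 380.0, np.nan],
    "SAT Math Avg. Score": [420.0, np.nan, np.nan],
    "SAT Writing Avg. Score": [410.0, 390.0, np.nan],
})

# Defaults (skipna=True, min_count=0): nulls are skipped,
# so the fully missing row sums to 0.
default_sum = scores.sum(axis=1)

# min_count=3: a row needs three non-null values, otherwise the sum is NaN.
strict_sum = scores.sum(axis=1, min_count=3)

print(default_sum.tolist())  # [1230.0, 770.0, 0.0]
print(strict_sum.tolist())   # [1230.0, nan, nan]
```

Note that the default sum also returns a partial total (770.0) for the row with one missing score, which is not a valid SAT total either.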

Thank you so much for clearing this up.
I added skipna=False to the sum call, and that solved the problem.
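For anyone comparing the two fixes, a small sketch on made-up data (the `scores` frame and its short column names are hypothetical) suggests that, for three columns, skipna=False and min_count=3 give the same result, and both match adding the columns with +:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one complete row, one partly missing, one fully missing.
scores = pd.DataFrame({
    "reading": [400.0, 380.0, np.nan],
    "math": [420.0, np.nan, np.nan],
    "writing": [410.0, 390.0, np.nan],
})

# skipna=False: any null in the row makes the sum NaN.
with_skipna_false = scores.sum(axis=1, skipna=False)

# min_count=3: fewer than three non-null values makes the sum NaN.
with_min_count = scores.sum(axis=1, min_count=3)

# Plain addition also propagates nulls.
added = scores["reading"] + scores["math"] + scores["writing"]

for result in (with_skipna_false, with_min_count, added):
    print(result.tolist())  # [1230.0, nan, nan] each time
```

Either way, rows without a complete set of scores end up as NaN instead of 0 or a partial total.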

Thank you for demonstrating clarity of thought and how to formulate a question. Helpful.
