Correlations Not Matching - Where Did I go wrong? (NYC SAT Project - Python)

I’m following along with the NYC SAT project and my output doesn’t match DataQuest’s.

Below is a screenshot of my output from Jupyter:

Below is a screenshot that DataQuest expects:

WeTransfer link containing a Zip folder of data and the notebook:

What I expected to happen:
Up to this point DataQuest’s outputs and my own appear to match. I’m unsure why the correlations don’t match (mine are roughly half of what is expected).


Hi @kevindarley2024 and welcome to the community!

Unfortunately I wasn’t able to spot the source of your discrepancy, but I did notice a couple of interesting things: the combined dataframe in your notebook has many 0 values in the sat_score column, whereas the DQ version has values of 1223.438806 instead.

It’s been a while since I did this walkthrough so I don’t remember all the details but…were we asked to fill in missing values with the mean at some point?
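To illustrate why that question matters, here is a minimal sketch on made-up numbers (the series below is hypothetical, not taken from the project data). It assumes the missing values are filled with `fillna` and the column mean: a NaN gets replaced by the mean, while a 0 is treated as a real score and left alone.

```python
import numpy as np
import pandas as pd

# Hypothetical toy scores; the real `combined` dataframe is much larger.
scores_nan = pd.Series([1200.0, 1300.0, np.nan, 1100.0], name="sat_score")

# Same data, but with the missing score stored as 0 instead of NaN.
scores_zero = scores_nan.fillna(0)

# Mean imputation only touches NaN entries.
filled_nan = scores_nan.fillna(scores_nan.mean())
filled_zero = scores_zero.fillna(scores_zero.mean())

print(filled_nan.tolist())   # the NaN becomes the mean of the observed scores
print(filled_zero.tolist())  # the 0 survives, because it is not missing
```

If the zeros are really placeholders for missing scores, they would slip past a mean fill like this and stay in the column as extreme low values.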

Also, upon immediately running your notebook locally, I noticed the top three correlations are different than what’s being displayed in your screenshot as well as the DQ output:

My best guess is that there’s an issue with the combined dataframe and that the sat_scores within are not completely correct.

Hi Mike!

Thanks so much for the reply.

So I just went through some debugging and you hit the nail on the head! When comparing the value_counts of the sat_score column, there are 57 NaN values in the DQ dataset and 57 0s in my own. After replacing the 0s in my dataset with np.nan and rerunning the notebook, the correlations matched.
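For anyone hitting the same thing, the check and the fix can be sketched roughly like this. The frame and the `ap_takers` column are made up for illustration; only `sat_score` and the 0-to-NaN replacement come from the steps described above.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the `combined` dataframe; all values are invented.
combined = pd.DataFrame({
    "sat_score": [1500.0, 1400.0, 0.0, 1200.0, 0.0, 1000.0],
    "ap_takers": [300.0, 250.0, 280.0, 150.0, 120.0, 60.0],
})

# Count suspicious zeros and genuine NaNs in the score column.
n_zero = int((combined["sat_score"] == 0).sum())
n_nan = int(combined["sat_score"].isna().sum())
print(n_zero, n_nan)

# Correlation while the zeros are still treated as real scores.
corr_with_zeros = combined["sat_score"].corr(combined["ap_takers"])

# Replace the zeros with NaN; Series.corr then ignores those rows.
fixed = combined.copy()
fixed["sat_score"] = fixed["sat_score"].replace(0, np.nan)
corr_with_nan = fixed["sat_score"].corr(fixed["ap_takers"])

print(corr_with_zeros, corr_with_nan)
```

On this toy data the correlation is much weaker with the zeros left in, which is the same direction as the "roughly half" effect in the notebook. `value_counts(dropna=False)` is another way to see NaN counts alongside the zeros.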

DQ didn’t go over this explicitly; however, my methodology differed from DQ’s when creating the SAT scores. Their methodology automatically set empty values to np.nan and mine created zeros. I’m unsure why this happens, but simply replacing my code with theirs works (as does adding the conversion code). Screenshots below.

My Code:

DQ Code:
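A possible explanation for the zeros, offered as a guess and not as either notebook's actual code: if the total is built with a row-wise `sum`, pandas skips NaN by default and returns 0 for a row where every section score is missing, whereas adding the columns with `+` propagates NaN. The column names below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical section-score columns; the second school has no scores at all.
cols = ["math", "reading", "writing"]
sat = pd.DataFrame({
    "math": [400.0, np.nan],
    "reading": [410.0, np.nan],
    "writing": [390.0, np.nan],
})

# Row-wise sum skips NaN by default, so an all-NaN row totals 0.
total_sum = sat[cols].sum(axis=1)

# Adding the columns with `+` propagates NaN instead.
total_add = sat["math"] + sat["reading"] + sat["writing"]

# min_count makes sum return NaN unless enough non-missing values are present.
total_min_count = sat[cols].sum(axis=1, min_count=len(cols))

print(total_sum.tolist())
print(total_add.tolist())
print(total_min_count.tolist())
```

If that is what happened, passing `min_count` to `sum` would be an alternative to replacing the zeros afterwards.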

Nicely done! I knew I was in the neighbourhood but couldn’t figure it out. Glad you got it across the finish line!

