Working with missing data - Part 3 of Data Scientist path

In part 3 of the Data Scientist path, there is a lesson about working with missing data in the course Data Cleaning in Python: Advanced. The lesson mentions building a correlation plot for the columns with missing values, and I was trying to understand the purpose of this. I understand correlation plots as they were explained in earlier lessons in the path. For example, if the temperature rises at an amusement park, that could correlate with people buying more water at concession stands. I just don’t understand how a correlation plot helps us analyze columns with missing values. I took a screenshot below. Hopefully, I explained this well enough. Please let me know if I didn’t.

I’d say the broader purpose is to help us better understand how to iterate on data exploration and analysis.

  • We created a heatmap to visualize missing values.
  • We found an interesting pattern in the last 10 columns based on the heatmap.
  • We narrowed our focus to those columns and visualized the correlation between them because we noticed that they had similar patterns of null/non-null values.
  • We found relatively high correlations between 5 pairs of columns. As a reminder, we started by noticing patterns in the missing values.
  • We focus next on these 5 pairs to find:
    • The number of rows where the vehicle is missing but the cause is not.
    • The number of rows where the cause is missing but the vehicle is not.
  • We then decide on what to do with the missing values in those 10 columns.
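
To make those steps more concrete, here is a rough sketch in pandas. The DataFrame, the column names (vehicle_1, cause_1, and so on), and the helper count_mismatches are all made up for illustration and are not the lesson's dataset or code. I'm also assuming the correlation is computed on null/non-null indicators rather than on the values themselves, since the columns hold text.

```python
import pandas as pd

# Hypothetical toy data; the column names are assumptions, not the lesson's dataset.
df = pd.DataFrame({
    "vehicle_1": ["Sedan", None, "Taxi", "Bus", None, "Sedan"],
    "cause_1": ["Speeding", "Fatigue", None, "Speeding", None, "Glare"],
    "vehicle_2": ["Taxi", None, None, "Sedan", None, None],
    "cause_2": ["Glare", None, None, "Fatigue", None, "Speeding"],
})

# Turn every column into a 0/1 indicator: 1 where the value is missing.
null_flags = df.isnull().astype(int)

# Correlate the indicators. A high value means two columns tend to be
# missing (or present) in the same rows.
null_corr = null_flags.corr()


def count_mismatches(frame, vehicle_col, cause_col):
    """Count rows where exactly one column of a vehicle/cause pair is missing."""
    v_null = frame[vehicle_col].isnull()
    c_null = frame[cause_col].isnull()
    return {
        "vehicle_missing_only": int((v_null & ~c_null).sum()),
        "cause_missing_only": int((c_null & ~v_null).sum()),
    }


pair_1 = count_mismatches(df, "vehicle_1", "cause_1")
pair_2 = count_mismatches(df, "vehicle_2", "cause_2")

print(null_corr.round(2))
print(pair_1)
print(pair_2)
```

The correlation matrix tells you which columns are worth looking at together, and the mismatch counts tell you how many rows would need a value filled in on one side of each pair.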

I think the content could have explained this better at a meta level, but I would say the value here lies more in learning how to iterate on the analysis, which is a skill worth developing.

Iterating and narrowing down the scope led to focusing on imputation across multiple columns. The alternative, which many people would default to, is to calculate the number of missing values in each column and then impute one column at a time. Instead, this approach helped us identify correlations between columns and handle those columns together.

I would say it’s a pretty well-crafted lesson in that regard.