Filling in missing values

This is a general data science question about when and how to fill in missing values in a dataset. I just finished the employee exit survey guided project: https://app.dataquest.io/m/348/guided-project%3A-clean-and-analyze-employee-exit-surveys/10/perform-initial-analysis

The instructions recommend filling in the missing values with the most frequent value of that series. Why would we fill them in this way instead of dropping them (or doing a deeper dive of other available data if possible to make more informed guesses)?

Assuming the rows with missing values are typical of the rest of the column, the “most frequent” value is a sensible default for categorical data.

It is a reasonable decision when there are few missing values, because filling them in barely changes the column’s distribution.
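Here is a minimal sketch of that approach in pandas. The column name `job_satisfaction` and its values are made up for illustration and are not taken from the guided project's data.

```python
import pandas as pd

# Hypothetical survey column with two missing answers.
df = pd.DataFrame(
    {"job_satisfaction": ["Satisfied", "Satisfied", None, "Dissatisfied", None, "Satisfied"]}
)

# Check what share of the column is missing before deciding how to handle it.
missing_share = df["job_satisfaction"].isna().mean()

# mode() ignores missing values and returns a Series (there can be ties),
# so take the first entry as the fill value.
most_frequent = df["job_satisfaction"].mode().iloc[0]

# Keep the original column and write the filled version next to it.
df["job_satisfaction_filled"] = df["job_satisfaction"].fillna(most_frequent)
```

Writing the result to a new column is only a convenience for comparing before and after. Checking `missing_share` first helps you judge whether "few missing values" actually applies.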

When there are more missing values, but perhaps not so many that you’d drop the column, you can instead impute them with a constant, such as a separate “Unknown” category, so that the missing entries stay distinguishable from real answers.
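As a rough illustration of constant imputation for a categorical column, the sketch below fills the gaps with a placeholder label. The column name `contributing_factor` and the label "Unknown" are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical column where half of the answers are missing.
df = pd.DataFrame(
    {"contributing_factor": ["Career move", None, "Ill health", None, None, "Career move"]}
)

# Fill every missing entry with one placeholder label instead of a real category.
df["contributing_factor_filled"] = df["contributing_factor"].fillna("Unknown")

# The placeholder now shows up as its own group in any frequency count.
counts = df["contributing_factor_filled"].value_counts()
```

Compared with the most frequent value, a placeholder does not inflate an existing category, and you can still see how many rows were originally missing.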

Lastly, if you don’t mind a more complex pipeline, you can impute the missing values by fitting a model that predicts this column from the other columns in the dataset. You can do this in scikit-learn using the IterativeImputer class.
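A small sketch of model-based imputation with scikit-learn follows. The numbers are invented (years of service and age). Note that `IterativeImputer` is still marked experimental, so it has to be enabled with an extra import, and it expects numeric input, so categorical columns would need to be encoded first.

```python
import numpy as np
# IterativeImputer is experimental; this import must come before the next one.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical numeric data: [years_of_service, age], with one gap in each column.
X = np.array([
    [1.0, 23.0],
    [3.0, 27.0],
    [5.0, 31.0],
    [7.0, np.nan],
    [9.0, 39.0],
    [np.nan, 43.0],
])

# Each column with missing values is modelled as a function of the other columns.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)

# Observed entries pass through unchanged; only the gaps are estimated.
observed = ~np.isnan(X)
```

In a real project you would fit the imputer on the training data only and reuse it to transform any validation or test data, so that information does not leak between the splits.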

There is a mind map in chapter 2 (page 41) of Real-World Machine Learning by Henrik Brink, Joseph W. Richards, and Mark Fetherolf that gives a great introduction to dealing with missing values.
It introduces a few considerations:

  1. Does the missing data have meaning? (MCAR, MAR, or MNAR?)
  2. Is the dataset large or small?
  3. Is the data temporally ordered?
  4. Does the data follow a simple distribution?
  5. Does the data have outliers?