Should we fill missing values using mean? Will it have any negative consequences?

Hello, guys! I would start a new thread, but I think it is worth using this one to ask another question about the consequences of using the fillna() method.

In this exercise:

https://app.dataquest.io/m/347/working-with-missing-and-duplicate-data/11/handling-missing-values-with-imputation

We have to check whether filling the missing values of a column with its mean would change the column mean itself. In the proposed case, the mean stays the same. In the next part of the exercise, it was decided to drop the rows with missing values instead, because keeping them would affect the distribution.
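To make that concrete, here is a minimal sketch with made-up numbers (not the exercise's dataset; the variable names are only for illustration). It suggests why the mean can stay the same while the distribution still changes: the filled values all sit exactly at the mean, so the spread shrinks.

```python
import numpy as np
import pandas as pd

# Made-up scores (an assumption, not the course data); NaN marks missing entries.
scores = pd.Series([7.5, 6.9, np.nan, 5.2, np.nan, 4.4, 6.0])

# Fill the gaps with the mean of the observed values.
filled = scores.fillna(scores.mean())

# Series.mean() and Series.std() skip NaN by default.
mean_before, mean_after = scores.mean(), filled.mean()
std_before, std_after = scores.std(), filled.std()

print("mean:", mean_before, "->", mean_after)  # the mean is preserved
print("std: ", std_before, "->", std_after)    # the spread gets smaller
```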

But what if we decided to keep the mean-filled values? I think that would affect other kinds of analysis, where we want to compare the trend of specific values across different series.

For example, if I wanted to check how happiness scores evolve in each country per year, I would end up with values that were originally missing but were replaced by the mean of the entire series. That would show up in a plot with a separate line for each country's trend.
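A small sketch of what I mean, using invented countries and scores (the column names `country`, `year` and `score` are my own assumptions, not the exercise's):

```python
import numpy as np
import pandas as pd

# Invented data: country "B" scores low and is missing its 2016 value.
df = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "B"],
    "year": [2015, 2016, 2017, 2015, 2016, 2017],
    "score": [7.4, 7.5, 7.6, 3.4, np.nan, 3.6],
})

# Fill with the mean of the entire column, ignoring country and year.
overall_mean = df["score"].mean()
df["score_filled"] = df["score"].fillna(overall_mean)

# Country B's line now jumps up to the overall mean in 2016.
trend_b = df.loc[df["country"] == "B", ["year", "score", "score_filled"]]
print(trend_b)
```

In a line plot, country B would show a spike in 2016 that comes only from the imputation, not from the data.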

Is this question really worth our attention during a data cleaning process? Or is it a factor that, in most cases, won't have any bad impact on our EDA?

Wish you all a nice weekend!
Paulo


Hi @phssaraiva,

It depends on the situation. If there are only a few missing values, then filling them with the mean can be an option without much negative consequence. However, in cases like analyzing data year by year, taking the mean of all years can distort the result. In those cases, if there are enough values for each year, you should compute the mean for that specific year and use it to replace the missing values for that year.
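As a rough sketch of the per-year approach (the data and the column names `country`, `year` and `score` are made up for illustration), you could combine `groupby` with `transform` so that each missing value is filled with the mean of its own year:

```python
import numpy as np
import pandas as pd

# Made-up data: one missing score in each year.
df = pd.DataFrame({
    "country": ["A", "B", "C", "A", "B", "C"],
    "year": [2015, 2015, 2015, 2016, 2016, 2016],
    "score": [7.0, np.nan, 5.0, 8.0, 6.0, np.nan],
})

# Mean score of each year, aligned row by row with the original frame.
year_mean = df.groupby("year")["score"].transform("mean")

# Each missing value is replaced by the mean of its own year only.
df["score_filled"] = df["score"].fillna(year_mean)
print(df)
```

Here the 2015 gap is filled with the 2015 mean and the 2016 gap with the 2016 mean, rather than one value for the whole column.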

Using the mean to replace missing values is not an optimal method, but it is a beginner-friendly one.

This article describes the negative consequences of using mean for filling missing values:

https://towardsdatascience.com/stop-using-mean-to-fill-missing-data-678c0d396e22

That said, finding the best way to fill missing values is entirely situational. If you are casually exploring, then even a less optimal method like the mean is fine. However, to improve the accuracy of machine learning models, you have to make sure that you understand the data thoroughly before attempting to fill its missing values. This article explains how to handle missing values in great detail:

https://towardsdatascience.com/how-to-handle-missing-data-8646b18db0d4

Best,
Sahil
