Substituting missing values in the test set

Dear all,

I was working through <Feature Preparation, Selection and Engineering> in the Intro to Kaggle course and found something strange in how the solution handles missing data in the test set. By test set I mean the set you evaluate on at the very end (on the Dataquest platform it is called “holdout”): they perform cross-validation on the train set and then test the final model on this “holdout” set.

Basically, since one feature has a few missing values in the “holdout” set, they fill them with the mean of that same feature computed on the training set:

holdout["Fare"] = holdout["Fare"].fillna(train["Fare"].mean())

To me this looks strange: I would have expected the mean to be computed from the “holdout” set itself, so as not to contaminate the final test set. In particular I would have done:

holdout["Fare"] = holdout["Fare"].fillna(holdout["Fare"].mean())

What do you think?

Thanks in advance,
Jessica

Hello Jessica! You know, I read this hours ago and have kept brainstorming the best way to answer it, while also wondering whether you were right or not.

Here is what I think. I cannot assure you that it’s the reason though.

I believe this should have been dealt with before splitting the data into a train set and a test (holdout) set. All missing values would then have been filled with the mean of that column, computed over the whole dataset, so no questions asked.

But now that this is being done after the split, using the mean of that column from the train set makes more sense to me: that set is bigger, so its mean will be much closer to the mean we would have gotten from the whole dataset, especially since the holdout set, besides being smaller, also has missing values. Using the mean from the holdout set could really skew the result.

That answer is more a theory than a certainty though. :grinning:

Hello Yemi,

thank you for your answer.

Indeed, looking around on the internet I have found two main approaches:

1- Some people combine the train and test sets and use all the data to compute, e.g., the mean of that column, then substitute the missing values with this mean;
2- Others think that combining train and test is a sort of cheating (since in reality you would have no test set available while training your model), so they compute, e.g., the mean of the column on the training set only and then substitute missing values in both the train and test sets with that same mean.
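
For what it's worth, the two approaches in the list above can be sketched in pandas roughly as follows. This is only a minimal illustration: the toy `train` and `holdout` frames and their values are made up (they are not the course data), and the variable names are hypothetical.

```python
import pandas as pd

# Toy data, made up for illustration (not the data from the course)
train = pd.DataFrame({"Fare": [10.0, 20.0, None, 30.0]})
holdout = pd.DataFrame({"Fare": [40.0, None]})

# Approach 1: mean over train and holdout combined
# (Series.mean() skips missing values by default)
combined_mean = pd.concat([train["Fare"], holdout["Fare"]]).mean()

# Approach 2: mean over the training set only
train_mean = train["Fare"].mean()

# Fill the holdout gaps under each approach
holdout_combined = holdout["Fare"].fillna(combined_mean)
holdout_train_only = holdout["Fare"].fillna(train_mean)

# Under approach 2 the train set is filled with the same train-only mean
train_filled = train["Fare"].fillna(train_mean)

print(combined_mean, train_mean)
```

With these toy numbers the two fill values differ (25.0 for the combined mean, 20.0 for the train-only mean), which is exactly the gap the two camps disagree about.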

So in general I would say that your reasoning seems correct!
What do you think about combining the train and test sets in Kaggle competitions?

Best,
Jessica

@jessica.lanini Great!! Now, we both learned something. :grinning:

About your question. I would advise against it, simply because it’s extra work and I am not sure it’s necessarily worth it. If the two options are valid and will get us close enough to a reasonable solution then why go through the extra trouble of combining the two sets just to get the mean? Unless I am wrong and you have some other reason to want to combine both?

The reason is that it is statistically better to compute it jointly if both train and test data come from the same pool of data (law of large numbers).

That’s a great point! And if you feel more comfortable doing it then absolutely! :grinning: Personally, I am still quite perplexed, because the test set is often sooooo much smaller than the train set that I feel the mean of the train set alone would be close enough to the mean of the whole dataset anyway. That said, I might consider combining them myself if the test set isn’t so small compared to the train set as to be negligible in such calculations.
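
A quick way to check this intuition is a small simulation. The sketch below assumes that train and holdout are drawn from the same distribution (an exponential with an arbitrary scale, chosen only for illustration), with made-up sizes of 900 and 100 rows and five holdout values knocked out. None of these numbers come from the course.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Synthetic "fares" drawn from one common distribution (an assumption)
train = pd.DataFrame({"Fare": rng.exponential(scale=30.0, size=900)})
holdout = pd.DataFrame({"Fare": rng.exponential(scale=30.0, size=100)})
holdout.loc[:4, "Fare"] = np.nan  # .loc is label-inclusive: rows 0..4 become missing

# Means ignore missing values by default
train_mean = train["Fare"].mean()
holdout_mean = holdout["Fare"].mean()
combined_mean = pd.concat([train["Fare"], holdout["Fare"]]).mean()

# How far each single-set mean is from the combined mean
gap_train = abs(train_mean - combined_mean)
gap_holdout = abs(holdout_mean - combined_mean)

print(f"train only: {train_mean:.2f}")
print(f"holdout only: {holdout_mean:.2f}")
print(f"combined: {combined_mean:.2f}")
print(f"gap (train vs combined): {gap_train:.2f}")
print(f"gap (holdout vs combined): {gap_holdout:.2f}")
```

Because the combined mean is a weighted average of the two, it always sits closer to the mean of whichever set has more non-missing values, which here is the train set. How much that matters in practice depends on how small the holdout set really is.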