Machine Learning model and validation


I have a question that is probably very stupid, but here it goes.

When we use holdout or cross-validation to measure the performance of the model, is this also a way to develop the model?

For example, I build a model with an 80% train / 20% test split and get an RMSE of 128. Then I use cross-validation with 5 splits and get an average RMSE of 125, where the RMSE of the first fold is 123.
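To make the setup concrete, here is a minimal sketch of the two evaluation schemes described above, using scikit-learn on synthetic data (the dataset, the linear model, and the resulting RMSE values are assumptions for illustration, not the original poster's actual setup):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

# Synthetic regression data (an assumption; any tabular dataset works here)
X, y = make_regression(n_samples=500, n_features=10, noise=20.0, random_state=0)

# Scheme 1 -- holdout: 80% train / 20% test, one RMSE
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = LinearRegression().fit(X_train, y_train)
holdout_rmse = np.sqrt(np.mean((model.predict(X_test) - y_test) ** 2))

# Scheme 2 -- 5-fold cross-validation: five RMSEs, one per fold
neg_mse = cross_val_score(
    LinearRegression(), X, y, cv=5, scoring="neg_mean_squared_error"
)
cv_rmses = np.sqrt(-neg_mse)

print(f"holdout RMSE: {holdout_rmse:.1f}")
print(f"CV RMSEs per fold: {np.round(cv_rmses, 1)}, mean: {cv_rmses.mean():.1f}")
```

Both schemes produce error estimates; neither produces "the model" by itself, which is the point the replies below get at.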

Does that mean I should take the train/test split from the first fold of cross-validation and use the model trained on it as my final model to be applied to new data?



That’s not how it works. Your model is one thing; how you assess its quality is another. Cross-validation helps you estimate how good the model is, but it can’t actually improve the model. The model is what it is.

Say you have a model in production: it makes whatever predictions it makes, regardless of the error-metric values you got during evaluation.

If you pick the data split that minimizes your error metric, the model is still going to predict whatever it predicts on new data, regardless of how well it did on the test data. In fact, the test data isn’t even part of the equation at that point: making predictions doesn’t require a test set at all.

Moreover, cherry-picking the split with the lowest error increases the chance of overfitting: you’d be tuning your choice to one particular test set rather than to how the model generalizes.
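To illustrate the point that test data isn’t part of prediction: once a model is fitted, `predict()` uses only the learned parameters and the new inputs. This is a minimal sketch with assumed synthetic data, not anyone’s production model:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Training data (an assumption for illustration)
X_train = rng.normal(size=(100, 3))
y_train = X_train @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X_train, y_train)

# New, unseen inputs: the prediction depends only on the fitted coefficients
# and X_new -- no test set appears anywhere in this computation.
X_new = rng.normal(size=(5, 3))
preds = model.predict(X_new)
```

Whichever test split you used to evaluate the model, `preds` comes out the same, because evaluation and prediction are separate steps.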

I hope this was helpful.


Yes, thank you!
I think I fully grasp it now. We work with train and test splits for the purpose of evaluation, and once we’re happy with the result, we discard those splits and train a final model on the entire dataset.
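That two-step workflow can be sketched like this (again on assumed synthetic data; the model choice is hypothetical):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=10, noise=20.0, random_state=0)

# Step 1: use cross-validation only to ESTIMATE performance on unseen data.
neg_mse = cross_val_score(
    LinearRegression(), X, y, cv=5, scoring="neg_mean_squared_error"
)
estimated_rmse = np.sqrt(-neg_mse).mean()

# Step 2: happy with the estimate? Fit the final model on the ENTIRE dataset.
# The CV folds are discarded; only the performance estimate is kept as a
# report of how this final model is expected to do on new data.
final_model = LinearRegression().fit(X, y)
```

The estimate from step 1 describes the modeling procedure, and the model from step 2 is the one you actually deploy.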
