Linear Regression Model Workflow Question: Why Predict on the Training Set?

Screen Link:
https://app.dataquest.io/m/236/feature-selection/4/train-and-test-model

Hello,

General question here about the Linear Regression model workflow:

I have noticed that in this mission we are using .predict() on the training set as well as the test set.

What is the purpose of predicting on the training set?

I might be misunderstanding this part of the workflow, but isn’t the goal to train on the training set and predict only on the test set?

Why are we predicting on the training set as well?

Thank you for your time!

Predicting on the training set shows whether the model has enough capacity to fit (or even overfit) the training data. If it cannot fit the training data, it is unlikely to work on the test data either. If it fits the training data but not the test data, that is a sign of overfitting. If it performs well on both the training and test sets, that is the generalizable model we want.
In regression, the R-squared values that R reports by default (for example, in the summary() output of an lm fit) are training-set R-squared. Getting the test-set R-squared takes extra work. The training-set value is what you get by default because the model has to fit the training data before it makes sense to ask whether it fits the test data.
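
To make the training-versus-test comparison concrete, here is a minimal sketch in Python with scikit-learn. The synthetic data, the 150/50 split, and the variable names are assumptions made up for illustration; they are not the mission's dataset or code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data, assumed for illustration: y = 3x + 2 plus Gaussian noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3 * X[:, 0] + 2 + rng.normal(0, 1, size=200)

# Simple holdout split: first 150 rows for training, last 50 for testing
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

model = LinearRegression().fit(X_train, y_train)

# For LinearRegression, score() returns R-squared on the data passed in
train_r2 = model.score(X_train, y_train)  # training-set R-squared
test_r2 = model.score(X_test, y_test)     # test-set R-squared (the "extra work")

print(f"train R-squared: {train_r2:.3f}")
print(f"test R-squared:  {test_r2:.3f}")
```

A low value on both sets would suggest the model cannot fit the data at all, while a high training value paired with a much lower test value would point toward overfitting.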

Your understanding of the goal is right. As part of that goal, evaluating on the training set is important too.
The point is not to use the training-set predictions the way we use the test-set predictions. The point is to evaluate model performance for model selection. Once all of that is done well, we can use the predictions on the test set.
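
As a rough sketch of what evaluating for model selection can look like, the snippet below fits two candidate feature sets and reports training and test RMSE for each. Everything here is hypothetical: the synthetic data, the candidate names, and the rmse helper are assumptions for illustration, not part of the mission.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error


def rmse(model, X, y):
    """Root mean squared error of the model's predictions on (X, y)."""
    return float(np.sqrt(mean_squared_error(y, model.predict(X))))


# Synthetic data, assumed for illustration: both features are informative
rng = np.random.default_rng(1)
n = 300
X = rng.normal(size=(n, 2))
y = 4 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 1, size=n)

# Holdout split: first 200 rows for training, remaining 100 for testing
train, test = slice(0, 200), slice(200, n)

# Hypothetical candidate feature sets, given as column indices
candidates = {"feature 0 only": [0], "features 0 and 1": [0, 1]}

results = {}
for name, cols in candidates.items():
    X_train, X_test = X[train][:, cols], X[test][:, cols]
    model = LinearRegression().fit(X_train, y[train])
    # Evaluate on both sets: the training error shows whether the model can fit,
    # the test error shows whether that fit generalizes
    results[name] = (rmse(model, X_train, y[train]), rmse(model, X_test, y[test]))

for name, (train_rmse, test_rmse) in results.items():
    print(f"{name}: train RMSE = {train_rmse:.3f}, test RMSE = {test_rmse:.3f}")
```

Here the training-set predictions are never used as outputs; they only feed the error metric that helps choose between the candidates.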
