Some confusion about training and test data

Hi, I have some confusion about training and test data. Most people recommend not using the test data during the model-building process, and only using it at the very end, once all the other steps are done, to measure final performance (for example, the test MSE and RMSE for a linear regression problem).

But when I do exercises or build projects on the Dataquest platform, we often use the test set extensively, alongside the training set, to evaluate performance.

So I am confused. What should I do? Can you help me clear this up?

Another thing I noticed: when I get data from Kaggle for projects, the test data set doesn't have the target values, which makes sense because those are what they want us to predict. So the only way I can evaluate the model's performance is with the training set. As a result, I am never sure how to use the training set and the test set.

Please help me solve this problem. Thanks for reading.

Yes, you can have endless levels of nesting, one for each set of parameters you want to test. Within a level, the same row of data can be part of the training set in one experiment and the testing set in another (cross-validation, out-of-bag estimation). A training or testing set can contain as little as 1 row.
Here is a summary of the splitting methods available:
https://scikit-learn.org/stable/modules/cross_validation.html
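To make the "same row can be in train in one experiment and in test in another" point concrete, here is a minimal sketch of k-fold splitting in plain Python. The helper name `k_fold_indices` is made up for this illustration and is not a scikit-learn function; in practice you would probably use the splitters described at the link above.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds.

    Hypothetical helper for illustration only.
    """
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n_rows) if i < start or i >= stop]
        yield train_idx, test_idx
        start = stop


# 6 rows, 3 folds: every row is in the test set exactly once
# and in the training set in the other two experiments.
folds = list(k_fold_indices(n_rows=6, k=3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)

# Extreme case: as many folds as rows, so each test set holds a single row.
loo_folds = list(k_fold_indices(n_rows=6, k=6))
```

With k equal to the number of rows, each test set contains 1 row, which is the "as little as 1 row" case mentioned above.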

First, set aside the idea that data can only be split into 2 parts called training and testing.
It can be split as many times as you like. You can split into train and test, then take the train part and split it again into train and test within train. You can repeat this indefinitely, but people usually nest only 1 level deep. The inner train-test split is for finding hyperparameters and checking which ones are best. The outer train-test split is for fitting the model weights: you fit them on the outer train set using the best hyperparameters found in the inner split, then generate predictions to evaluate on the outer test set.
Yes, Kaggle expects you to split their train set one more level into train and test.
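The two levels of splitting can be sketched roughly as follows, in plain Python. Everything here is an assumption made for illustration: the data is synthetic, the model is a one-feature ridge regression `y = w * x`, the penalty `lam` stands in for "the hyperparameter", and the 80/20 and 60/20 split sizes are arbitrary. For a Kaggle project, `data` would be their labelled train file.

```python
import random


def fit_ridge(rows, lam):
    """Closed-form weight for y = w * x with an L2 penalty lam."""
    sxy = sum(x * y for x, y in rows)
    sxx = sum(x * x for x, _ in rows)
    return sxy / (sxx + lam)


def mse(rows, w):
    """Mean squared error of the predictions w * x on rows."""
    return sum((y - w * x) ** 2 for x, y in rows) / len(rows)


# Synthetic labelled data (assumption): y = 3x plus noise.
rng = random.Random(0)
data = []
for _ in range(100):
    x = rng.uniform(0, 10)
    data.append((x, 3.0 * x + rng.gauss(0, 1)))
rng.shuffle(data)

# Outer split: the outer test set is put away until the very end.
outer_train, outer_test = data[:80], data[80:]

# Inner split, made from the outer train set only.
inner_train, inner_val = outer_train[:60], outer_train[60:]

# Inner level: pick the hyperparameter with the lowest validation MSE.
candidate_lams = [0.0, 1.0, 10.0, 100.0]
val_scores = {lam: mse(inner_val, fit_ridge(inner_train, lam))
              for lam in candidate_lams}
best_lam = min(val_scores, key=val_scores.get)

# Outer level: refit the weight on all of outer_train with the chosen
# hyperparameter, then evaluate once on the untouched outer test set.
final_w = fit_ridge(outer_train, best_lam)
test_mse = mse(outer_test, final_w)
test_rmse = test_mse ** 0.5
print("best lam:", best_lam, "test MSE:", test_mse, "test RMSE:", test_rmse)
```

The outer test rows never influence the choice of `best_lam`, which is why the final MSE and RMSE can be reported as an honest estimate. The "test set" used repeatedly during exercises plays the role of `inner_val` here.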

Thanks, hanqi, for clarifying things. Hope you have a good day.