Do we need to scale y, the target variable?

Screen Link: https://app.dataquest.io/m/155/guided-project%3A-predicting-car-prices/3/univariate-model

Solution Code: 

# Normalize all columns to range from 0 to 1 except the target column.
price_col = numeric_cars['price']
numeric_cars = (numeric_cars - numeric_cars.min())/(numeric_cars.max() - numeric_cars.min())
numeric_cars['price'] = price_col

My understanding is that the solution normalizes every column but first saves a copy of the original target column (price_col) and restores it afterwards, so the target variable is not normalized. I checked some other ML resources: some people normalize the target column and some don’t. Even in the link below, people debate whether it is necessary to scale the output y. What is your opinion?

https://stats.stackexchange.com/questions/111467/is-it-necessary-to-scale-the-target-value-in-addition-to-scaling-features-for-re#:~:text=Yes%2C%20you%20do%20need%20to,making%20the%20learning%20process%20unstable.

Does the following look OK to you as a way to do scaling in ML?

X = preprocessing.scale(X)
y = preprocessing.scale(y)


Good question.

As far as I currently understand, this depends on the approach you use to fit the model.

For KNN, it’s not necessary. KNN uses the distance between data points as its metric, so normalizing the independent variables (the rest of the columns) makes sense to keep any one variable from dominating the others. The target variable doesn’t enter the distance calculation, so scaling it doesn’t change which neighbors are found.
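To make that concrete, here is a minimal sketch using plain NumPy rather than the project's code. The data is synthetic, and `knn_predict` is a hypothetical helper I wrote for illustration, not something from the Guided Project. It predicts with the raw target and with a min-max scaled target, then maps the scaled predictions back to the original units.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=3):
    """Predict each test row as the mean target of its k nearest training rows."""
    # Pairwise Euclidean distances, shape (n_test, n_train).
    dists = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    # Indices of the k closest training rows for every test row.
    nearest = np.argsort(dists, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

# Synthetic data (an assumption for illustration, not the cars dataset).
rng = np.random.default_rng(0)
X_train = rng.random((50, 2))              # features already in [0, 1]
y_train = 5000 + 40000 * rng.random(50)    # made-up "prices"
X_test = rng.random((10, 2))

# Min-max scale the target only.
y_min, y_max = y_train.min(), y_train.max()
y_scaled = (y_train - y_min) / (y_max - y_min)

pred_raw = knn_predict(X_train, y_train, X_test)
pred_scaled = knn_predict(X_train, y_scaled, X_test)

# Undo the scaling on the predictions made from the scaled target.
pred_back = pred_scaled * (y_max - y_min) + y_min

same_predictions = np.allclose(pred_raw, pred_back)
print(same_predictions)
```

The neighbors are chosen from the features alone, and averaging is linear, so the two sets of predictions should agree once the scaling is undone.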

However, it can have an impact on how you interpret the results.

Take the Guided Project.

If you don’t normalize the target variable, the highest RMSE values are:

peak-rpm 7697.459696
stroke 8006.529545
height 8144.441043

If you do normalize the target, you get:

peak-rpm 0.191089
stroke 0.198762
height 0.202186

Which one is easier for you to interpret? The first set is in the original price units, while the second is relative to the scaled target. Depending on the target variable, that can matter.
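As a quick check on the numbers above, here is a small sketch that divides each unscaled RMSE by its scaled counterpart. The values are copied from this post. The interpretation is my assumption: if the target was min-max scaled, this ratio would be the price range (max minus min), which I have not verified against the dataset.

```python
# RMSE values copied from the two lists above.
rmse_raw = {"peak-rpm": 7697.459696, "stroke": 8006.529545, "height": 8144.441043}
rmse_scaled = {"peak-rpm": 0.191089, "stroke": 0.198762, "height": 0.202186}

# Ratio of unscaled to scaled RMSE for each feature.
ratios = {col: rmse_raw[col] / rmse_scaled[col] for col in rmse_raw}

# Ranking of features by RMSE under each version of the target.
rank_raw = sorted(rmse_raw, key=rmse_raw.get)
rank_scaled = sorted(rmse_scaled, key=rmse_scaled.get)

for col, ratio in ratios.items():
    print(col, round(ratio, 1))
print(rank_raw == rank_scaled)
```

The ratios come out nearly constant (roughly 40,282), and the ordering of the features is the same in both lists. So the two versions carry the same information; only the units differ.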

For some other approaches it can matter more. In neural networks, for example, the scale of the target can affect training. The answers in the post you shared that argue for normalizing the target column are, broadly speaking, more relevant to neural networks.
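If you do scale y, as in the `preprocessing.scale(y)` line from the question, it helps to keep the mean and standard deviation so predictions can be converted back to prices. A minimal sketch with NumPy follows. The prices and the "predictions" are made-up values for illustration only, and the manual standardization mirrors what `preprocessing.scale` does (zero mean, unit variance).

```python
import numpy as np

# Made-up prices for illustration (not taken from the dataset).
y = np.array([5118.0, 13495.0, 16500.0, 45400.0])

# Standardize by hand and keep the parameters needed to invert later.
y_mean, y_std = y.mean(), y.std()   # np.std defaults to the population std
y_scaled = (y - y_mean) / y_std

# Stand-in for model output in scaled units (hypothetical, no model is fit).
pred_scaled = y_scaled + 0.1

# Map predictions back to price units before reporting errors.
pred_price = pred_scaled * y_std + y_mean

rmse_scaled = np.sqrt(np.mean((pred_scaled - y_scaled) ** 2))
rmse_price = np.sqrt(np.mean((pred_price - y) ** 2))
print(rmse_scaled, rmse_price)
```

In scikit-learn, `StandardScaler` keeps these parameters for you and offers `inverse_transform`, which a one-off call to `preprocessing.scale` does not.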


Thank you so much Dr. Well explained.