I’m working with this dataset on Kaggle and I’ve seen some projects where the target
SalePrice has been transformed with a log transform.
I don’t understand why. Is it necessary?
For example here:
#applying log transformation
df_train['SalePrice'] = np.log(df_train['SalePrice'])
Did you read that notebook?
…Ok, ‘SalePrice’ is not normal. It shows ‘peakedness’, positive skewness and does not follow the diagonal line.
But everything’s not lost. A simple data transformation can solve the problem. This is one of the awesome things you can learn in statistical books: in case of positive skewness, log transformations usually works well…
Sometimes it’s better to transform the data to improve accuracy or to get a better linear fit. I recommend reading that section of the notebook and looking into it in statistics books.
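To make the quoted point about positive skewness concrete, here is a minimal sketch. It uses synthetic data (a lognormal sample standing in for SalePrice, not the real Kaggle column), and the `skewness` helper is a hypothetical function written just for this illustration.

```python
import numpy as np

def skewness(values):
    """Sample skewness: mean of cubed z-scores (about 0 for symmetric data)."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return float(np.mean(z ** 3))

# Synthetic, right-skewed "prices" (assumption: lognormal stand-in, not real data)
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=12.0, sigma=0.4, size=5000)

skew_raw = skewness(prices)          # positive: long right tail
skew_log = skewness(np.log(prices))  # near zero: roughly symmetric

print(f"skewness before log: {skew_raw:.2f}")
print(f"skewness after log:  {skew_log:.2f}")
```

On the real data you would pass `df_train['SalePrice']` instead of the synthetic array. How much the skewness drops depends on the actual distribution.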
Yes, I did.
But should it be done regardless of the algorithm applied, or only if we use linear regression?
Actually, it depends on the distribution of the data. If I remember my statistics/physics lessons correctly, it is not always a good idea to work with left- or right-skewed data, so you normalize it with different methods (min/max scaling, standardization, or a transform to a logarithmic scale) and then use whatever model you want, e.g. linear regression (which, when fitted to a log-transformed target, is really a log-linear regression).
This article explains it better
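As a rough sketch of the "transform, then fit" idea from this post: fit a straight line to the log of the target, then undo the transform with `exp` when predicting. The data is synthetic and the names `area`, `price` and `predict_price` are made up for the example. It uses plain `np.polyfit` rather than any particular modeling library.

```python
import numpy as np

# Synthetic data (assumption): price grows multiplicatively with area
rng = np.random.default_rng(42)
area = rng.uniform(50, 250, size=500)
price = np.exp(10.0 + 0.01 * area + rng.normal(0.0, 0.2, size=500))

# Ordinary least-squares line fitted to log(price), not to price itself
slope, intercept = np.polyfit(area, np.log(price), deg=1)

def predict_price(new_area):
    """Predict on the log scale, then map back to the price scale with exp."""
    log_pred = intercept + slope * np.asarray(new_area, dtype=float)
    return np.exp(log_pred)

pred = predict_price([100, 200])
print(pred)
```

The key detail is the last step: a model trained on `np.log(SalePrice)` outputs log prices, so predictions need `np.exp` applied before they are read as prices.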
Ok thank you! I’ll read that article.
I didn’t know that we can also normalize the output variable.
I thought these methods (rescaling, normalizing…) were only applied to the input variables to prepare them for machine learning.