Log transform - Predicting House Sale Prices

Hi,
I’m working with this dataset on Kaggle and I’ve seen some projects where the target SalePrice has been log-transformed.

I don’t understand why. Is it necessary?

For example here:

#applying log transformation
df_train['SalePrice'] = np.log(df_train['SalePrice'])

https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python#5.-Getting-hard-core

2 Likes

Did you read that notebook?

…Ok, ‘SalePrice’ is not normal. It shows ‘peakedness’, positive skewness and does not follow the diagonal line.
But everything’s not lost. A simple data transformation can solve the problem. This is one of the awesome things you can learn in statistical books: in case of positive skewness, log transformations usually works well…

Sometimes it’s better to transform the data to improve accuracy or to get a better linear fit. I recommend rereading that section of the notebook and looking the topic up in a statistics book.
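
To make the skewness point concrete, here is a minimal sketch. It does not use the Kaggle data: the "prices" are synthetic log-normal values (an assumption, chosen only because they are right-skewed by construction), and `skewness` is a small helper written for this example using only the standard library.

```python
import math
import random


def skewness(values):
    """Skewness as the third central moment divided by the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Synthetic, made-up "prices": log-normal, so positively skewed by construction.
rng = random.Random(0)
prices = [rng.lognormvariate(12.0, 0.4) for _ in range(5000)]

# Same idea as np.log(df_train['SalePrice']), applied to a plain list.
log_prices = [math.log(p) for p in prices]

skew_raw = skewness(prices)
skew_log = skewness(log_prices)
print(f"skewness of raw prices: {skew_raw:.2f}")
print(f"skewness of log prices: {skew_log:.2f}")
```

The raw values should show clearly positive skewness, while the logged values should come out close to zero. Real sale prices will not be exactly log-normal, so expect the effect to be less clean on the actual dataset.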

1 Like

Yes I did.

But should it be done regardless of the algorithm applied, or only if we use linear regression?

Actually, it depends on the distribution of the data. If I remember my statistics/physics lessons correctly, it is not always a good idea to work with left- or right-skewed data, so you transform it first, for example to a logarithmic scale, and then fit the model you want, e.g. linear regression (which, with a log-transformed target, is really a log-linear model: it predicts log(SalePrice), so predictions must be passed through exp to get prices back). Min/max scaling and standardization are also common preprocessing steps, but they only rescale the data and do not remove skew.

This article explains it better
https://towardsdatascience.com/skewed-data-a-problem-to-your-statistical-model-9a6b5bb74e37
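
A rough sketch of what "transform the target, then fit a linear regression" looks like end to end. Everything here is hypothetical: the `area` and `price` values are generated, not taken from the Kaggle dataset, and `fit_line` is a hand-written single-feature least-squares helper so the example runs with the standard library only.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x with a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


rng = random.Random(1)

# Hypothetical data: living area in square feet, and a price whose noise is
# multiplicative, so log(price) is linear in area.
area = [rng.uniform(500, 4000) for _ in range(2000)]
price = [math.exp(11.0 + 0.0004 * x + rng.gauss(0, 0.2)) for x in area]

# Fit on the log of the target instead of the raw target.
a, b = fit_line(area, [math.log(p) for p in price])


def predict_price(x):
    # The model predicts log(price); exp() converts back to the original units.
    return math.exp(a + b * x)


print(f"intercept={a:.3f}, slope={b:.6f}")
print(f"predicted price for 2000 sq ft: {predict_price(2000):,.0f}")
```

The step that is easy to forget is the last one: a model trained on log(SalePrice) outputs log prices, so they need to go through the exponential before being reported or submitted as prices.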

2 Likes

Ok thank you! I’ll read that article.

I didn’t know that we can also normalize the output variable, so my head 🤯.

I thought these methods (rescaling, normalizing…) were only applied to the input variables to prepare them for machine learning.

2 Likes