Questions about the solution for Predicting Car Prices Guided Project

Screen Link: Learn data science with Python and R projects

I have several questions about the Predicting Car Prices Guided Project. Hope it is okay to ask them all in one post.

  1. Why was min-max normalization used instead of the z-score normalization we used in the lesson?
  2. Why was the train/test split 50-50 when using simple train/test validation (without cross-validation)?
  3. Why was the decision made to keep the normalized-losses column even though it was missing 20% of its data? Is there a general threshold to use?

Hello @erath.kj

If you want a uniform scale between 0 and 1, you use min-max normalization: every column ends up with values between 0 and 1. Because each column's largest and smallest values define the scale, extreme values feed directly into the normalization, and that is the risk.
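
For illustration, here is a minimal pandas sketch of min-max normalization; the `numeric_cars` frame is a made-up stand-in for the project's numeric car data:

```python
import pandas as pd

# Made-up stand-in for the project's numeric car data
numeric_cars = pd.DataFrame({"horsepower": [48.0, 111.0, 262.0],
                             "price": [5118.0, 13950.0, 45400.0]})

# Min-max normalization: rescale each column to [0, 1].
# The column's own min and max set the scale, so a single extreme value
# stretches the range and squeezes every other row toward one end.
normalized = (numeric_cars - numeric_cars.min()) / (numeric_cars.max() - numeric_cars.min())
print(normalized)
```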

With z-score standardization, the data is rescaled to have mean 0 and standard deviation 1. When you have outliers, they get large positive or negative z values, but unlike min-max, they are not used to set the bounds of the scale. Therefore, z-score standardization is less sensitive to outliers.
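
And the equivalent sketch for z-score standardization, using the same made-up frame:

```python
import pandas as pd

# Same made-up stand-in as above
numeric_cars = pd.DataFrame({"horsepower": [48.0, 111.0, 262.0],
                             "price": [5118.0, 13950.0, 45400.0]})

# Z-score standardization: center each column at mean 0 with standard
# deviation 1. An outlier gets a large +/- z value, but it does not set
# the bounds of the scale the way the min and max do.
standardized = (numeric_cars - numeric_cars.mean()) / numeric_cars.std()
print(standardized)
```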

In my opinion, you choose the train/test split depending on the amount of data you have. If you have enough data, you can go as far as a 50/50 split. If data is limited, you may want to use cross-validation instead, since every row then gets used for both training and testing.
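
As a sketch of the two options in scikit-learn; the `features` and `target` arrays here are synthetic placeholders, and the k-nearest neighbors model mirrors what this project uses:

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic placeholder data
rng = np.random.default_rng(1)
features = rng.normal(size=(200, 4))
target = features @ np.array([3.0, -1.0, 2.0, 0.5]) + rng.normal(size=200)

# Simple holdout validation with a 50/50 split
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.5, random_state=1)
model = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)
print("holdout R^2:", model.score(X_test, y_test))

# 5-fold cross-validation: every row is used for both training and testing
scores = cross_val_score(KNeighborsRegressor(n_neighbors=5), features, target, cv=5)
print("mean CV R^2:", scores.mean())
```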

After changing the datatype to float, the missing values in each column were replaced by the column's mean. For some people, 20% missing data is below their threshold for dropping a column; for others, the column is kept because it is an important feature.
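
For example, a sketch of that cleaning step, assuming the raw column marks missing values with '?' the way the original dataset does:

```python
import pandas as pd

# Hypothetical slice of the data: '?' marks a missing value
cars = pd.DataFrame({"normalized-losses": ["164", "?", "128", "?", "110"]})

# Convert to float ('?' becomes NaN), then replace NaN with the column mean
col = pd.to_numeric(cars["normalized-losses"], errors="coerce")
cars["normalized-losses"] = col.fillna(col.mean())
print(cars)
```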

I understand what min-max and z-score normalization are, but I was trying to understand why one was chosen over the other. What about the Airbnb data in the lesson made it okay to risk using extreme values? What about the car data made us want normalization that is less sensitive to outliers? I’m trying to figure out, if I were working on my own project, how I would decide when I need to minimize the effect of outliers and when I would want to include their weight in the normalization.

With the train/test split, I thought a 50/50 split was normally used with cross-validation, but that with simple train/test validation it was better to use a larger portion of your data for training and leave a smaller portion for testing; I thought a 75/25 split was more appropriate there. Maybe it doesn’t really matter, I just want to know what best practice is.

I wish there were more markdown in the solution notebook explaining the ‘whys’ behind the decisions, because that would help us later when we are making decisions on our own projects. Even if there is no single ‘right’ or ‘best’ way, just understanding the thought process would help.
