Predicting House Prices with Linear Regression

I’d appreciate feedback on how to improve the model. Specifically:

  • I’ve seen projects with simpler features achieve a better RMSE. Did I overcomplicate things, or is the difference a result of feature engineering and data cleaning? Are there any obvious things I’m missing?
  • How should the trade-off between RMSE and STD (standard deviation) be interpreted? Should I target the intersection when optimizing?
  • When performing k-fold cross-validation with linear regression, how do I extract the predictions for each row? I’d like to see which rows have the worst predictions and check whether there are trends.
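
For the last question, one possible approach is scikit-learn's `cross_val_predict`, which returns an out-of-fold prediction for every row. The sketch below is only an illustration: the helper `worst_predictions`, the column names `area` and `price`, and the synthetic data are all hypothetical and do not come from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict


def worst_predictions(df, features, target, k=4, n_worst=10, seed=1):
    """Return the n_worst rows ranked by absolute out-of-fold error."""
    kf = KFold(n_splits=k, shuffle=True, random_state=seed)
    # Each row is predicted by a model trained on the folds that exclude it.
    preds = cross_val_predict(LinearRegression(), df[features], df[target], cv=kf)
    out = df.copy()
    out["predicted"] = preds
    out["abs_error"] = (out[target] - out["predicted"]).abs()
    return out.sort_values("abs_error", ascending=False).head(n_worst)


# Synthetic stand-in data; the column names are placeholders, not the notebook's.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"area": rng.uniform(500, 3000, 200)})
demo["price"] = 100 * demo["area"] + rng.normal(0, 5000, 200)
demo.loc[5, "price"] += 200_000  # plant one obvious outlier

worst = worst_predictions(demo, ["area"], "price", k=4, n_worst=5)
print(worst)
```

Sorting by `abs_error` puts the hardest rows at the top, so they can be inspected alongside their original feature values to look for trends.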

Predicting House Sale Prices.ipynb (959.2 KB)


Hi @kevindarley2024

Sorry to disappoint you, but this is not feedback yet :frowning: The project is just too big. I tried to finish going through it before posting something meaningful, and I am still working through it. I will follow up in a few days.

I have these questions for now:

  • Why does your title say “car” instead of home/house?
  • Have you investigated the rows where the difference between the Garage Yr Blt and Year Built columns is negative? Or am I reading the information wrong? :thinking: :frowning_face:
  • This may or may not be an advanced topic at this stage, but considering the column Central Air, have you come across the “class imbalance” concept?
    If you have, or if you plan to park this topic for later steps in the learning path, please ignore it. We can discuss it in the future.
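
Both checks above can be sketched in a few lines of pandas. The data below is a tiny made-up sample that only mimics the column names `Year Built`, `Garage Yr Blt` and `Central Air`; it is not taken from the notebook, and the variable names are placeholders.

```python
import pandas as pd

# Tiny made-up sample that only mimics the column names in the dataset.
homes = pd.DataFrame({
    "Year Built": [1960, 2005, 1978, 1999],
    "Garage Yr Blt": [1965, 2003, None, 1999],
    "Central Air": ["Y", "Y", "Y", "N"],
})

# A negative difference means the garage is recorded as older than the house.
diff = homes["Garage Yr Blt"] - homes["Year Built"]
garage_older = homes[diff < 0]
print(garage_older)

# Share of each category; a heavily lopsided split hints at an imbalanced column.
central_air_share = homes["Central Air"].value_counts(normalize=True)
print(central_air_share)
```

Rows with a missing garage year produce a missing difference and are left out of `garage_older`, so houses without a garage are not flagged by mistake.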

I guess I will keep bothering you on this post as I make progress through your project.


Hi Rucha,

Thank you for your time! I know it’s a big project and any time you give is really appreciated.

I went ahead and changed the title to “house”; “car” was a mistake.

The Garage Yr Blt and Year Built columns represent the year in which the garage was constructed and the year in which the house was constructed, respectively. A garage can be older than the house it is associated with, but I imagine that is rare. One case might be an old house that is torn down while the garage is kept and a new house is built on the property. When I went through the dataset, I typically saw that the garage, if there was one, was added after the house. When cleaning the year columns, I looked mainly for outliers or impossibilities (such as a build year in the future).

I am not familiar with “class imbalance” in those terms, but the lessons leading up to the project taught us about building a variance tolerance into feature selection. The idea is that if a feature is heavily skewed toward one of its values, it likely won’t contribute much to the model. I built this into the final model as the variable var_tol, so you can test how changing the tolerance affects the model’s performance. I briefly read up on class imbalance and thought the concepts were the same. Would you agree?
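
As a rough illustration of that kind of filter, here is a minimal sketch. It assumes the tolerance is the maximum share of rows that a single value may occupy; the notebook's actual var_tol may be defined differently. The function `drop_dominated_columns`, the parameter `max_share` and the sample data are hypothetical.

```python
import pandas as pd


def drop_dominated_columns(df, columns, max_share=0.95):
    """Drop columns whose most common value covers more than max_share of rows.

    max_share is a hypothetical stand-in for the notebook's var_tol.
    """
    to_drop = [
        col for col in columns
        # value_counts sorts descending, so the first entry is the top share.
        if df[col].value_counts(normalize=True, dropna=False).iloc[0] > max_share
    ]
    return df.drop(columns=to_drop), to_drop


# Made-up data: "Central Air" is almost all "Y", "Lot Shape" is balanced.
sample = pd.DataFrame({
    "Central Air": ["Y"] * 97 + ["N"] * 3,
    "Lot Shape": ["Reg"] * 50 + ["IR1"] * 50,
})
filtered, dropped = drop_dominated_columns(
    sample, ["Central Air", "Lot Shape"], max_share=0.95
)
print(dropped)
```

For what it's worth, “class imbalance” usually describes an uneven distribution of the target classes in a classification problem, while a filter like this one acts on skewed input features, so the two ideas are related but not identical.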

Please keep bothering me about this post :). Thank you!