Predicting Home Prices in Ames, Iowa - Linear Regression and K Nearest Neighbors

Hi All,
This was, by far, the most intense project I have worked on to date. I enjoyed how real this project felt. I was able to leverage some sector knowledge that I had to perform feature engineering, and do a deeper analysis on geography to get (in my opinion) better representations of the neighborhoods in the dataset.

I am hoping to get some feedback on the methodology I used. I understand that this is a tall order, given how lengthy my notebook is, but I tried to make it as digestible as possible by including visuals and explanatory markdown cells.

I hope you all enjoy reading a bit (or a lot…) about how I approached this problem!

Dataquest Prompt: Predicting Home Prices in Ames, Iowa

PredictingAmesHouseSalePrices.ipynb (1.8 MB)

Click here to view the Jupyter notebook file in a new tab

  1. Wow, I like the geographical approach. I’m almost tempted to try it out and automate it somehow with geopandas. Can you tell me what your average RMSE was without fiddling too much with the neighbourhood data? (I’ve just dummied it and that’s it.) I’m curious whether the percentage you gain from it is worth the effort. (Regardless of the results, it looks amazing.)

  2. Have you tried different outlier removal techniques? (https://towardsdatascience.com/ways-to-detect-and-remove-the-outliers-404d16608dba) I’m in the middle of this project and got the RMSE down a few percentage points by using the IQR score on one column (RMSE from 16% to 13.2% of the average price).

Hi Adam,
Thanks for the feedback! The geographic approach came to me when I started looking at Ames on Google Maps. It was a challenge to identify which neighborhoods were where, but in the end it was a useful exercise, since it allowed me to include some data points (houses in neighborhoods with small sample sizes) that I would otherwise have needed to drop.

That being said, I ran the model with dummy neighborhood columns as you suggested, and here are the results:

| Fit Columns | Model Type | RMSE | StDev |
|---|---|---|---|
| Geo Grouping + All other | KNN | $39,747 | $655 |
| Geo Dummies + All other | KNN | $39,747 | $655 |
| Geo Grouping + All other | LR | $25,032 | $545 |
| Geo Dummies + All other | LR | $24,265 | $599 |

So the answer to your question is that in the KNN model, the grouping I performed seems to have no effect :sweat_smile:. In the LR model, my grouping slightly worsened the RMSE (+3%) and slightly decreased the StDev (-9%) compared with plain neighborhood dummies.
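For anyone who wants to check those percentages, here is a minimal sketch of the arithmetic. It uses only the LR rows of the results table above; the names `results` and `pct_change` are just illustrative and do not come from my notebook.

```python
# RMSE and StDev values copied from the LR rows of the results table.
results = {
    "grouping": {"rmse": 25032, "stdev": 545},
    "dummies": {"rmse": 24265, "stdev": 599},
}


def pct_change(new, baseline):
    """Relative change of `new` versus `baseline`, in percent."""
    return (new - baseline) / baseline * 100


# Compare the geo-grouping run against the plain-dummies baseline.
rmse_change = pct_change(results["grouping"]["rmse"], results["dummies"]["rmse"])
stdev_change = pct_change(results["grouping"]["stdev"], results["dummies"]["stdev"])

print(f"RMSE change:  {rmse_change:+.1f}%")
print(f"StDev change: {stdev_change:+.1f}%")
```

A positive RMSE change means the grouped version predicts slightly worse on average, while the negative StDev change means its error varies a little less from run to run.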

Seems to me like it might not have produced as much of an impact as I would have hoped!
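To make the comparison concrete, here is a rough sketch of the two encodings, assuming pandas. The `area_map` below is a made-up placeholder and not the grouping from my notebook; the area labels do not reflect the real geography of Ames, and the five-row frame is toy data.

```python
import pandas as pd

# Toy data: a handful of neighborhood codes, one of them repeated.
homes = pd.DataFrame(
    {"Neighborhood": ["NAmes", "CollgCr", "Blueste", "NAmes", "Greens"]}
)

# Option 1: plain dummies, one column per neighborhood.
dummies = pd.get_dummies(homes["Neighborhood"], prefix="nbhd")

# Option 2: map each neighborhood to a broader area first, then dummy the area.
# NOTE: hypothetical mapping for illustration only.
area_map = {
    "NAmes": "area_a",
    "CollgCr": "area_b",
    "Blueste": "area_b",
    "Greens": "area_a",
}
homes["Area"] = homes["Neighborhood"].map(area_map)
grouped = pd.get_dummies(homes["Area"], prefix="geo")

print(dummies.shape)  # one column per distinct neighborhood
print(grouped.shape)  # one column per distinct area
```

Grouping leaves fewer columns and lets small neighborhoods share a column with their neighbors, which is why I could keep houses I would otherwise have dropped.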


For your second point, I have not tried different outlier removal techniques yet, but I would like to! Thanks so much for passing this article along. I think I’m going to try the interquartile range (IQR) method in the future!
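As a starting point, this is roughly how I understand IQR filtering on a single column, sketched with pandas. The helper `drop_iqr_outliers`, the column names, and the numbers are all hypothetical, and the 1.5 multiplier is just the conventional default rather than something tuned for this dataset.

```python
import pandas as pd


def drop_iqr_outliers(df, column, k=1.5):
    """Keep rows whose `column` value lies within k * IQR of the quartiles."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # between() is inclusive on both ends by default.
    return df[df[column].between(lower, upper)]


# Toy data with one implausibly large living area.
homes = pd.DataFrame(
    {
        "living_area": [900, 1100, 1250, 1400, 1500, 1650, 1800, 5600],
        "sale_price": [95000, 120000, 135000, 150000, 160000, 178000, 199000, 185000],
    }
)

filtered = drop_iqr_outliers(homes, "living_area")
print(len(homes), "rows before,", len(filtered), "rows after")
```

To avoid leaking information, the quartiles should probably be computed on the training folds only and then applied to the test fold.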


Righto, thanks for the numbers! I’ll focus on feature engineering then. If you’re still keen on dropping the RMSE on this project, this is the best guidance I found:

The (former) top Kaggler gives very basic, simple advice on which way to go for better results. It has worked great on my project so far.