Hi All,
This was, by far, the most intense project I have worked on to date. I enjoyed how real this project felt. I was able to leverage some sector knowledge that I had to perform feature engineering, and do a deeper analysis on geography to get (in my opinion) better representations of the neighborhoods in the dataset.
I am hoping to get some feedback on the methodology I used. I understand that this is a tall order, given how lengthy my notebook is, but I tried to make it as digestible as possible by including visuals and explanatory markdown cells.
I hope you all enjoy reading a bit (or a lot…) about how I approached this problem!
Dataquest Prompt: Predicting Home Prices in Ames, Iowa
PredictingAmesHouseSalePrices.ipynb (1.8 MB)
Hi Adam,
Thanks for the feedback! The geographic approach came to me when I started looking at Ames on Google Maps. It was a challenge to identify which neighborhoods were where, but in the end it was a useful exercise, since it allowed me to include some data points (houses in neighborhoods with small sample sizes) that I would otherwise have needed to drop.
That being said, I ran the model with dummy neighborhood columns as you suggested, and here are the results:
| Fit Columns | Model Type | RMSE | StDev |
|---|---|---|---|
| Geo Grouping + All other | KNN | $39,747 | $655 |
| Geo Dummies + All other | KNN | $39,747 | $655 |
| Geo Grouping + All other | LR | $25,032 | $545 |
| Geo Dummies + All other | LR | $24,265 | $599 |
So the answer to your question is that in the KNN model, the grouping I performed seems to have no effect. In the LR model, compared with the dummy columns, the grouping I performed slightly worsened the RMSE (+3%) and slightly decreased the StDev (-9%).
Seems to me like it might not have produced as much of an impact as I would have hoped!
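For anyone who wants to check the percentages, here is a minimal sketch of the arithmetic. It assumes the dummy-column run is the baseline; the helper name `pct_change` is made up for illustration and is not from the notebook.

```python
def pct_change(new, baseline):
    """Relative change of `new` versus `baseline`, in percent."""
    return (new - baseline) / baseline * 100


# LR results from the table above (grouping vs. dummies).
rmse_grouping, rmse_dummies = 25_032, 24_265
std_grouping, std_dummies = 545, 599

# Baseline is assumed to be the dummy-column run.
rmse_delta = pct_change(rmse_grouping, rmse_dummies)
std_delta = pct_change(std_grouping, std_dummies)

print(f"RMSE:  {rmse_delta:+.1f}%")   # +3.2%
print(f"StDev: {std_delta:+.1f}%")    # -9.0%
```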
For your second point, I have not tried different outlier techniques before, but I would like to! Thanks so much for passing this article along. I think I'm going to try the interquartile range (IQR) method in the future!
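As a rough idea of what an interquartile-range filter could look like, here is a minimal sketch using only the standard library. The prices are made-up illustrative values, not rows from the Ames dataset, and the 1.5 multiplier is just the common rule of thumb, not something taken from the article or the notebook.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (low, high) fences at k * IQR beyond the first and third quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical sale prices, for illustration only.
prices = [120_000, 135_000, 150_000, 160_000,
          175_000, 190_000, 210_000, 755_000]

low, high = iqr_bounds(prices)

# Keep only the prices that fall inside the fences.
kept = [p for p in prices if low <= p <= high]
print(kept)
```

With these sample values the single very expensive house falls outside the upper fence and is dropped. On the real data the same fences could be computed per column before fitting.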
Righto, thanks for the numbers! I'll focus on feature engineering then. If you're still keen on dropping the RMSE on this project, this is the best guidance I found:
The (former) top Kaggler gives very basic, simple advice on which way to go for better results. It has worked great on my project so far.