Predicting House Sale Price With Messy Data

I enjoyed, to some extent, developing various predictor equations for house sale price using ‘messy data’. It reminds me of the three-volume book series ‘Analysis of Messy Data’, which I may need to study.
All feedback is certainly welcome!

Predicting House Sale Prices.ipynb (326.4 KB)


Hi Bruce,

It’s always a pleasure for me to review your interesting and detailed work :fire: Like your previous projects, this one is perfectly structured and has clear goals, all the necessary links, background information (this time, a cool overview of data leakage and collinearity), interesting and coherent observations, clean and highly readable visualizations, and a great cover picture. The data analysis is also very thorough, and your learning pace is fast.

Below are some comments from my side, mostly about minor technical details:

  • It’s better to put column names in backticks when mentioned in markdown.
  • Avoid overly wordy or self-evident comments (# import a whole pile of key python libary, modules to execute various code commands. # use seaborn library to create boxplot, # adding a constant). In general, however, your code commenting is great and very informative.
  • Importing the libraries. These 2 lines:
from IPython.display import HTML
from IPython.display import display, Markdown

can be combined:

from IPython.display import HTML, display, Markdown

Also, throughout the project, I noticed that you imported the same libraries several times. For example, in code cell [7] you import matplotlib and seaborn again, and starting from code cell [20] you have many duplicated imports, most probably left over from copy-pasting. To avoid this, a good practice is to import all the libraries in the first code cell.

  • Be careful of typos.
  • The code cells [3], [6], [17], and [18]: you can add informative subheadings to the printed outputs and probably separate them in some way (e.g. an empty line).
  • It’s better to avoid naming dataframes or any other variables df, df2, etc. A meaningful, descriptive name is a better choice and helps avoid confusion in the future.
  • The code cells [5] and [14]: it’s better to add a semicolon (;) at the end of the last line of code to avoid unnecessary outputs.
  • The code cells [18] and [21]. Here I would add more code comments. Also, you might consider creating a function for the code in [18] to avoid code repetition.
  • You repeated the section about data leakage by mistake: it appears after both code cells [4] and [24]. It’s probably better to remove the first occurrence and keep the second.
  • Conclusion: here, you might consider focusing not on the working process and the issues encountered, but rather on the insights obtained throughout the project. Your insights are really cool and meaningful, so it would be great to summarize the main points concisely in the conclusion.
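
To illustrate the points above about subheadings for printed outputs, descriptive variable names, and wrapping repeated code in a function, here is a minimal sketch. The names format_section, print_section, and example_counts are made up for illustration and are not taken from your notebook, and the values are placeholders rather than real results.

```python
def format_section(title, content):
    """Return `content` under an underlined subheading, ending with a blank line."""
    underline = "-" * len(title)
    return f"{title}\n{underline}\n{content}\n"


def print_section(title, content):
    """Print one labeled output block; reusable for every summary in a cell."""
    print(format_section(title, content))


# Placeholder values standing in for whatever a cell currently prints
# (for example, a per-column summary); a descriptive name replaces df2-style names.
example_counts = {"column_a": 3, "column_b": 0}

print_section("Missing values per column", example_counts)
print_section("Number of rows", 100)
```

Each call prints a subheading, an underline, the content, and an empty line, so consecutive outputs in the same cell stay visually separated without repeating the formatting code.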

Hope my suggestions were useful. Great job, Bruce, thanks for sharing! :heavy_heart_exclamation: And good luck with your future projects!

Thank you Elena! Great, great feedback and suggestions!!
Much appreciated!
Best regards,
