I just finished another project. I’m quite proud of it because I struggled mightily at the beginning but eventually managed to get it done.
I’m using a dataset I found on Kaggle which contains metadata about 3,000 movies. After data wrangling, EDA and feature engineering, I predict the movies’ revenues using 4 different regression models.
As usual, any feedback is appreciated. In particular, I would like to hear people’s opinions on how to handle the invalid budget values: I chose a drastic approach and dropped the rows that contain them.
I would also like to know whether my choice to drop the outliers in the popularity scores is legitimate.
I tried the opposite approach (filling in missing budgets with the median value and not dropping any outliers); unsurprisingly, the accuracy I obtained was lower.
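To make the two options concrete, here is a minimal pandas sketch of both. It is not the notebook's code: the toy values are invented, the column names `budget` and `popularity` are assumed, an "invalid" budget is assumed to mean missing or not positive, and the 1.5 × IQR rule is just one common way to flag outliers.

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration (not taken from the Kaggle dataset).
movies = pd.DataFrame({
    "budget": [0, 5_000_000, np.nan, 20_000_000,
               150_000_000, 12_000_000, 30_000_000, 8_000_000],
    "popularity": [3.2, 7.9, 5.1, 12.4, 290.0, 6.3, 9.8, 4.4],
})

# Assumption: a budget is invalid if it is missing or not positive.
# NaN > 0 evaluates to False, so missing budgets are flagged as well.
valid_budget = movies["budget"] > 0

# Option A: drop rows with invalid budgets, then drop popularity
# outliers using the 1.5 * IQR rule.
dropped = movies[valid_budget]
q1, q3 = dropped["popularity"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = dropped["popularity"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
dropped = dropped[in_range]

# Option B: keep every row and replace invalid budgets with the
# median of the valid ones; no outliers are removed.
imputed = movies.copy()
median_budget = movies.loc[valid_budget, "budget"].median()
imputed.loc[~valid_budget, "budget"] = median_budget

print(len(movies), len(dropped), len(imputed))
```

Option A ends up with fewer rows than option B, so scores from the two variants are only comparable if they are measured on the same held-out movies.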
At the same time, however, I’m worried that the approach I used (dropping rows with invalid data instead of imputing them, and dropping outliers) exposes my models to overfitting.
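One way to probe this worry is to compare training and validation scores under cross-validation: a large gap between them hints at overfitting. The sketch below uses scikit-learn's `cross_validate` on synthetic data; the features, the target and the random forest are placeholders, not the models or data from the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the cleaned feature matrix and revenue target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 3.0 * X[:, 0] + rng.normal(scale=1.0, size=300)

# Placeholder model; any regressor can be swapped in here.
model = RandomForestRegressor(n_estimators=50, random_state=0)

# 5-fold cross-validation that also records the score on each training fold.
scores = cross_validate(
    model, X, y, cv=5, scoring="r2", return_train_score=True
)

train_r2 = scores["train_score"].mean()
valid_r2 = scores["test_score"].mean()
gap = train_r2 - valid_r2

print(f"train R^2: {train_r2:.3f}")
print(f"validation R^2: {valid_r2:.3f}")
print(f"gap: {gap:.3f}")
```

Running the same check on both cleaning strategies would show whether dropping rows widens the gap compared with imputing.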
Excited to hear what you think.
https://nbviewer.jupyter.org/urls/community.dataquest.io/uploads/short-url/lur8LO7Od2kVq92PR8CadPYGq8t.ipynb (1.7 MB)