I just finished another project. I’m quite proud of it because I was struggling mightily at the beginning but managed to eventually get it done.
I’m using a dataset I found on Kaggle that contains metadata about 3,000 movies. After data wrangling, EDA and feature engineering, I predict the movies’ revenues using four different regression models.
As usual, any feedback is appreciated. In particular, I would like to hear opinions on how to handle the invalid budget values: I chose a drastic approach and dropped the rows that contain them.
I would also like to know whether my choice to drop the outliers in the popularity scores is legitimate.
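For readers who want something concrete to comment on, here is a minimal sketch of one common way to flag outliers, the 1.5 × IQR rule. It is not necessarily the rule used in the project, and the popularity values below are made up for illustration.

```python
import pandas as pd

# Made-up popularity scores; the real dataset will look different.
popularity = pd.Series([1.4, 3.2, 6.7, 8.1, 12.5, 250.0], name="popularity")


def iqr_inlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return values.between(q1 - k * iqr, q3 + k * iqr)


mask = iqr_inlier_mask(popularity)
kept = popularity[mask]        # values treated as ordinary
removed = popularity[~mask]    # values flagged as outliers
print(f"kept {len(kept)} values, flagged {len(removed)} as outliers")
```

The multiplier `k` is a judgment call: a larger value flags fewer points, so it is worth checking how sensitive the results are to it.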
I also tried the opposite actions (filling in the missing budgets with the median value and not dropping any outliers); unsurprisingly, the accuracy I obtained was lower.
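To make the two budget treatments concrete, here is a minimal sketch on made-up data. It is not the project's actual code: the column names (`budget`, `revenue`) and the assumption that a budget of 0 or NaN counts as invalid are mine, for illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the movie metadata; all values are invented.
movies = pd.DataFrame({
    "budget":  [0, 5_000_000, 20_000_000, 0, 80_000_000, 1_000_000],
    "revenue": [1_000_000, 12_000_000, 65_000_000, 300_000,
                400_000_000, 2_500_000],
})

# Assumption: a budget of 0 (or NaN) means the value is invalid.
budget = movies["budget"].replace(0, np.nan)

# Option A: drop every row whose budget is invalid.
dropped = movies[budget.notna()]

# Option B: keep every row and fill invalid budgets with the median
# of the valid ones.
imputed = movies.assign(budget=budget.fillna(budget.median()))

print(len(dropped), "rows after dropping;", len(imputed), "rows after imputing")
```

Option A trains on fewer but cleaner rows, while Option B keeps every row at the cost of assigning the same budget to all the movies whose budget was invalid.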
At the same time, however, I’m worried that the approach I used (dropping rows with invalid data instead of replacing the missing values, and dropping outliers) exposes my models to overfitting.
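One way to probe the overfitting worry is to compare training scores with cross-validated scores. The sketch below does this on synthetic data with scikit-learn. The data, the linear model, and the 20% share of invalid budgets are all assumptions for illustration, not the project's setup. Placing the imputer inside the pipeline means the median is computed from the training folds only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n = 200

# Synthetic budget/revenue pairs (invented relationship plus noise).
budget = rng.uniform(1e6, 1e8, size=n)
revenue = 2.5 * budget + rng.normal(0, 2e7, size=n)

# Mark roughly 20% of the budgets as invalid (NaN).
X = budget.reshape(-1, 1).copy()
X[rng.random(n) < 0.2, 0] = np.nan

# Imputation happens inside each fold, so no information leaks
# from the validation fold into the median.
model = make_pipeline(SimpleImputer(strategy="median"), LinearRegression())
scores = cross_validate(
    model, X, revenue, cv=5, scoring="r2", return_train_score=True
)

train_r2 = scores["train_score"].mean()
test_r2 = scores["test_score"].mean()
gap = train_r2 - test_r2
print(f"train R2 = {train_r2:.3f}, validation R2 = {test_r2:.3f}, gap = {gap:.3f}")
```

A large gap between training and validation scores would point to overfitting. A small gap does not settle the question on its own, because rows dropped before the split never appear in either score.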