Unguided Project: Predicting Movie Revenues

I just finished another project. I’m quite proud of it because I struggled mightily at the beginning but eventually managed to get it done.

I’m using a dataset I found on Kaggle that contains metadata about 3,000 movies. After data wrangling, EDA, and feature engineering, I predict the movies’ revenues using four different regression models.

As usual, any feedback is appreciated. In particular, I would like to hear people’s opinion about how to handle the invalid budget values: I chose a drastic approach and dropped the rows with invalid values.

I would also like to know whether my choice to drop the popularity-score outliers is legitimate.

I tried the opposite actions (filling in missing budgets with the median value and not dropping any outliers); unsurprisingly, the accuracy I obtained was lower.

At the same time, however, I’m worried that the approach I used (dropping rows with invalid data instead of imputing them, and also dropping outliers) exposes my models to overfitting.
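
To make the two options concrete, here is a minimal sketch of both cleaning strategies on a few made-up rows. It is not the code from the notebook: the toy data, the function names, the rule that a budget is "invalid" when it is missing or not positive, and the 1.5 × IQR outlier rule are all assumptions for illustration. It uses only the standard library so it runs anywhere.

```python
from statistics import median, quantiles

# Hypothetical toy rows standing in for the Kaggle data.
movies = [
    {"title": "A", "budget": 0, "popularity": 8.0},
    {"title": "B", "budget": 10, "popularity": 9.0},
    {"title": "C", "budget": None, "popularity": 10.0},
    {"title": "D", "budget": 20, "popularity": 11.0},
    {"title": "E", "budget": 30, "popularity": 12.0},
    {"title": "F", "budget": 40, "popularity": 300.0},
]


def is_valid_budget(budget):
    # Assumption: a budget counts as invalid if it is missing or not positive.
    return budget is not None and budget > 0


def drop_invalid_budgets(rows):
    """Strategy 1: keep only the rows whose budget is valid."""
    return [r for r in rows if is_valid_budget(r["budget"])]


def fill_invalid_budgets(rows):
    """Strategy 2: replace invalid budgets with the median of the valid ones."""
    fill = median(r["budget"] for r in rows if is_valid_budget(r["budget"]))
    return [
        {**r, "budget": r["budget"] if is_valid_budget(r["budget"]) else fill}
        for r in rows
    ]


def drop_popularity_outliers(rows, k=1.5):
    """Drop rows whose popularity lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles([r["popularity"] for r in rows], n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [r for r in rows if low <= r["popularity"] <= high]


if __name__ == "__main__":
    print(len(drop_invalid_budgets(movies)), "rows after dropping invalid budgets")
    print(len(fill_invalid_budgets(movies)), "rows after median imputation")
    print(len(drop_popularity_outliers(movies)), "rows after dropping outliers")
```

Comparing the two strategies fairly would mean scoring both on the same held-out movies, since each strategy otherwise changes which rows end up in the evaluation set.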

Excited to hear what you think.

https://nbviewer.jupyter.org/urls/community.dataquest.io/uploads/short-url/lur8LO7Od2kVq92PR8CadPYGq8t.ipynb (1.7 MB)



@Sahil, thanks for fixing the link. I was trying to add the pretty link to nbviewer. I was able to do it last time, but I forgot what the procedure was. Do you think you can help me with that?


Hi Giovan Battista,

First of all, congratulations on becoming a Community Champion for this week! :trophy: Your project is really very impressive, great job! :star_struck: :clap:

As for your question on how to add a pretty link to nbviewer: in case it wasn’t rendered automatically, you’ll find the following post useful:

What you need from there is Option 2 (Super Cool).

Hope it helps!