Hello DQ community,
I’m following along with the Investigating Fandango Movie Ratings project and am at the point where you realize that the datasets at our disposal are not representative because of their sampling methodology. At this point in the project you’re supposed to either collect your own sample using web scraping or validate that the 2016–17 sample and the older sample at least follow the same collection methodology. Specifically, we’re supposed to determine that the 2016–2017 data is a stratified sample of “popular” movies with more than 30 user reviews. To do this, DQ tells us to take a random sample of the 2016–17 data and check Fandango’s site to see whether these movies meet the threshold.
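For reference, the sampling step I’m describing looks roughly like this. The movie titles and ratings below are made-up stand-ins so the snippet is self-contained; in the project you’d load the real 2016–17 CSV instead, and the fixed seed is just for reproducibility:

```python
import pandas as pd

# Hypothetical stand-in for the 2016-17 dataset; in the project you'd load
# something like pd.read_csv("movie_ratings_16_17.csv") instead.
movies = pd.DataFrame({
    "movie": ["Arrival", "La La Land", "Moonlight", "Sing", "Split",
              "Passengers", "Moana", "Lion", "Fences", "Loving"],
    "fandango": [4.0, 4.5, 4.0, 4.5, 4.0, 4.0, 4.5, 4.5, 4.5, 4.0],
})

# Draw a reproducible random sample of movies to check against Fandango's site.
sample = movies.sample(n=5, random_state=1)
print(sample["movie"].tolist())
```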
The problem is that Fandango no longer publishes its own movie ratings and instead only shows Rotten Tomatoes scores. This is an issue because the 2015 dataset was sampled using Fandango review (vote) counts, a metric that is no longer available. I could use Rotten Tomatoes vote counts instead, but wouldn’t that put me in a similar predicament to the one I’m in now?
2015 ReadMe: https://github.com/fivethirtyeight/data/tree/master/fandango
2016 ReadMe: https://github.com/mircealex/Movie_ratings_2016_17/blob/master/README.md
I know that part of this project is to teach us what to do when we run into predicaments like this, but it seems like access to the data has changed too much. I’d love to hear your thoughts.
I tried to use Metacritic user review counts as a proxy for popularity, but found that the minimum Metacritic user review count in the 2015 dataset was only 4 reviews. That doesn’t seem like a reasonable benchmark for popularity, and I don’t know where to go from here. I’m going to put this project on hold until I get some direction; I’d appreciate some help, or validation that there isn’t a good path forward.
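In case anyone wants to reproduce that check: the column name `Metacritic_user_vote_count` comes from the FiveThirtyEight dataset, but the inline rows below are stand-ins so the snippet runs on its own; in the project you’d load the real CSV:

```python
import pandas as pd

# Stand-in rows; in the project you'd load the FiveThirtyEight file, e.g.
# pd.read_csv("fandango_score_comparison.csv").
fandango_2015 = pd.DataFrame({
    "FILM": ["Film A", "Film B", "Film C", "Film D"],
    "Metacritic_user_vote_count": [4, 120, 35, 893],
})

# The smallest Metacritic user review count in the 2015 sample.
min_votes = fandango_2015["Metacritic_user_vote_count"].min()
print(min_votes)  # in the real dataset this came out to 4
```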
- It’s better to finish the project; I’ve just done it.
- You’re right: you can’t check the “more than 30 reviews” criterion today.
- On the other hand, unpopular films won’t accumulate more than 30 reviews or fall below 2.5 stars, so the sampling criterion will still work.
- In practice, the ~0.5-star inflation bug shows up well on a KDE density chart; the remaining steps are not as important, but they train technique.
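A minimal sketch of that KDE comparison, using made-up star values in place of the two datasets’ Fandango columns (in the project you’d pass the real columns, or simply call `df.plot.kde()` on them):

```python
import numpy as np
from scipy.stats import gaussian_kde

# Stand-in star ratings; in the project these would be the Fandango star
# columns of the 2015 and 2016-17 datasets.
stars_2015 = np.array([4.5, 4.0, 5.0, 4.5, 4.0, 4.5, 3.5, 4.0, 4.5, 5.0])
stars_2016 = np.array([4.0, 3.5, 4.0, 4.5, 3.5, 4.0, 3.0, 4.0, 3.5, 4.5])

# Kernel density estimates over the possible star range (0-5).
grid = np.linspace(0, 5, 101)
density_2015 = gaussian_kde(stars_2015)(grid)
density_2016 = gaussian_kde(stars_2016)(grid)

# The ~0.5-star shift shows up as the 2016 peak sitting left of the 2015 peak.
peak_2015 = grid[density_2015.argmax()]
peak_2016 = grid[density_2016.argmax()]
print(peak_2015, peak_2016)
```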
By the way, what I really disagree with in the solution is calling the 2016 distribution left skewed; it looks more like a normal distribution once you take into account the real star range of 2.5–5, while the 2015 distribution is left skewed.
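If you’d rather check the skewness claim numerically than by eye, `scipy.stats.skew` gives a quick sanity check (again with stand-in values, not the real columns):

```python
import numpy as np
from scipy.stats import skew

# Stand-in ratings clustered near the top of the scale; a negative skew
# value means left skewed (long tail toward the low stars).
stars_2015 = np.array([3.0, 4.0, 4.5, 4.5, 4.5, 5.0, 5.0, 4.5, 4.0, 4.5])
print(skew(stars_2015))
```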
For anyone coming to this post: I think a way to get around this is to use archive.org and set it to the date the second dataset was scraped. When you take your random sample to get a shortlist of movies, you might be able to find those movies’ Fandango ratings in that window of time, if archive.org has the page archived. I haven’t come back to this project to test this, but it should work.
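If anyone tries this, the Wayback Machine has an availability API that returns the snapshot closest to a given date. A sketch using only the standard library (the movie-page path in the example comment is made up, and the fetch itself obviously needs network access):

```python
import json
import urllib.parse
import urllib.request

WAYBACK_API = "https://archive.org/wayback/available"

def availability_query(page_url, timestamp):
    """Build the availability API URL for the snapshot of page_url
    closest to timestamp (YYYYMMDD format)."""
    return WAYBACK_API + "?" + urllib.parse.urlencode(
        {"url": page_url, "timestamp": timestamp}
    )

def closest_snapshot(page_url, timestamp):
    """Return the archived URL closest to timestamp, or None if the
    Wayback Machine has no snapshot of the page (needs network)."""
    with urllib.request.urlopen(availability_query(page_url, timestamp)) as resp:
        data = json.load(resp)
    closest = data.get("archived_snapshots", {}).get("closest")
    return closest["url"] if closest else None

# Hypothetical usage; the movie-page path below is invented:
# closest_snapshot("fandango.com/some-movie/movie-overview", "20160815")
```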