I would like to share a general tip with everyone doing any guided project!
I learned this valuable lesson while working on the Guided Project for the Data Cleaning Mission using FiveThirtyEight’s Star Wars Survey Data.
Because the last 8 projects posted for this Guided Project made the same incorrect assumption, resulting in the same skewed plots of the episode mean rankings, it seems worth reminding everyone of the importance of a fundamental best practice of data analysis!
I also endeavor to prevent someone from making this discovery the long way like I did!
But first, here is my project notebook and my lovely plot of the per-episode ranking distributions:
For the unguided portion, I was interested in looking at the rankings differently: plotting the distribution of the #1, #2, …, #6 rankings per episode instead of the mean ranking.
As I was generating the counts, it struck me as strange that each ranking had 835 values. This meant that every episode had been ranked by all respondents who had seen any – but not necessarily all – of the movies. How can a person rank a movie they have never seen?
Poking a little deeper, I confirmed that hundreds of ranking values existed for episodes even where the respondent had indicated they had not seen the movie.
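The check above can be sketched in a few lines of pandas. This is a minimal stand-in, not the real dataset: the `seen_1`/`ranking_1` column names are assumptions based on how the Guided Project typically renames the raw survey headers, and the tiny DataFrame only illustrates the pattern.

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for the cleaned survey data.
# Column names seen_1 / ranking_1 are assumptions, not the raw headers.
star_wars = pd.DataFrame({
    "seen_1":    [True, False, False],
    "ranking_1": [3.0,  1.0,   np.nan],  # row 1: a ranking for an unseen film
})

# Rows where a ranking exists even though the episode was not seen
suspicious = star_wars[(~star_wars["seen_1"]) & star_wars["ranking_1"].notnull()]
print(len(suspicious))
```

On the real data, looping this check over all six `seen_N`/`ranking_N` pairs is what surfaced the hundreds of questionable values.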
The Rankings – Survey Structure
I discovered that the structure of survey data input for this section was not as I assumed.
- If a respondent indicated they had not seen any Star Wars episodes, the per-episode rankings were set to null.
- Otherwise, the rankings for Episodes 1–6 were filled in with default values 1–6 respectively.
- Respondents who had seen any or all episodes could change the ranking values per their preference.
- Each episode’s ranking had to be a unique value from 1–6; null values were not permitted at this point.
- Respondents who had only seen some episodes could have modified some of the rankings per their preference (and left the defaults or entered random rankings for the unseen episodes).
If you would like to know more about how I determined which rankings were definitely invalid – while still keeping over 50% of the rankings from the group of respondents who had only seen some of the episodes, which was not as straightforward as simply nullifying a ranking when the corresponding episode had not been seen – you can find it at the bottom of my notebook in the section “Removing Invalid Ranking Scores”.
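For context, here is the naive invalidation that the notebook section improves on: null out any ranking whose matching episode was not seen. As noted above, this simple rule throws away more data than necessary, and the `seen_N`/`ranking_N` names are assumptions about the cleaned column names.

```python
import numpy as np
import pandas as pd

# Two-episode stand-in; the real data has seen_1..seen_6 / ranking_1..ranking_6.
star_wars = pd.DataFrame({
    "seen_1": [True, False], "ranking_1": [2.0, 5.0],
    "seen_2": [True, True],  "ranking_2": [1.0, 3.0],
})

# Naive rule: if the episode wasn't seen, the ranking can't be trusted.
for n in range(1, 3):  # range(1, 7) on the six-episode data
    unseen = ~star_wars[f"seen_{n}"]
    star_wars.loc[unseen, f"ranking_{n}"] = np.nan

print(star_wars["ranking_1"].isnull().sum())
```

The subtlety is that a respondent who saw only some films may still have deliberately re-ordered the rankings of the films they did see, so some of their values are recoverable rather than garbage.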
With all this done, and my lovely histogram now reflecting how often each valid ranking was actually attributed to each episode, I was ready to wrap up and present my project.
THAT’S when I decided to take a look at the information from FiveThirtyEight about the dataset (America’s Favorite ‘Star Wars’ Movies (And Least Favorite Characters) | FiveThirtyEight)
THAT’S when I learned they had explicitly only considered the rankings from the 471 respondents who indicated they had seen ALL of the films.
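That filter is a one-liner once you know to apply it. A minimal sketch, again assuming the conventional `seen_N` column names from the Guided Project cleanup:

```python
import pandas as pd

# Toy stand-in: only respondents with True in every seen_N column
# match FiveThirtyEight's published methodology.
star_wars = pd.DataFrame({
    "seen_1": [True, True, False],
    "seen_2": [True, False, True],
})

seen_cols = [c for c in star_wars.columns if c.startswith("seen_")]
seen_all = star_wars[star_wars[seen_cols].all(axis=1)]
print(len(seen_all))
```

On the actual survey data this subset should come out to the 471 respondents FiveThirtyEight describes.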
While I do not consider the time spent exercising dataset cleaning and manipulations as wasted time, I would have preferred the efficiency of learning this information by just reading about the dataset!!!
I looked back at the most recently submitted Guided Projects and did not find any that filtered the dataset as FiveThirtyEight intended. I don’t know whether they looked at the readme file and forgot/overlooked that little tidbit of information, or whether, like me, they simply didn’t look at all and just followed the instructions.
I would love to spend more time and remove the unguided portions of my project and perform additional analysis, but I feel like I’ve come to enough ‘profound’ conclusions here – albeit different in nature than expected – to wrap it and get excited for the next challenge!
Click here to view the jupyter notebook file in a new tab