Guided Project - Star Wars Opinion Wars - Never Neglect to Learn About the Dataset!

Hello,

I would like to share a general tip with everyone doing any guided project, ever!

I learned this valuable lesson while working on the Guided Project for the Data Cleaning Mission using FiveThirtyEight’s Star Wars Survey Data.

Because the last 8 projects posted for this Guided Project made the same incorrect assumption, resulting in the same skewed plots of the mean episode rankings, it seems like a good time to remind everyone of a fundamental best practice of data analysis!

I also hope to prevent someone else from making this discovery the long way, like I did!

But first, here are my project notebook and my lovely plot of the per-episode ranking distributions:

Star Wars Opinion Wars - Default Rankings Removed.ipynb (276.3 KB)

Click here to view the jupyter notebook file in a new tab

For the unguided portion, I was interested in looking at the rankings differently and plotting the distribution of the #1, #2, …, #6 rankings per episode instead of the mean ranking.

As I was generating the counts, it struck me as strange that each ranking column had 835 values. This meant that every episode had been ranked by all respondents who had seen any – but not necessarily all – of the movies. How can a person rank a movie they have never seen?

Poking a little deeper, I confirmed that hundreds of ranking values existed for episodes even where the respondent had indicated they had not seen the movie.
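To make the check concrete, here is a minimal sketch with toy data. The column names (`seen_1`, `ranking_1`, …) are placeholder renames for illustration, not the raw CSV headers:

```python
import pandas as pd

# Toy stand-in for the survey frame; a non-null seen_i means "watched it".
df = pd.DataFrame({
    "seen_1":    ["Seen", None, "Seen"],
    "seen_2":    [None,   None, "Seen"],
    "ranking_1": [3,      None, 1],
    "ranking_2": [4,      None, 2],   # row 0 ranked Episode 2 without seeing it
})

def suspect_rankings(df, episode):
    """Rows where a ranking exists but the episode was not marked as seen."""
    return df[f"ranking_{episode}"].notnull() & df[f"seen_{episode}"].isnull()

print(suspect_rankings(df, 2).sum())  # 1
```

Running the same kind of count on the real data is what surfaced the hundreds of suspicious rankings.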

The Rankings – Survey Structure

I discovered that the structure of survey data input for this section was not as I assumed.

  • If a respondent indicated they had not seen any Star Wars episodes, the per-episode rankings were set to null.
  • Otherwise, the rankings for Episodes 1-6 were filled in with default values 1-6, respectively.
  • Respondents who had seen any or all episodes could change the ranking values per their preference.
  • The rankings for Episodes 1-6 had to be unique values from 1-6; null values were not permitted at this point.
  • Respondents who had only seen some episodes could have modified some of the rankings per their preference (and left the defaults or entered random rankings for the unseen episodes).
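Given that structure, one simple heuristic is to flag responses still sitting at the untouched default 1-6 order. A minimal sketch, again with placeholder column names:

```python
import pandas as pd

# Placeholder names for the six ranking columns.
rank_cols = [f"ranking_{i}" for i in range(1, 7)]
df = pd.DataFrame(
    [[1, 2, 3, 4, 5, 6],    # untouched default order: possibly never edited
     [4, 6, 1, 2, 3, 5]],   # clearly re-ordered by the respondent
    columns=rank_cols,
)

# Flag rows still sitting at the default 1..6 sequence.
default_seq = pd.Series(range(1, 7), index=rank_cols)
is_default = df[rank_cols].eq(default_seq).all(axis=1)
print(is_default.tolist())  # [True, False]
```

A default-order row isn’t proof the respondent never touched the rankings – they may genuinely prefer 1 through 6 in release order – which is part of why this cleanup is not straightforward.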

If you would like to know more about how I determined which rankings were definitely invalid while still keeping over 50% of the rankings from the group of respondents who had only seen some of the episodes – which was not as straightforward as simply nullifying a ranking when the corresponding episode had not been seen – you can find it at the bottom of my notebook in the section “Removing Invalid Ranking Scores”.


With all this done and my lovely histogram created, reflecting how often a valid ranking was actually attributed to each episode, I was ready to wrap up and present my project.

THAT’S when I decided to take a look at the information from FiveThirtyEight about the dataset (America’s Favorite ‘Star Wars’ Movies (And Least Favorite Characters) | FiveThirtyEight).

THAT’S when I learned they had explicitly only taken into consideration the rankings by the 471 respondents who indicated they had seen ALL of the films.
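In pandas terms, their filter amounts to keeping only the rows where all six “seen” columns are non-null. A sketch with toy data and placeholder column names:

```python
import pandas as pd

# Placeholder "seen" columns; a non-null value marks the episode as seen.
seen_cols = [f"seen_{i}" for i in range(1, 7)]
df = pd.DataFrame(
    [["Seen"] * 6,                                 # saw every film
     ["Seen", None, "Seen", None, None, None]],    # only saw some
    columns=seen_cols,
)

# FiveThirtyEight's approach: keep only respondents who saw ALL six films
# before touching the rankings at all.
seen_all = df[df[seen_cols].notnull().all(axis=1)]
print(len(seen_all))  # 1
```

One line of filtering up front would have sidestepped the entire default-ranking problem.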

While I do not consider the time spent practicing dataset cleaning and manipulation wasted, I would have preferred the efficiency of learning this by simply reading about the dataset first!

I looked back at the most recently submitted Guided Projects and did not find any that filtered the dataset as intended. I don’t know whether they read the readme file and forgot/overlooked that little tidbit of information, or whether, like me, they simply didn’t look at all and just followed the instructions.

I would love to spend more time on the unguided portions of my project and perform additional analysis, but I feel like I’ve come to enough ‘profound’ conclusions here – albeit different in nature than expected – to wrap it up and get excited for the next challenge!



Hi @kwu!

Well done on spotting this issue! I made the same mistake in my Star Wars GP. It would be good if DataQuest included a small note about this in the GP (or referred to the article). Maybe @nityesh can help?

As you said, this project is targeted at the community rather than future employers, and that’s good. I can see that you learned a lot from the project :)

I liked your thorough explanations, code comments, and robust data analysis, but here is what could be improved:

  • Write down the questions you want to answer at the beginning of the project so the reader has an idea of what to expect
  • You have some typos; you can use Grammarly to correct them :)
  • You can reduce the number of sections and merge the "Boolea-ting" sections into a more general one like “Data Cleaning and Preparation”
  • Make sure to address the warnings (if someone runs the notebook with a future version of pandas, they may cause problems)
  • I’m not sure what you do in the “Seen This Episode? Column Value Map” section. Could you explain it?
  • In cell [37], avoid printing the dataframe; displaying it instead gives a nicer format
  • It’s better to import all the modules in the first code cell
  • Do you actually use the seaborn library? Don’t import anything “just in case” :)

Happy coding @kwu :grinning:


Thanks @artur.sannikov96 and @kwu for pointing this out. I will take this feedback back to our content team! :slight_smile:


Haven’t completed this project myself yet @kwu, but all I can say is that the visualizations look beautiful! I think I should put seaborn on my list after I finish cementing my foundations in matplotlib!


Hi @masterryan.prof,

Thanks for the kind feedback! I am pleased with the progress I’m making on my plots.

As @artur.sannikov96 pointed out, it was misleading for me to load the seaborn module because I didn’t actually use it for the plots in the end. All of the plots can be done with just the matplotlib module :smiley:

For the multi-colour bar plots, here are the code sources I modified to suit my goals:

https://matplotlib.org/stable/gallery/lines_bars_and_markers/barchart.html#sphx-glr-gallery-lines-bars-and-markers-barchart-py

https://towardsdatascience.com/annotating-bar-charts-and-other-matplolib-techniques-cecb54315015
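Here’s a stripped-down sketch of the annotated, multi-colour bar technique from those links. The episode labels and counts below are made up for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Made-up counts for illustration only.
episodes = ["Ep I", "Ep II", "Ep III", "Ep IV", "Ep V", "Ep VI"]
counts = [129, 71, 130, 237, 300, 168]

fig, ax = plt.subplots()
# One colour per bar, drawn from the tab10 qualitative palette.
bars = ax.bar(episodes, counts, color=plt.cm.tab10.colors[: len(episodes)])
ax.bar_label(bars, padding=3)  # annotate each bar with its height
ax.set_ylabel("# of times ranked #1")
ax.set_title("Example: per-episode #1 rankings")
fig.savefig("rankings.png")
```

`Axes.bar_label` (matplotlib 3.4+) replaces the manual `ax.annotate` loop shown in the older tutorials.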

Cheers and have fun playing with plotting!
kwu


Hi @artur.sannikov96,

Thanks very much for also checking out this project - you are so kind to share so much of your time and expertise with me :blush:

How to structure the sections is much clearer to me now; I will be very mindful of your recommendations moving forward.

I thought my writing was pretty clean but a digital assistant to watch my back might just be in order!

“Seen This Episode? Column Value Map” was a poorly labelled and rather inefficient section of code that creates maps and applies them to generate column headers and store full episode names. I have some wrist issues, so I avoid using the mouse, say to highlight text for copy + paste. So it was an exercise for me to pull the true values out of the data instead of defining static maps. I hope that makes sense.

I see now that the warning message raised when I instantiate a series using pd.Series() instructs me to explicitly specify a dtype to avoid it - I missed that info because I overlooked the horizontal scroll bar on the message. Thanks for also mentioning this issue so that the entirety of the warning message finally came to my attention!
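For anyone who hits the same message, here is what the fix looks like (the exact warning text varies by pandas version, so this only checks that the explicit dtype stays clean):

```python
import warnings
import pandas as pd

# In pandas 1.x, pd.Series() with no data and no dtype raised a
# DeprecationWarning about the default dtype changing; passing the dtype
# explicitly avoids it in any version.
with warnings.catch_warnings():
    warnings.simplefilter("error")     # promote any warning to an error
    s = pd.Series(dtype="float64")     # explicit dtype: nothing to warn about
print(s.dtype)  # float64
```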

Once again I appreciate all your support to keep me improving :star_struck:

Best,
kwu


Ohh, I see. Thanks @kwu!

@kwu, kudos on your find. It was enlightening, to say the least.

Can I ask how you came across the methodology that FiveThirtyEight used to set the ranks for each episode? I’m specifically after the article or post that describes the survey structure.

The Rankings – Survey Structure
I discovered that the structure of survey data input for this section was not as I assumed.
  • If a respondent indicated they had not seen any Star Wars episodes, the per-episode rankings were set to null.
  • Otherwise, the rankings for Episodes 1-6 were filled in with default values 1-6, respectively.
  • Respondents who had seen any or all episodes could change the ranking values per their preference.
  • The rankings for Episodes 1-6 had to be unique values from 1-6; null values were not permitted at this point.
  • Respondents who had only seen some episodes could have modified some of the rankings per their preference (and left the defaults or entered random rankings for the unseen episodes).

I thought it would be a good idea to reference the same in the project so that, when I take a look later, I know why the step needed to be done.

I’d also like to add that I liked your approach to identifying and removing the invalid data, specifically the second one.
From the project:

if the episode seen has a rank value higher than the # of episodes seen by the respondent, all rankings on this survey are considered suspect and invalid and are all converted to null.

I hadn’t thought about that!!
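For my own notes, here is roughly how I’d translate that rule into pandas. This uses toy data and placeholder column names (and only two episodes), not @kwu’s actual code:

```python
import pandas as pd

# Toy data; a non-null seen_i means the episode was watched.
df = pd.DataFrame({
    "seen_1":    ["Seen", "Seen"],
    "seen_2":    ["Seen", None],
    "ranking_1": [1.0, 5.0],   # row 1: rank 5 with only 1 episode seen
    "ranking_2": [2.0, 6.0],
})
seen_cols = ["seen_1", "seen_2"]
rank_cols = ["ranking_1", "ranking_2"]

n_seen = df[seen_cols].notnull().sum(axis=1)

# A *seen* episode carrying a rank higher than the number of episodes seen
# makes the whole survey row suspect; null out all of its rankings.
suspect = pd.DataFrame(
    {r: df[s].notnull() & (df[r] > n_seen) for s, r in zip(seen_cols, rank_cols)}
).any(axis=1)
df.loc[suspect, rank_cols] = None
```

Row 1 ends up with all-null rankings while row 0 is left untouched.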

However, it must be noted that the data you corrected was used by FiveThirtyEight to find out how many of the respondents watched at least one movie. They also used this to contrast how the different genders identify as fans.

In the end, it’s all about the story you are trying to get across.
Keep up the good observations and may the force be with you :smiley: