Star Wars Survey: How to calculate the percentage across multiple "seen" columns?

Hi everyone! I hope you’re doing great.

So, the reason I’m here is that I’m working on the Star Wars project where I have multiple Boolean columns from `seen_1` to `seen_6` (whether the survey respondent has seen the movie(s) or not). Each `seen` column represents an episode of Star Wars.

With that in mind, I’m trying to replicate FiveThirtyEight work for fun. In their project, they created a couple of bar graphs to show multiple comparisons. One of them is “Of 835 respondents who have seen any film”. I need to understand how they calculated the percentage of these movies? For example, Episode V The Empire Strikes Back is the most-watched movie with 91% and Episode III Revenge of the Sith is the least-watched movie with 66%.

So far I’ve tried multiple random methods (including pct_change) but no luck.

I’ll appreciate it if you could explain it, that’ll be a big help.

Thank you!

Hi @m.awon

For the exact result, the key here is the total no. of respondents!

1. Identify the rows that have a True or 1 as a value for at least one of the six columns (seen or movies). You may use the `.any()` method with an axis set for columns for this.
You will obtain a series in this step.
2. Use the series from step 1. to filter out the rows where at least one movie has been seen.
You should end up with a data frame with a shape of 835 rows and X cols.
3. Calculate the probability by taking the (sum of columns) / (total no. of rows of data frame) obtained in step 2.
This would give you a series of column names as index and values as the proportion. This proportion should match with Five38 probabilities.

I am not sure if you need help with code as well. Try it first and let me know in case you do. If I made any sense you should be able to do it!

Also, I have done calculations only for this graph.

1 Like

Here’s the answer and I was looking many miles away. Thank you!

This is what I’ve been doing before `len(df_seen_1) / len(seen_any)`, `len(df_seen_2) / len(seen_any)`, `len(df_seen_3) / len(seen_any)` and so on… lol

One last thing. In their third graph, how do they calculate the `Top third`, `Middle third`, and `Bottom third` for 471 respondents for rated movies?

I understand they used `all()` and `ranking` columns, but any idea how they calculated these numbers?

1 Like

Hi @m.awon

Thank you for raising these questions!

1. Yes apply the `.all()` method to seen columns and filter the rows to obtain a data frame.
2. Select the ranking columns (you can select all of them) from the df above and use a `.apply()` method to the whole df.
In the lambda function (that will be applied to each series within that apply method!) use value_counts and set the bin to 3 … et Voila!

The second step is a one-line code too.

Let me know what’s your solution for the last graph! I am interested to see your approach.

1 Like

Thank you so much for helping me out!

1 Like

Dude solution to last graph or I will keep pestering you

All I can say at the moment is that Han did not shoot first. Lol

For some reason, I’m keeping trembling. I need to replace `Unnamed ` columns with character names but first I need to look at the data description for confirmation of which column is to replace with which name. I’ve searched up the FiveThirtyEight github and elsewhere online but no success so far.

Oh, I am sorry, I meant the who’s favourite one graph. Wait the data definition doesn’t have the info or the GP also? The GP guidelines may help you, recheck them.

Oh sorry, I wrote my whole story but forgot to ask the actual question haha.

My question is, by any chance do you know where to look for the data description and any information about replacing those `Unnamed` columns with character names?

@m.awon

No problem. Check out the guided project instructions. As I used my own workbook I have already made changes to the column names, so I am guessing we have enough info to work with from GP itself. You don’t have to follow the GP as is, just understand what they have explained about column wrangling and work on your ideas.

1 Like

I see. Thank you for all the help!

Hi @m.awon

I did not review the content of GP but I re-checked the dataset itself. Notice the first record/row. The respondent ID field should be null. Go through it carefully, you will know the sequence in which the column names for the dataset are given.

Apparently, this message is informative but Grammarly says cautionary!

Also, (so much for detail-oriented!), the dataset already has a column which has True for False for all 6 movies if seen. Could have used that too instead of `.any()` or `.all()` methods provided there is not a mismatch for null values.

1 Like

Hello @Rucha,

I noticed while skimming through the online search some people were talking about dropping the first row which refers to the description of the data (the one you’re talking about) but for some reason, that row is already dropped in the dataset that I’ve got.

`Have you seen any of the 6 films in the Star Wars franchise?` I noticed that this column has the information I’m looking for but I realized about it after spending 2 hours extracting information and making graphs, so it was too late when I woke up haha. Plus I don’t mind, I had to go through a different technique, and that knowledge will help me in the future.

1 Like

Hey @m.awon

I checked the column though and I found a mismatch between this column sum and .any() results. So yeah just experiment it out once to confirm.
Perhaps DQ updated the dataset by now so I do have the record and you don’t.

I came across this project a long time ago. It’s confirmed my life is pending and in a constant flux of backlog on my own!

I noticed the results are different which is why using `any()` on seen columns is logically more suitable than just relying on that column.

You’re right, many things have changed since your last visit to the project.

The data needs thorough understanding because there is inconsistency in some of the columns.

2 Likes