Star Wars Survey Guided Project for review

Hello,

Here is my project submission. I focused on gender and education level differences.
I spent some time figuring out how to get the data into the right form for the Seaborn visualizations. If anyone has any ideas on how to do it better or more efficiently, I’d love to hear them!

Thank you for your time!

Star Wars Survey.ipynb (779.6 KB)

1 Like

Hi @gosaints,

Thank you for sharing your project. It looks amazing: well-organized, with perfect visualizations, all the necessary links, clean code, and very interesting observations (especially about education level and who shot first). You didn’t use much markdown to describe your workflow, but your code is well commented (and only where necessary), and you added intermediate observations after each section, so your project is easy to follow anyway. Also, adding the methodology summary was a good idea.

Some suggestions from my side:

  • To avoid code repetition, you can consider creating a function for plotting all the grouped bar plots.
  • Probably, it’s better to add some more subheadings in the chapter on data cleaning (for different groups of columns to be cleaned).
  • A good practice is to combine code cells without output into one code cell ([3]-[4], [8]-[9], [21]-[22]).
  • For a long line of code with arguments, like this one from the code cell [13]:
ax = sns.barplot(data=star_wars_gender_melt, x='variable', y='value', hue='Gender', ci=None)

you can divide it into several lines, for better readability:

ax = sns.barplot(
                 data=star_wars_gender_melt, 
                 x='variable', 
                 y='value', 
                 hue='Gender', 
                 ci=None
                 )

One argument per line. The same applies to the code cells [35] and [39]. This approach is especially helpful when there are a lot of arguments.

Hope my ideas were helpful.
Great job on your project! Keep up this high level.

1 Like

@Elena_Kosourova,
Thank you so much for this feedback. It really is incredibly helpful. I have updated my project with your suggestions.

Writing a function to avoid repetition:

I just came up with this for writing a function to plot the Favorite Characters plot for both genders. I basically just copied the code that I had but passed the function argument in for the ‘data’ parameter:

def plot1(df):
    ax = sns.catplot(
                 data=df,
                 kind="bar",
                 x="variable",
                 y="value",
                 hue="level_1",
                 ci="sd",
                 palette="dark",
                 alpha=.6,
                 height=14,
                 aspect=3, 
                 legend_out=False
                )

    # Plot Aesthetics
    ax.set_xlabels('')
    ax.set_ylabels('')
    ax.set_xticklabels(
                       ['Anakin Skywalker', 'Boba Fett', 'C-3PO', 'Darth Vader', 'Emperor Palpatine', 'Han Solo',
                        'Jar Jar Binks','Lando Calrissian', 'Luke Skywalker', 'Obi Wan Kenobi', 'Padme Amidala',
                        'Princess Leia', 'R2D2', 'Yoda'],
                       rotation=45,
                       ha="right"
                      )
    # How do I get this title to change accordingly (Male/Female)?
    ax.fig.suptitle('Characters\' Favorability (Male Respondents)', size=60)
    plt.subplots_adjust(top=.9, hspace=1)
    ax.add_legend()
    sns.set(font_scale=4, style='white')
    sns.despine(bottom=True, left=True)

My thought process is to run this function twice and pass both the char_male_groupby and char_female_groupby data frames separately. One question I have is how to get the title to change accordingly (Male/Female)? Is there a better way to write this?

Another thing I came across while doing this was trying to write it as a grouped subplot (my terminology might be off - I need to study this more) as opposed to two independent plots but it didn’t look right. I read that it might be due to the fact that I am plotting it as an sns.catplot which behaves differently than sns.barplot. Again, I didn’t 100% understand what the article was referencing. Wondering what you think of this? If you need me to explain what I mean further, let me know.

Thank you so much for your time!

1 Like

Hi @gosaints,

Great! Glad to hear that my suggestions were useful!

About creating subplots: I don’t see much need for them in this case. Only the code cell [28] produces 2 output plots (too few to justify subplots); in all the other cases you have individual plots. By the way, just out of curiosity, why did you decide to use catplots in [28]? Was there a specific reason? I think bar plots would work as well.
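
On the catplot-versus-barplot point from your post, in case it is useful later: sns.catplot is a figure-level function that builds its own figure (a FacetGrid), while sns.barplot is an axes-level function that accepts an ax argument. That is most likely why catplot did not fit into a subplot grid. Here is a minimal sketch of the axes-level route; the toy dataframe and its values are made up for illustration, with column names borrowed from your melted data:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up long-format data standing in for the real survey tables
toy = pd.DataFrame({
    "variable": ["Han Solo", "Han Solo", "Yoda", "Yoda"],
    "level_1": ["Favorable", "Unfavorable", "Favorable", "Unfavorable"],
    "value": [0.8, 0.1, 0.9, 0.05],
})

# One figure, two stacked axes
fig, axes = plt.subplots(nrows=2, ncols=1, figsize=(10, 8), sharex=True)

for ax, title in zip(axes, ["Male Respondents", "Female Respondents"]):
    # Axes-level function: it draws onto the axes it is given
    sns.barplot(data=toy, x="variable", y="value", hue="level_1", ax=ax)
    ax.set_title(title)
    ax.set_xlabel("")

fig.tight_layout()
```

In a real notebook each axes would of course get its own dataframe rather than the same toy one.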

Now about creating a function. The cool thing here is that you can give your function many parameters, not only df! This is exactly the answer to your question about how to add a specific title to each plot. When we define a function, we list its parameters in parentheses. When we call the function, we pass an argument for each parameter, i.e. the real value for that parameter (including a plot title, as in our case).

For example, look at this function:

import matplotlib.pyplot as plt
import numpy as np

# Defining a function for creating grouped bar plots
def plot_grouped_bar(df1, df2, column, label1, label2,
                     title, ylabel, ylim_end):
    
    # Converting series to lists for both subsets
    df1_list = df1[column].to_list()   
    df2_list = df2[column].to_list()
        
    # Creating labels from the index (identical for both subsets)
    labels = df1.index.to_list()
    
    # Specifying the label locations and the width of the bars
    x = np.arange(len(labels)) 
    width = 0.35
       
    # Plotting the data for both subsets using grouped bar chart
    fig, ax = plt.subplots(figsize=(10,6))
    ax.bar(x - width/2,
           df1_list,
           width, 
           label=label1)
    ax.bar(x + width/2,
           df2_list, 
           width,
           label=label2)

    ax.set_title(title, fontsize=30)
    ax.set_ylabel(ylabel, fontsize=20)
    ax.set_ylim(0,ylim_end)
    ax.set_xticks(x)
    ax.set_xticklabels(labels, fontsize=17)
    ax.legend(loc=0, fontsize=16, frameon=False)
    plt.show()

This function has 8 parameters, not only df (in this example there are actually 2 dataframes), and you pass arguments for them individually for each plot. The plotting code itself stays the same for all of them, so creating the function lets you avoid repeating it.

Now look what we do when we call this function on real data:

plot_grouped_bar(df1=males_rankings_mean,
                 df2=females_rankings_mean,
                 column='rankings', 
                 label1='Men',
                 label2='Women',
                 title='Movie Rankings by Gender', 
                 ylabel='Rankings',
                 ylim_end=6)

We assign to each plot a title, legend labels, y-axis label, a column to take, y-axis limit, and so on. We can assign whatever other things we want (colors, font size, figure size, etc.), depending on the parameters that we decided to create for this function.

Using this logic, you can create a similar function for your grouped bars: the code cells [13], [18], [39], and, probably, [28] (if you decide to use bar plots instead of catplots there). For the code cell [28] (again, if you use bar plots), you’ll just have to run the function twice, once for each plot.
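
If you prefer to keep the catplot, the same idea applies there too. Here is a rough sketch, not a drop-in replacement for your cell: the function name plot_favorability and the toy dataframe are made up for illustration, and I left out most of your styling. It uses grid.figure, which needs a reasonably recent seaborn (older releases expose it as .fig, as in your code).

```python
import pandas as pd
import seaborn as sns


def plot_favorability(df, title):
    """Draw a grouped bar catplot and give it the supplied title."""
    grid = sns.catplot(
        data=df,
        kind="bar",
        x="variable",
        y="value",
        hue="level_1",
        height=5,
        aspect=2,
    )
    grid.set_axis_labels("", "")
    grid.set_xticklabels(rotation=45, ha="right")
    # The title is an ordinary parameter, so each call can set its own
    grid.figure.suptitle(title)
    grid.figure.subplots_adjust(top=0.9)
    return grid


# Made-up stand-in for char_male_groupby / char_female_groupby
toy = pd.DataFrame({
    "variable": ["Luke Skywalker", "Luke Skywalker", "R2D2", "R2D2"],
    "level_1": ["Favorable", "Unfavorable", "Favorable", "Unfavorable"],
    "value": [0.85, 0.05, 0.9, 0.02],
})

male_grid = plot_favorability(toy, "Characters' Favorability (Male Respondents)")
female_grid = plot_favorability(toy, "Characters' Favorability (Female Respondents)")
```

With your real data, the two calls would receive char_male_groupby and char_female_groupby instead of the toy dataframe.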

Hope this post was useful as well :slightly_smiling_face:

1 Like

I see, thank you for explaining this. I understand it better now.

I ended up using sns.catplot because I was having trouble figuring out the aesthetics with sns.barplot such as the xtick labels, rotating them 45 degrees and right aligning them, legend customization, font size, etc.
I tried plt.set_xticklabels, plt.xticks, plt.set(), and plt.set_xticks. I got a lot of errors and couldn’t get it to look right. I ended up looking at this:

which seemed just like what I wanted to create, so I tried that and had an easier time figuring out the aesthetics. The final product was exactly what I was envisioning, so I just went with it.

I had a hard time with the sns.barplot probably because I am not setting up the plot the right way in terms of fig, ax, subplots, etc.

I definitely need to review that part of the course again and gain a better understanding of what exactly is happening when the plots are created and the correct aesthetic functions to call and how to apply them.

1 Like

That’s true, it’s not always easy to figure out how to get the right customization for your plots, especially considering that the syntax for plt and ax is quite different. I always google a lot when trying to create a perfect graph.
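
To illustrate the difference: set_xticks and set_xticklabels are methods of an Axes object, so calling them on plt raises an AttributeError, while plt.xticks is the pyplot counterpart. Here is a small sketch of both styles producing the same rotated, right-aligned labels; the names and values are made up for illustration:

```python
import matplotlib.pyplot as plt

characters = ["Han Solo", "Luke Skywalker", "Yoda"]  # illustrative labels
scores = [0.8, 0.7, 0.9]                             # made-up values
x = list(range(len(characters)))

# pyplot style: functions act on the "current" figure and axes
fig_pyplot = plt.figure(figsize=(6, 4))
plt.bar(x, scores)
plt.xticks(x, characters, rotation=45, ha="right")
plt.title("pyplot style")

# Object-oriented style: methods are called on an explicit Axes object
fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(x, scores)
ax.set_xticks(x)
ax.set_xticklabels(characters, rotation=45, ha="right")
ax.set_title("Axes style")
```

Since sns.barplot returns an Axes, the second style is the one that applies to it directly.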

By the way, about label rotation, probably this article will be useful for your future projects.

Ah, and congratulations on becoming a Champion this week! :partying_face:

2 Likes