Need feedback on Star Wars Survey

Hi Everyone,

I hope you’re all doing well. I’m sharing my project to get feedback on how to improve it. I’d really appreciate it if you could take some time to review it.

I tried to replicate the FiveThirtyEight analysis (treating it like reading a paper :smiley:) to test how well I could understand what’s going on in their project.

Huge thanks to @Rucha and @jesmaxavier for their help and inspiration.
star-wars-survey.ipynb (789.5 KB)



Hey @m.awon,
Thanks for sharing your project. It looks like you thoroughly examined the dataset. I really like your exposition between In [7] and In [8] about how the function works. It reminds me of how teaching a topic is one of the best ways to solidify it in your own mind. I also appreciate how you took the effort to subtitle your graphs. For me, that has always been a frustrating and time-consuming process I tend to avoid.

I’m going to give a few general suggestions up here before moving onto talking about individual graphics and code blocks.

General Comments

  1. Choosing what data visualization to use for a given question is a difficult, yet essential, data science skill. About a year ago, I came across the image below on this forum. It is not a “rule book”, but it might point you towards chart types that you would not normally think of.

  2. When you are creating many different graphs that work together to create a cohesive story, I believe that color (and other stylistic elements) can be leveraged to make connections between different charts.

    • Example: Out[21-23]
      • You are comparing male and female points of views.
      • There is an accepted (though admittedly stereotypical) color scheme used to identify these genders - blue/pink.
      • Using the blue/pink color pair in all three of these graphs helps to tie them together.
      • Additionally, it makes it easier for readers to scan the document.
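
A minimal sketch of this idea, assuming matplotlib and pandas: define the color pair once and look it up by column name in every chart. The GENDER_COLORS name, the chart titles, and all numbers here are made up for illustration and are not taken from the notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical palette: one fixed color per group, defined in a single place
GENDER_COLORS = {"Male": "tab:blue", "Female": "tab:pink"}

# Made-up percentages, only so there is something to plot
episodes = ["Episode I", "Episode II"]
charts = {
    "Seen the movie": pd.DataFrame({"Male": [85, 71], "Female": [72, 61]}, index=episodes),
    "Ranked it first": pd.DataFrame({"Male": [12, 5], "Female": [17, 8]}, index=episodes),
}

fig, axes = plt.subplots(1, len(charts), figsize=(10, 3))
for ax, (title, df) in zip(axes, charts.items()):
    # Look colors up by column name so every chart uses the same mapping
    colors = [GENDER_COLORS[col] for col in df.columns]
    df.plot(kind="barh", ax=ax, color=colors, title=title)
```

Because the colors are keyed by name rather than by position, reordering the columns in one chart would not silently swap the colors.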

Specific Comments

Out [26] - it was not immediately clear to me that these numbers were representing percentages. I think this particular data would work very well as a pie chart.

In [17] - In your definition of favorable_dict, you define “Unfamiliar (N/A)” as 4. With the way you number the other choices (1, 2, 3, 5, and 6), it makes “Unfamiliar (N/A)” appear as though it has a meaning midway between “Neither favorably nor unfavorably (neutral)”: 3 and “Somewhat favorably”: 5.

  • This article gives a nice overview of ordinal vs nominal variables: “Nominal vs Ordinal Data: 13 Key Differences & Similarities”.
    • In this case, you have a non-numeric scale that is ordered. “Unfamiliar (N/A)” does not fit nicely in a particular place on this scale, so I would encourage you to set its value to something distinctly outside the “order” you are using for the other answers. In this case, I would choose "Unfamiliar (N/A)": -1
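
Here is a rough sketch of what that mapping could look like. Only the neutral, “Somewhat favorably”, and “Unfamiliar (N/A)” labels appear in this thread; the other labels, the renumbering to 1 through 5, and the sample responses are assumptions, so the notebook's actual dictionary may differ.

```python
import pandas as pd

# Ordered answers get consecutive ranks; the non-answer gets a sentinel
# value outside that range. Labels not quoted in the thread are assumed.
favorable_dict = {
    "Very unfavorably": 1,
    "Somewhat unfavorably": 2,
    "Neither favorably nor unfavorably (neutral)": 3,
    "Somewhat favorably": 4,
    "Very favorably": 5,
    "Unfamiliar (N/A)": -1,
}

# Made-up responses for illustration
responses = pd.Series([
    "Very favorably",
    "Unfamiliar (N/A)",
    "Somewhat favorably",
    "Neither favorably nor unfavorably (neutral)",
])
scores = responses.map(favorable_dict)

# Drop the sentinel before doing any arithmetic on the ordered scale
mean_score = scores[scores != -1].mean()
print(scores.tolist(), mean_score)
```

Keeping the sentinel out of the 1 to 5 range makes it easy to filter, and it can no longer be mistaken for a point on the favorability scale.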

In [24] - It looks like you have “hard-coded” the coordinates of your x-labels. Instead, try to use the data contained within seen_any_movie to set the value and coordinate of each label. You are already cycling through a list to place the labels, so instead of cycling through a hard-coded list, cycle through the existing data, sort of like this (assuming seen_any_movie is a DataFrame):

for current_bar in range(seen_any_movie.shape[0]):
    ax.text(x = seen_any_movie[current_bar], y = barlabels_ycoords, ....)
    barlabels_ycoords -= 1

Out [2] - This is not a problem with your interpretation of the data, just something to watch out for in the future. In the “Household Income” column, notice how the font renders strangely? That’s because “$” has special meaning in markup. If you want to learn more about that, here’s some info: Stack Exchange
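
Two possible workarounds, sketched under the assumption that the odd rendering comes from Jupyter's MathJax treating text between dollar signs as math. The sample income ranges are made up for illustration.

```python
import pandas as pd

# Made-up rows resembling "$"-prefixed income ranges
df = pd.DataFrame({"Household Income": ["$0 - $24,999", "$25,000 - $49,999"]})

# Option 1: ask pandas not to let MathJax process HTML table output
pd.set_option("display.html.use_mathjax", False)

# Option 2: escape each dollar sign so it is displayed literally
escaped = df["Household Income"].str.replace("$", r"\$", regex=False)
print(escaped.tolist())
```

Option 1 only changes how tables are displayed, while Option 2 changes the stored strings, so the first is probably the safer choice if the column is used later in the analysis.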

Overall, solid analysis of the data! Most of the comments I made are on the presentation of your results, which is something that will come more easily as you gain experience.


Hi, @m.awon
Good job!

  1. Try setting pandas.options.display.max_columns = 38 (the number of columns) so that all columns are visible when you introduce the dataset.
  2. Some numbers in your visualizations have a % sign and some don’t. Try to be consistent: show all numbers in your charts either with or without the % symbol.
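
A small sketch of both points, assuming pandas. The seen_pct Series and its two values are just sample data for illustration.

```python
import pandas as pd

# Show up to 38 columns instead of truncating wide tables with "..."
pd.options.display.max_columns = 38

# Sample values; build every label with the same format string so the
# "%" sign is either on all of them or on none of them
seen_pct = pd.Series({"seen_1": 81, "seen_2": 69})
labels = [f"{value}%" for value in seen_pct]
print(labels)
```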

Thank you for the comprehensive feedback and the explanation. I always look for comments like these because they help me improve. I appreciate your time.

I tried your suggestion for In [24], but it is not working as expected because I don’t fully understand it. Could you explain it in the context of my example, where seen_any_movie is a Series whose index (seen_any_movie.index) is seen_1, seen_2, …, seen_6 and whose values (seen_any_movie.values) are the percentages for these six movies, respectively?

Thank you!!


Thank you for taking the time to go through my work. I really appreciate it.

  1. Great point. It might be better to explain all 38 columns in the introduction to the dataset, but because so many of the columns have undescriptive names, I thought it would be clearer to discuss them one by one in the data cleaning section. That is why I only displayed the 38 columns in Out [3] and discussed the issues with them later.
  2. I agree that it was not right to use the % symbol in the visualizations the way I did. The only reason is that I was obsessed with FiveThirtyEight and wanted to replicate their work as closely as possible :).

Thank you for pointing these out. It helps me see how other people read my work.


Sure, the idea is that the labels you are placing are the values stored in the seen_any_movie Series, and those labels are placed at locations equal to the values stored in the seen_any_movie Series. As a result, you can pull the value out of each cell of the seen_any_movie series, and then use the value as the x-coord of the next label. Then you can take the value of the x-coord and cast it as a string to use as the text of the label.

Here’s what I imagined:

Setup of the problem

import pandas as pd
import matplotlib.pyplot as plt

ylabels = ['Episode I The Phantom Menace', 'Episode II Attack of the Clones',
           'Episode III Revenge of the Sith', 'Episode IV A New Hope',
           'Episode V The Empire Strikes Back', 'Episode VI Return of the Jedi']

seen_any_movie = pd.Series(data = {'seen_1':81, 'seen_2':69, 'seen_3':67,
                                   'seen_4':74, 'seen_5':92, 'seen_6':89})
fig = plt.figure(figsize = (8, 4))
ax = fig.add_subplot(1, 1, 1)
# Draw onto the existing axes; the descending sort puts seen_1 at the top
seen_any_movie.sort_index(ascending = False)    \
              .plot(kind = 'barh', ax = ax, width = 0.8)
# Ticks run bottom to top, so reverse the labels to match the bar order
ax.set_yticks(ax.get_yticks(), fontsize = 18, color = '#656565', labels = ylabels[::-1])

# y-coordinate of the first label (top bar), just below the bar's center
barlabels_ycoords = 4.9

Method for not hard-coding

# Iterate through a counter the length of the seen_any_movie Series
for bar in range(seen_any_movie.shape[0]):
    # Pull out the value at the current position (.iloc is positional;
    # .loc would look for an index label equal to the integer and fail)
    current_xcoord = seen_any_movie.iloc[bar]
    # Create a text label from the current value
    current_label = str(current_xcoord)

    # Use the dynamically generated current_* variables to draw the label
    ax.text(x = current_xcoord, y = barlabels_ycoords, s = current_label,
            fontsize = 16, color= '#656565')
    barlabels_ycoords -= 1
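
As an alternative sketch, and not what the notebook does, matplotlib 3.4 and later provide Axes.bar_label, which reads both the position and the value of each label from the bars themselves, so no y-coordinate counter is needed. The "%d%%" format, the padding value, and the styling below are illustrative choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

seen_any_movie = pd.Series({"seen_1": 81, "seen_2": 69, "seen_3": 67,
                            "seen_4": 74, "seen_5": 92, "seen_6": 89})

fig, ax = plt.subplots(figsize=(8, 4))
seen_any_movie.sort_index(ascending=False).plot(kind="barh", ax=ax, width=0.8)

# The plot call adds one BarContainer to the axes; bar_label takes the
# label positions and values from that container
bar_labels = ax.bar_label(ax.containers[0], fmt="%d%%", padding=3,
                          fontsize=16, color="#656565")
```

The explicit loop above is still useful when the labels need custom placement, for example inside the bars at hand-tuned offsets.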

Hopefully that helps explain what I was suggesting a bit more.


Awesome, it’s all sorted. Thank you for all the help!
