Just finished guided project: Visualising earnings based on college majors - any feedback massively appreciated!

Hi guys!

I just finished the visualising earnings based on college majors project. I enjoyed this one a lot, as the graphs really brought the exploration and analysis to life.

I’d be super grateful for any feedback/criticisms on things I could improve to help with my learning!

Visualising+earnings+based+on+college+majors.ipynb (827.3 KB)



Hi @radiofireworks,

Welcome to the Community and thanks for sharing your great project! :star2: Perfect project structure and use of subheadings, good emphasis (bold font, bullet points), cool storytelling and profound data analysis, very well-commented code, and all the necessary links are present, including the one referring to the Statista website. It's good practice to check the length of a dataframe before and after dropping rows. I also liked your use of the hexagonal bin plot, and the 50% line in code cell [16] is a good idea. Well done indeed!

Some suggestions from my side, hopefully helpful:

  • About the comments. In some code cells (for example, [6], [12], [13]) the comment lines are too long; you could split them across several lines or shorten them. In code cell [22] you really over-commented your code :slightly_smiling_face: I would suggest reducing the comments there to the most important details. For the rest, as I’ve already said, your code is commented excellently!
  • When you mention column names in the markdown cells, it’s better to always include them in backticks, to make them more eye-catching.
  • The code cell [7]: you can consider using a for-loop here, to avoid repeating code.
  • The code cell [20]: in cases like this, when there are a lot of bars and long x-tick labels, it’s better to use a horizontal bar plot.
  • A good practice is to divide long lines of code with many arguments into several lines. For example, in the code cell [16], you can divide this line:
ax1 = recent_grads[:10].plot.bar(x='Major', y='ShareWomen', ylim=(0,1), ax = axs[0], legend=False, color="lightblue",
                                 title="Share of women in top 10 majors")

into these several lines:

ax1 = recent_grads[:10].plot.bar(
    x='Major',
    y='ShareWomen',
    ylim=(0, 1),
    ax=axs[0],
    legend=False,
    color="lightblue",
    title="Share of women in top 10 majors"
)

One argument per line. This improves the readability of the code. You can apply the same approach to the other code cells that output graphs (especially those with a lot of arguments, i.e. more than 3).

  • It would be good to despine the plots, remove redundant ticks, and increase the title font size. You’ll learn all these techniques in the next course, “Storytelling Through Data Visualization”, and can then return to this project and apply those plot aesthetics.
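To make the horizontal-bar and despining suggestions concrete, here is a minimal sketch. The `demo` dataframe is a made-up stand-in for a slice of `recent_grads` (the values are invented for illustration), and the `Agg` backend line is only there so the snippet also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for recent_grads[:3]; the numbers are made up.
demo = pd.DataFrame({
    "Major": [
        "PETROLEUM ENGINEERING",
        "MINING AND MINERAL ENGINEERING",
        "METALLURGICAL ENGINEERING",
    ],
    "ShareWomen": [0.12, 0.10, 0.15],
})

fig, ax = plt.subplots(figsize=(8, 4))

# Horizontal bars keep long major names readable without rotation.
demo.plot.barh(
    x="Major",
    y="ShareWomen",
    ax=ax,
    legend=False,
    color="lightblue",
)
ax.set_xlim(0, 1)

# Larger title, no top/right spines, no tick marks on the category axis.
ax.set_title("Share of women (demo data)", fontsize=16)
for side in ("top", "right"):
    ax.spines[side].set_visible(False)
ax.tick_params(left=False)

n_bars = len(ax.patches)  # one rectangle per major
```

The same three styling calls can be applied to each subplot in a loop if a cell draws several axes.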

Hope my ideas were useful.
Great job on your project, congratulations!


Hi @Elena_Kosourova

Thank you so much for taking the time to write feedback, it was incredibly helpful! I’ve tried to implement your suggestions:

  • over-commenting - I think my problem here was that I was essentially writing a lot of ‘notes to self’ in the comments, forgetting that this isn’t really appropriate for the audience. I’ve removed these now.
  • back-ticks for column names - hopefully all column names should have backticks now. I found that quite often I’d refer to column names like ‘Median’ as ‘Median income’ for clarity. I’m not sure whether I should backtick the latter or not?
  • Long lines of code should now be divided up
  • I’ve stylised the graphs (removed ticks and spines, increased title size)
  • I’ve now implemented a for loop into cell [7]. What I found strange was that when I originally tried to reuse the code from cell [12] which is essentially doing the same thing…
# cell [12]
fig = plt.figure(figsize=(15,25))
# loop over columns and plot histograms
for r in range(0,8):
    ax = fig.add_subplot(4,2,r+1)
    ax = recent_grads[cols[r]].hist(bins = bins, xrot=45, grid = False)

…in cell [7] it didn’t work, and it would result in 12 plots (six empty subplots formatted in the 3 by 2 grid arrangement plus the six scatter plots stacked in a single column)

# cell [7] first attempt
fig = plt.figure(figsize=(15,25))

col_x = ['Sample_size', 'Sample_size', 'ShareWomen', 'Full_time', 'Men', 'Women']
col_y = ['Median', 'Unemployment_rate', 'Unemployment_rate', 'Median', 'Median', 'Median']

for r in range(0,6):
    ax = fig.add_subplot(3,2,r+1)
    ax = recent_grads.plot(x=col_x[r], y=col_y[r], kind='scatter')

Is it something to do with the fact that df.hist() returns a matplotlib.AxesSubplot (or a numpy.ndarray) whereas df.plot() returns a matplotlib.axes.Axes (or a numpy.ndarray)?

Ultimately I found that plotting the scatter plot directly with Matplotlib worked rather than using the Pandas wrapper.

Thanks once again!

Visualising+earnings+based+on+college+majors.ipynb (828.3 KB)


Hi @radiofireworks,

Your project looks perfect now, and I’m glad that my suggestions were useful!

As for your question:

I found that quite often I’d refer to columns names like ‘Median’ as ‘Median income’ for clarity. I’m not sure whether I should backtick the latter or not?

No, it’s better to apply backticks only when you use real, “official” column names, not rephrased ones. So “median income” in your case (and I saw it in your updated project) is perfectly ok without backticks.

As for plotting in matplotlib or in pandas - well, I also usually prefer the first one. It also gives you a lot of user-friendly options to customize your plots.
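If you do want to keep the pandas wrapper in cell [7], the likely cause of the empty grid is not the return type. As far as I understand it, `DataFrame.plot()` opens a new figure whenever no `ax` is passed, whereas `Series.hist()` draws on the current axes, which is why the same pattern worked in cell [12]. A minimal sketch of the fix, assuming that behaviour, is to hand each subplot to pandas via `ax=ax`. The `demo` dataframe below is a made-up stand-in for `recent_grads`.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for recent_grads; the numbers are made up.
demo = pd.DataFrame({
    "Sample_size": [36, 7, 3, 16],
    "ShareWomen": [0.12, 0.10, 0.15, 0.11],
    "Full_time": [1849, 556, 558, 1069],
    "Men": [2057, 679, 725, 1123],
    "Women": [282, 77, 131, 135],
    "Median": [110000, 75000, 73000, 70000],
    "Unemployment_rate": [0.018, 0.117, 0.024, 0.050],
})

col_x = ['Sample_size', 'Sample_size', 'ShareWomen', 'Full_time', 'Men', 'Women']
col_y = ['Median', 'Unemployment_rate', 'Unemployment_rate', 'Median', 'Median', 'Median']

fig = plt.figure(figsize=(15, 25))
for r in range(6):
    ax = fig.add_subplot(3, 2, r + 1)
    # Passing ax=ax makes pandas draw on this subplot
    # instead of creating a figure of its own.
    demo.plot(x=col_x[r], y=col_y[r], kind='scatter', ax=ax)
```

With `ax=ax` in place, all six scatter plots should land in the 3 by 2 grid and no extra figures are created.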

Just a couple more comments, for this project and for future ones:

  • The code cells [14] and [21]: you might consider rotating x-tick labels here.
  • The code cells [16], [18], and [22]: in case of many bars (let’s say, more than 6) and also in case of “long” x-tick labels, it’s better to use horizontal bar plots.
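For the x-tick rotation, one option is the `rot` argument that pandas plotting methods accept. This is only a sketch: the `demo` dataframe and its values are made up, and your cells [14] and [21] may need a different plot type.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data; column names and values are for illustration only.
demo = pd.DataFrame({
    "Major_category": ["Engineering", "Business", "Physical Sciences"],
    "Median": [57000, 43000, 41000],
})

fig, ax = plt.subplots(figsize=(8, 4))

# rot=45 tilts the x-tick labels so they do not overlap.
demo.plot.bar(x="Major_category", y="Median", rot=45, legend=False, ax=ax)

# Right-align the tilted labels so each one ends under its bar.
plt.setp(ax.get_xticklabels(), ha="right")

rotations = [label.get_rotation() for label in ax.get_xticklabels()]
```

When plotting directly with matplotlib, `ax.tick_params(axis="x", labelrotation=45)` does a similar job.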

Otherwise, everything is really perfect, well-organized and tidy. Awesome job!


Brilliant, thank you!
