Would really appreciate suggestions, feedback & criticism -- Guided Project: Visualizing Earnings Based On College Majors

Hey everyone!

I just finished my fourth guided project. As an aspiring data scientist, I would like to use this project to gain more experience and knowledge in my data science journey. I would really appreciate any suggestions, feedback, & criticism! Thank you! ;D

Mission screen

github

4_Guided Project_Visualizing Earnings Based On College Majors.ipynb (1.2 MB)




8 Likes

Very nicely done @adrianzchmn! Love your analysis and answering of the questions!

1 Like

Hi @adrianzchmn

Nice. Very thorough and well laid out analysis. Great job. A couple of things I noticed:

  • I am personally not a big fan of having the same information given twice, once in a markdown cell and again as a comment in the Python code. In the same vein, while your code is very well documented, some of the descriptions are a bit too verbose for me. For example, recent_grads.describe() #summary statistics for all of the numeric columns. All you are doing here is calling a basic DataFrame method, so I don’t think it needs a description in the code itself.

  • len(DataFrame.index). Good idea to use len() to get the number of df rows. However, you don’t need the index attribute. len(DataFrame) is already sufficient and more concise.

  • Visualization: In most scenarios I would recommend avoiding double-coding of variables. In your scatterplots the y-value is also used for color (hue). The same is true for some of the bar charts (color fill and x-axis are based on the same variable). Here, you are in essence adding an additional dimension (level of complexity) without increasing the information presented. Yes, it does look nice at first glance, but it is not considered good practice in data visualization. Better to avoid it.

  • Histograms: You generate the histograms with a for-loop with a hard-coded binsize (8). While this is efficient, I am not convinced that the chosen bin size is ideal for some of the plots. For the “Men” and “Women” histograms more than 80% of the data fall in the first bin.
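To illustrate the len() point above, here is a minimal sketch. The small DataFrame is a made-up stand-in for recent_grads (its column values are hypothetical), used only to show that the row count comes out the same each way.

```python
import pandas as pd

# Hypothetical stand-in for recent_grads; the values are made up.
df = pd.DataFrame(
    {"Major": ["A", "B", "C"], "Median": [50000, 42000, 61000]}
)

n_rows_via_index = len(df.index)  # works, but the .index is not needed
n_rows = len(df)                  # shorter, same result
n_rows_via_shape = df.shape[0]    # another common spelling

print(n_rows, n_rows_via_index, n_rows_via_shape)
```

All three expressions count rows, so the choice is purely about readability.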

Best
htw

3 Likes

Thank you @htw so much for taking the time to give such detailed feedback. I’ll make some updates based on it.

Here is what I’ll do:

  • Simpler description. I agree some descriptions are too wordy and unnecessary.

  • Will use len(DataFrame) to make it concise

  • Simpler visualization. Maximize data to ink ratio.

  • Use proper bin size for histogram

Hey @htw here is the update.

So in this new version, I made some small changes as you suggested:

  • Removed unnecessary comments, e.g.: comments on recent_grads.head(), recent_grads.tail(), recent_grads.describe()

  • Use len(DataFrame) instead of len(DataFrame.index) to get #rows

  • Removed hue (color) on seaborn plots

  • Use the square-root method to determine bin size, reference
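For readers unfamiliar with the square-root method mentioned above, here is a rough sketch. It assumes the usual definition (number of bins = square root of the number of observations, rounded up), since the linked reference is not reproduced in this thread. The function name sqrt_bins and the sample counts are hypothetical.

```python
import math

def sqrt_bins(n_observations):
    """Square-root rule: number of bins = ceil(sqrt(n))."""
    if n_observations <= 0:
        raise ValueError("need at least one observation")
    return math.ceil(math.sqrt(n_observations))

# Example counts only; substitute len(your_dataframe) in practice.
print(sqrt_bins(100))  # 10
print(sqrt_bins(172))  # 14
```

Note that this rule depends only on the number of rows, so every column of the same DataFrame gets the same bin count. It does not adapt to how skewed an individual column is.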

Please let me know what you think. I definitely welcome more suggestions if you have any. Thanks again ;D

github link
4_Guided Project_Visualizing Earnings Based On College Majors-update.ipynb (1.1 MB)


Hi @adrianzchmn.

Great that you took the time and the effort to revise your project.

Regarding the histograms: What I was trying to say is that using a fixed bin size for different variables might or might not be a good idea, depending on the data. Obviously, copy-pasting code 8 times and just changing variables and bin sizes is not a good idea either. You want to have the plots generated in a loop (as in your implementation). What I would probably do is wrap the body of your loop in a function and call the function for each variable with a specific bin size. In your case maybe something like this (I just made up the bin sizes for demonstration, so please don’t use them in your actual analysis):

import matplotlib.pyplot as plt
import seaborn as sns

# recent_grads is the DataFrame loaded earlier in the notebook.
cols = [
    "Sample_size",
    "Median",
    "Employed",
    "Full_time",
    "ShareWomen",
    "Unemployment_rate",
    "Men",
    "Women"
]

# Number of bins per column; same length and order as cols
bin_sizes = [4, 8, 6, 12, 4, 12, 7, 5]

# Define a plotting function
def plot_hist(df, col, bin_size):
    """Plot a histogram of one column with the given number of bins."""
    # plt.figure sets the figure size; plt.plot does not take figsize
    plt.figure(figsize=(10, 5))
    sns.histplot(data=df, x=col, bins=bin_size)
    sns.despine(left=True, bottom=True)
    plt.title(col, weight='bold', fontsize=16)
    plt.show()

# zip() pairs each column name with its bin count.
for col, bin_size in zip(cols, bin_sizes):
    # Pass arguments to plotting function for each iteration
    plot_hist(recent_grads, col, bin_size)

This way you can have custom bin sizes with almost the same amount of code.

Maybe this helps for future projects.

BTW: The part fig = plt.subplots(0,8, figsize =(10,5)) in your code doesn’t really work, because you are overwriting the fig variable with every pass of the for-loop. So, you just get a plot for every variable and not 8 subplots in one. I don’t think you actually need to have subplots here, so you can just use fig = plt.plot(figsize=(10,5)).

Best
htw

1 Like

Thanks again @htw for taking the time to write a thorough response. I really really appreciate your feedback. So based on your suggestion, I made another update.

  • replaced fig = plt.subplots(0,8, figsize =(10,5))
    Thanks for the explanation of why the code above didn’t work. Indeed, I am not trying to make subplots at all; I was just trying to make a single plot. That code did give me the plot size I wanted, so I didn’t bother changing it. However, now that I understand what’s going on, I changed it to fig = plt.figure(figsize=(10,5)). (Note: fig = plt.plot(figsize=(10,5)) doesn’t work for me.)

  • replaced many fig, ax =plt.subplots(1,1, figsize =(10,5))
    I also replaced a few of these with fig, ax = plt.subplots(figsize=(10,5)), since it yields the same result. I tried using plt.plot and plt.figure, but neither of them worked (not sure why; if you know the answer, please let me know).

  • Modified the bin size
    I used 14 bins for most of the columns, but for ‘Men’ and ‘Women’ I used 30 so that we can get a more accurate representation of the distribution, as you suggested.
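On the open question above about plt.plot and plt.figure, here is a minimal sketch, assuming a standard matplotlib install. plt.figure(figsize=...) returns a single Figure, plt.subplots(figsize=...) returns a (Figure, Axes) pair, and plt.plot draws lines on the current axes rather than creating a sized figure. A likely but unconfirmed reason the swap failed is that fig, ax = plt.figure(...) cannot be unpacked into two names, or that later code still expected an ax. The Agg backend line is only there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import matplotlib.pyplot as plt
from matplotlib.axes import Axes
from matplotlib.figure import Figure

# plt.figure creates one Figure and returns it (no Axes object is returned).
fig_only = plt.figure(figsize=(10, 5))

# plt.subplots creates a Figure plus an Axes and returns both as a tuple.
fig, ax = plt.subplots(figsize=(10, 5))

print(isinstance(fig_only, Figure))        # a Figure, not a tuple
print(isinstance(fig, Figure), isinstance(ax, Axes))
print(fig.get_size_inches())               # width and height in inches

plt.close("all")  # release the figures created above
```

If the rest of a cell calls methods on ax, plt.subplots is the convenient choice. If it only uses plt.* or seaborn functions that draw on the current figure, plt.figure is enough.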

Thanks again for sharing your knowledge! I’m learning from you

Links:
github
4_Guided Project_Visualizing Earnings Based On College Majors-update.ipynb (1.1 MB)


@adrianzchmn

I just realized, I never got back to you. But looks good!

One last comment: You are using sp in range(len(cols)) to procedurally generate the plots. This works, and a lot of beginners do it this way. However, in the long run, getting familiar with Python built-ins like enumerate() or zip(), as well as functions from itertools, is a good idea. This can feel intimidating at first, but it really pays off when dealing with iteration problems.
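As a rough illustration of that suggestion, the sketch below does the same pairing three ways. The short cols list is a trimmed example, and the bin counts reuse the 14 and 30 mentioned earlier in the thread.

```python
from itertools import product

cols = ["Men", "Women", "Median"]
bin_sizes = [30, 30, 14]

# Index-based loop: works, but every access goes through cols[sp].
pairs_by_index = []
for sp in range(len(cols)):
    pairs_by_index.append((cols[sp], bin_sizes[sp]))

# zip(): walk two lists in parallel without touching indices.
pairs_by_zip = [(col, bins) for col, bins in zip(cols, bin_sizes)]

# enumerate(): use when the position itself is needed (e.g. a subplot number).
numbered = [(i, col) for i, col in enumerate(cols, start=1)]

# itertools.product(): every combination, e.g. to compare bin counts per column.
combos = list(product(["Men", "Women"], [14, 30]))

print(pairs_by_index == pairs_by_zip)  # True
print(numbered)
print(combos)
```

zip() stops at the shorter input, so keeping cols and bin_sizes the same length still matters.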

Best htw

1 Like

Thank you @htw for your continued responses. Yeah, I am using sp in range(len(cols)) because I thought it looked simpler.

However, after hearing your explanation about functions from itertools, I will definitely make myself more familiar with them and will use them in future projects. Thanks again! You are really helpful.

Best,

Adrian

Hi @adrianzchmn

This is a great notebook. Just wanted to share something that could improve the styling: I think there are too many sentences in bold.

I see many students using bold too often in the body text. Usually, it’s worth using bold for titles but sparingly in the body text (prefer italics if you want to highlight something).

Best
W.

1 Like

Thank you @WilfriedF for your feedback! I agree that there are too many sentences in bold, I was playing around with the formatting when creating this notebook. In the future I am going to experiment with simpler formatting.

Best,

Adrian

1 Like

Hi @adrianzchmn:

I personally use this markdown cheatsheet for formatting my markdown cells. Hope it helps in your future projects!

1 Like

Bookmarked, thanks @masterryan.prof!
Will definitely use it for future projects

1 Like