Guided Project : Visualizing Earnings Based On College Majors - Histograms

Hi :slight_smile:,

How can we answer this question in step 3 in this Guided Project:

" What percent of majors are predominantly male? Predominantly female?"

By looking at the Histogram from ShareWomen?

image

Could I say something like this:

Around 72% of all majors are predominantly composed of women, while men make up the other 28%.

Does each bar/interval, for example from 0.6-0.8, mean that around 27%+27% of all majors are predominantly female? Does the y-axis in this histogram represent the percentage of majors that falls in each interval?

I am having some difficulties in reading/interpreting Histograms!

Thanks for the help!

Hi Abel! When you create a histogram, the data in the column is split into 10 equally-sized intervals (by default) and then the frequencies are plotted. The ShareWomen column contained the percentages (as decimals), so what the histogram is telling you is a count of how many entries in the column were in certain percentage ranges. I can see how this can be a little confusing, so hopefully the following information will help you.

To see what the histogram is doing, you can pass the parameter bins=10 to value_counts() on the ShareWomen column. This splits the data into 10 equally-sized intervals and counts how many items fall into each percentage range.

recent_grads['ShareWomen'].value_counts(bins=10).sort_index()

output:

(-0.0019690000000000003, 0.0969]     3
(0.0969, 0.194]                     14
(0.194, 0.291]                      16
(0.291, 0.388]                      22
(0.388, 0.484]                      19
(0.484, 0.581]                      21
(0.581, 0.678]                      25
(0.678, 0.775]                      29
(0.775, 0.872]                      11
(0.872, 0.969]                      12

Compare this to the histogram, and we can see that we are finding out how many of the 172 majors listed have a certain percentage of women. For example, according to the list of intervals, 25 of the 172 majors consist of roughly 58-68% women.
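To make the link between the interval counts and the bars concrete, here is a minimal sketch. It uses randomly generated numbers as a stand-in for the ShareWomen column (an assumption, not the real dataset), and compares the value_counts(bins=10) result with the counts that numpy.histogram computes for a 10-bin histogram. The two sets of counts should line up, apart from values that land exactly on a bin edge.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for recent_grads['ShareWomen'] (assumption: NOT the real data)
rng = np.random.default_rng(0)
share_women = pd.Series(rng.uniform(0, 0.97, size=172), name="ShareWomen")

# Counts per interval, as in the value_counts call above
bin_counts = share_women.value_counts(bins=10).sort_index()

# Bar heights a 10-bin histogram would draw over the same range
hist_counts, bin_edges = np.histogram(share_women, bins=10)

print(bin_counts)
print(hist_counts)
```

Each number printed is a count of majors, which is what the y-axis of the histogram shows.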

I hope that helps clear up how to read this histogram!


Thanks April, it was very clear to me :slight_smile:

I knew how it worked with the bins/containers, I just didn't know what the y-axis was referring to. The explanation of the dataset wasn't that clear to me. All it said was: "ShareWomen Women as share of total". I was wondering what total it was referring to…

So it's fair to say that more than half of all majors (96 majors, or 56%) are more than 50% women, based on this histogram:

image

Conversely, we can say that 76 majors, or 44% of all majors, are predominantly male!

It's correct, right? :slight_smile:

I believe so!

I see now on your graph that it didn’t label the y-axis. We probably have different defaults for the graphs, since mine automatically labeled it “Frequency”, and that made it easier to interpret. (I don’t know how to change that yet…)
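For what it's worth, the axis labels can be set explicitly on the Axes object that Series.hist() returns. A minimal sketch, using synthetic numbers in place of the real ShareWomen column (an assumption). One possible explanation for the different defaults, not confirmed in this thread, is that one plot came from Series.plot(kind='hist') and the other from Series.hist().

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import numpy as np
import pandas as pd

# Synthetic stand-in for recent_grads['ShareWomen'] (assumption: NOT the real data)
rng = np.random.default_rng(0)
share_women = pd.Series(rng.uniform(0, 0.97, size=172), name="ShareWomen")

# Series.hist() returns a matplotlib Axes, so the labels can be set on it directly
ax = share_women.hist(bins=10)
ax.set_xlabel("ShareWomen")
ax.set_ylabel("Frequency")
```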

Glad it worked out, happy coding!

Thanks once again, April :slight_smile:,

Btw, I was trying to draw all the histograms with a for loop to save code, but I see that in some plots, like the last 2, some ticks overlap others. Why is that?

cols = ['Sample_size', 'Median', 'Employed', 'Full_time', 'ShareWomen', 'Unemployment_rate', 'Men', 'Women']

fig, ax = plt.subplots(figsize=(12,30))

for i in range(8):
    ax = fig.add_subplot(8,1,i+1)
    ax = recent_grads[cols[i]].hist(bins=10)

??

Sometimes matplotlib makes me want to bang my head on the wall… :rage:

I think what’s happening is that when the initial figure is created, it has a default axis grid that we are then drawing the 8 histograms on top of. (You can see it when you comment out the loop.) The weird overlap seems like it’s coming from there. It goes away when you get rid of the xticks and yticks from the figure before the loop:

cols = ['Sample_size', 'Median', 'Employed', 'Full_time', 'ShareWomen', 'Unemployment_rate', 'Men', 'Women']

fig, ax = plt.subplots(figsize=(12,30))
plt.xticks([])
plt.yticks([])

for i in range(8):
    ax = fig.add_subplot(8,1,i+1)
    ax = recent_grads[cols[i]].hist(bins=10)

Hahaha :rofl:

I fully understand you, April!

But now, thanks to you, I think I got it right:

Thank you very much :slight_smile:


How can I plot a grouped bar plot, comparing the number of Women and Men in each category of Majors?

I tried to follow the documentation, and came up with this:

women = recent_grads['Women']
men = recent_grads['Men']
index = recent_grads['Major_category'].unique()

df = pd.DataFrame({'Women': women, 'Men': men}, index=index)

df.plot.bar()

But the result was this:

image

:confused:

I don't think I've learned this yet, which is why I resorted to the docs, but no luck at all!

Hey April, how could we infer from the histogram that 56% of majors are predominantly female?

I could arrive at this inference by the below process, but am struggling to infer the same from a Histogram of ‘ShareWomen’.

My approach :

I first found the female-dominated and male-dominated majors separately using the code below:
female_dominated_majors = recent_grads[recent_grads["Women"] > recent_grads["Men"]]
male_dominated_majors = recent_grads[recent_grads["Men"] > recent_grads["Women"]]

Then I calculated each group's percentage relative to the total number of majors.

Could you please explain how we could infer the same thing from a histogram, and which column to use for it?

Hey there. On the histogram, if we want to know how many of the majors are predominantly female, we’ll focus on just the x-values above 0.5 (50%). For the bar that represents 0.5-0.6, the frequency is about 24, which means there were about 24 majors where the percentage of women was between 50% and 60%. For the next bar (0.6-0.7), there were about 28 majors where the percentage of women was between 60% and 70%. If we add up the frequencies of just these last 5 bars, we end up with 24+28+28+10+8 = 98, which is more than half of the number of majors listed (172). So we can infer from the histogram that more than half of the majors in the dataset are predominantly female.

The shape of the histogram also hints at this. It appears slightly skewed so that the highest bars are after the 0.5 mark. By looking more closely at the numbers (like you did by splitting the dataframe and calculating the percentages) you can verify what is being shown.
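Reading bar heights off a plot is only approximate, so it can help to check the figure directly against the column. A minimal sketch, using a handful of made-up values in place of the real ShareWomen column (an assumption), that counts the majors above the 0.5 mark and turns the count into a percentage:

```python
import pandas as pd

# Hypothetical toy values standing in for recent_grads['ShareWomen']
share_women = pd.Series([0.12, 0.35, 0.48, 0.52, 0.61, 0.77, 0.90, None])

# Drop missing shares so they are not counted in the total
valid = share_women.dropna()

# Majors where women are more than half of the graduates
female_majority = (valid > 0.5).sum()
pct_female = 100 * female_majority / len(valid)

print(female_majority, round(pct_female, 1))
```

On the real data the same two lines would be applied to recent_grads['ShareWomen'] instead of the toy Series.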

Does that help at all?


Absolutely helps April!! :+1:
Thanks a ton for such a detailed explanation!! :grin:


Follow up question on histograms on this guided project:

I’m trying to use the density argument of matplotlib.hist to normalize the histogram values like this:

recent_grads['ShareWomen'].hist(density=True)

However, running this results in “AttributeError: Unknown property density”

Thanks!

Hi Preston. My first thought for your issue is the version of matplotlib that’s being used. For example, for pyplot.hist(), here is the documentation for matplotlib version 1.5.1, and here is the documentation for matplotlib version 3.1.1. In the older version, it looks like they use normed instead of density. You can see this post that shows how to find out which version you’re using. Try normed=True and see if that works?

If not, it might be that pandas.Series.hist() doesn’t accept a normed argument. I don’t really know in that case.
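If neither keyword works, one workaround that does not depend on density or normed is to pass weights so that each entry contributes 1/N to its bar. This is a sketch under assumptions: the data is synthetic rather than the real ShareWomen column, and the result is the share of majors per bin (bars sum to 1), which is not the same thing as a probability density (area equal to 1).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Check which matplotlib version is installed
print(matplotlib.__version__)

# Synthetic stand-in for recent_grads['ShareWomen'] (assumption: NOT the real data)
rng = np.random.default_rng(0)
share_women = pd.Series(rng.uniform(0, 0.97, size=172), name="ShareWomen")
values = share_women.dropna()

# Each value contributes 1/N, so the bar heights are fractions of all majors
weights = np.ones(len(values)) / len(values)

fig, ax = plt.subplots()
heights, edges, _ = ax.hist(values, bins=10, weights=weights)
ax.set_ylabel("Share of majors")
```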

To get the result:

  1. We need a pivot table that aggregates the values of men and women for each major category. Here is the code for it:
    r = recent_grads.pivot_table(index='Major_category', values=['Men', 'Women'], aggfunc='sum')
  2. Then we need the bar plot. Here is the code for it:
    r.plot(kind='bar')

    Hope it helps!
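As far as I can tell, the earlier attempt came out empty because passing the Women and Men Series together with index=... makes pandas realign them to the category names. The rows are labelled with integers, so nothing matches and every value becomes NaN. The same totals can also be computed with groupby. A minimal sketch on a made-up miniature of recent_grads (the rows and numbers are hypothetical):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import pandas as pd

# Hypothetical miniature of recent_grads (made-up numbers, NOT the real data)
recent_grads = pd.DataFrame({
    "Major_category": ["Engineering", "Engineering", "Health", "Health", "Arts"],
    "Men": [2000, 1500, 300, 400, 250],
    "Women": [500, 400, 1200, 1600, 700],
})

# One row per category, with Men and Women summed across its majors
totals = recent_grads.groupby("Major_category")[["Men", "Women"]].sum()

# Two columns in the frame give two bars per category (a grouped bar plot)
ax = totals.plot.bar()
ax.set_ylabel("Number of graduates")
```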

Thanks @april.g for the guidance.

What’s the best way to show the percentage of women by major or by major category in a plot? I believe this has been discussed somewhere in the forum, so it would be great if someone could share the link.

Thanks in advance.

Hi Anik. I think what you want is a bar plot since we’re dealing with categorical data. We can utilize the ShareWomen column because it already contains the percentage of women for each major.

We could do something like recent_grads.plot.bar(x='Major', y='ShareWomen'). The problem is that there are a lot of majors, so we have to restrict ourselves to either the first or last few, or plot them a chunk at a time (recent_grads[:20], recent_grads[20:40], etc). (I think one of the last instructions in the project has you just find the first 10 and last 10 in 2 separate bar graphs.)

For major category, you could probably use groupby, similar to the pivot table you see earlier in this thread, but focus on the ShareWomen column and use the mean as the aggregate function to get the average percentage of women in each category.
recent_grads.groupby('Major_category')[['ShareWomen']].agg('mean').plot.barh() gives the following:
image
(I don’t like where the legend is placed, but you can adjust that. Probably.)
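On the legend: with a single column it adds little, so one option is to turn it off with legend=False and label the axis instead. A minimal sketch on made-up rows (hypothetical values, not the real dataset), which also sorts the bars so the categories are easier to compare:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import pandas as pd

# Hypothetical miniature of recent_grads (made-up numbers, NOT the real data)
recent_grads = pd.DataFrame({
    "Major_category": ["Engineering", "Engineering", "Health", "Health"],
    "ShareWomen": [0.2, 0.3, 0.8, 0.9],
})

# Average share of women per category
mean_share = recent_grads.groupby("Major_category")[["ShareWomen"]].mean()

# Sorted horizontal bars, no legend, with the x-axis labelled instead
ax = mean_share.sort_values("ShareWomen").plot.barh(legend=False)
ax.set_xlabel("Mean ShareWomen")
```

If the legend is wanted, ax.legend(loc="lower right") is one way to move it.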

I hope that helps.


That’s awesome! Thank You so much :slight_smile:

Thank you for this. I discovered that instead of writing code to get rid of the ticks, we could start by creating only a figure (the canvas for our plot) with fig = plt.figure(figsize=(12,30)), instead of creating both the figure and a subplot with
fig, ax = plt.subplots(figsize=(12,30)). It seems the subplot created there interferes with the ones we add inside the for loop, resulting in the overlap.
I hope this helps someone.
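Another way to avoid the extra axes, sketched here with randomly generated columns instead of the real recent_grads data (an assumption), is to let plt.subplots create the whole grid up front and hand each Axes to hist() through its ax parameter:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for a few recent_grads columns (assumption: NOT the real data)
cols = ["Sample_size", "Median", "Employed", "Full_time"]
rng = np.random.default_rng(0)
recent_grads = pd.DataFrame({c: rng.normal(size=100) for c in cols})

# One Axes per column, created once, so nothing is drawn on top of a leftover subplot
fig, axes = plt.subplots(nrows=len(cols), ncols=1, figsize=(12, 15))

for ax, col in zip(axes, cols):
    recent_grads[col].hist(bins=10, ax=ax)
    ax.set_title(col)

fig.tight_layout()  # keeps titles and tick labels from colliding
```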