Very low Sample_size values in "Visualize Earning based on Major" Guided Project

On step 3 of this Guided Project we are asked to create a histogram to analyze the distribution of the Sample_size values (among others). First of all, I don’t really understand why knowing the frequency of values like Sample_size, Male and Female is helpful. What I mean is that those values are only really meaningful when considered as a percentage of the whole (e.g. percentage of male graduates out of total graduates).

Anyway, I decided to check what percentage of graduates the Sample_size most often corresponds to. To do this I used the Full_time_year_round column, since the sample is taken from that group. This is what I found out:

The median Sample_size percentage is just 1.77%, which seems to me too low to yield meaningful results. Am I missing something? Am I reading the Sample_size values wrong?

P.S. Sorry for the poor formatting of this message. This is the first time I have posted on this forum, so I need to get used to it.


Welcome to the DQ Community, @gbpignatti5!

You may refer to this helpful guide for your future queries here.

Coming to the Sample_size column: as far as I understand, we just need to plot the “Sample_size” column as a histogram.

Something like this:
recent_grads["Sample_size"].plot(kind="hist", bins=25, range=(0, 5000))

I am not sure what the intuition is behind dividing this column by another column. Can you elaborate on which part of the instructions or task mentions this? It would be really helpful if you could attach a mission link.

Dear @Rucha, thanks so much for the guide you shared. That’s exactly what I needed.

As for my question, I understand that the project’s instructions only asked us to represent the “Sample_size” distribution using a histogram; I have no problem with that.

What I was trying to do is different. As I understand, the “Sample_size” column contains the number of people from each major that data was collected from to calculate the median salary. So for example in the first row (Petroleum Engineering) out of the 1207 graduates with a “Full_time_year_round” job only 36 (the value in the “Sample_size” column) were asked about their salary.

What I achieve by dividing the “Sample_size” column by the “Full_time_year_round” column is finding the percentage of people sampled to get the salary information. My finding is that those percentages are very low (about 3% in the example I gave above). This is not the end of the world, but it makes me question how valid these results are. I imagine you usually want to sample at least 10% of the total.
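For anyone who wants to reproduce the calculation, here is a minimal sketch of it in pandas. Only the first row uses the numbers quoted in this thread (36 and 1207 for Petroleum Engineering); the other two rows and the Sample_pct column name are made up for illustration, so the median printed here is not the 1.77% from the real dataset.

```python
import pandas as pd

# First row uses the figures quoted in this thread (Petroleum Engineering).
# The other two rows are hypothetical placeholders, not real dataset values.
recent_grads = pd.DataFrame({
    "Major": ["PETROLEUM ENGINEERING", "HYPOTHETICAL MAJOR A", "HYPOTHETICAL MAJOR B"],
    "Sample_size": [36, 50, 10],
    "Full_time_year_round": [1207, 5000, 400],
})

# Percentage of full-time, year-round graduates that were sampled.
# "Sample_pct" is a column name invented for this sketch.
recent_grads["Sample_pct"] = (
    recent_grads["Sample_size"] / recent_grads["Full_time_year_round"] * 100
)

median_pct = recent_grads["Sample_pct"].median()
print(recent_grads[["Major", "Sample_pct"]])
print(f"Median sample percentage: {median_pct:.2f}%")
```

On the full dataset the same two lines should work on the real recent_grads DataFrame, although any row with 0 in Full_time_year_round would give inf or NaN and may need to be filtered out first.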

I hope this clarifies what I was trying to say. I’m looking forward to hearing your opinion.


hey @gbpignatti5

Okay. This is an interesting query! Please do correct me if I am wrong.

By dividing the Sample_size column by the Full_time_year_round column, you want to check whether proper sampling was done here. If I can extend it to the application of the Central Limit Theorem (CLT) itself: the condition there is n <= 10% of N when N is finite, and it also matters whether sampling was done with or without replacement.
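As a rough sketch, the 10% condition can be checked like this. It treats Full_time_year_round as the population N, which is the assumption made earlier in this thread and not something the dataset documentation confirms; the function names are made up for the example.

```python
def sampling_fraction(n: int, N: int) -> float:
    """Share of the finite population N that ended up in the sample n."""
    return n / N


def meets_ten_percent_condition(n: int, N: int) -> bool:
    """True when the sample is at most 10% of the population.

    Under this condition, draws made without replacement are usually
    treated as approximately independent.
    """
    return n <= 0.10 * N


# Petroleum Engineering figures quoted in this thread.
n, N = 36, 1207
fraction = sampling_fraction(n, N)
condition_met = meets_ten_percent_condition(n, N)
print(f"Sampling fraction: {fraction:.2%}, condition met: {condition_met}")
```

Note that the condition is an upper bound on the sample, so a sampling fraction of about 3% satisfies it.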

Perhaps we need to raise the query directly with FiveThirtyEight. There is a post raised by @olamideoshilalu here, “Visualize Earnings vs Majors - Median Salary is $40K when Employed Count is 0”, which also raises a question about the data itself, and I saw an associated unresolved error in FiveThirtyEight’s GitHub repo for another dataset.

When I did this project, I was some 100kms away from understanding proper methods of sampling, application of statistics in analysis etc. etc. (Now I am like 98kms away in my learning journey! :stuck_out_tongue_closed_eyes:).

Thank you for raising this question. It definitely makes you wonder whether this particular dataset answers more questions or compels us to raise more. Interestingly, the existing DS experts would say “Welcome to Jumanji Data Science!” :rofl:


Ok @Rucha, your post makes me feel much better about my findings. I am some 100 light-years away from understanding any significant statistics but I do recognize that using less than 2% sampling is probably not very meaningful.

I also found out that in many cases the numbers in the Full_time and Part_time columns don’t add up to the number in the Employed column. I had already seen the post by @olamideoshilalu but I think the issues I raised are even more critical since they pertain to the whole dataset.

I am really liking Dataquest so far, but I feel that sometimes (especially during the Guided Projects) it tends to give instructions and ask questions that are a little confusing or not very meaningful. Something similar was raised in this thread: “Can someone answer this questions in 2nd chapter of Guided Project: Visualizing Earnings Based On College Majors”.
I think I’ll start using the instructions in the Guided Projects as general guidelines and try to slowly work more independently.

Hi @gbpignatti. Welcome to DQ community.

I agree that there are quite a few problems with the dataset. The numbers of employed and unemployed do not always add up to the total, and the sum of college and non-college jobs doesn’t give the number of employed. I guess only FiveThirtyEight can address those issues.
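A minimal sketch of how the consistency checks mentioned in this thread could be run in pandas. The rows are made up, the gap column names are invented for the example, and the dataset column names other than Employed, Full_time and Part_time (which are named above) are assumed to follow the same naming style.

```python
import pandas as pd

# Hypothetical rows for illustration only; not values from the dataset.
df = pd.DataFrame({
    "Major": ["HYPOTHETICAL A", "HYPOTHETICAL B"],
    "Total": [1200, 1000],
    "Employed": [1000, 800],
    "Unemployed": [200, 100],
    "Full_time": [700, 650],
    "Part_time": [300, 200],
    "College_jobs": [600, 500],
    "Non_college_jobs": [400, 250],
})

# Each gap is zero when the parts add up to the whole.
checks = pd.DataFrame({
    "Major": df["Major"],
    "employment_gap": df["Total"] - (df["Employed"] + df["Unemployed"]),
    "hours_gap": df["Employed"] - (df["Full_time"] + df["Part_time"]),
    "jobs_gap": df["Employed"] - (df["College_jobs"] + df["Non_college_jobs"]),
})

# Keep only the majors where at least one gap is non-zero.
gap_columns = ["employment_gap", "hours_gap", "jobs_gap"]
inconsistent = checks[(checks[gap_columns] != 0).any(axis=1)]
print(inconsistent)
```

A non-zero gap is not necessarily an error: the columns may be defined in a way that does not require them to sum (for example, Total may include people outside the labor force), so this only flags rows worth a closer look.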

On sample sizes, yes, many of the sample sizes were too small. But the sample size doesn’t always have to be a particular percentage of the population, especially when the population is very large. You can check out this document and this article on sample sizes.
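To illustrate why the percentage matters less than the absolute sample size, here is a small sketch using the textbook margin of error for a proportion with a finite population correction. It is only an analogy: the dataset reports median salaries, not proportions, and it assumes simple random sampling without replacement, which we don’t know was used. The function name and the population sizes other than 1207 are made up for the example.

```python
import math


def margin_of_error(n: int, N: int, p: float = 0.5, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a proportion.

    n is the sample size, N the population size, p the assumed proportion
    (0.5 is the worst case), and z the normal critical value.
    """
    standard_error = math.sqrt(p * (1 - p) / n)
    # Finite population correction for sampling without replacement.
    fpc = math.sqrt((N - n) / (N - 1))
    return z * standard_error * fpc


# Same sample of 36 drawn from a small and from a very large population.
small_pop = margin_of_error(n=36, N=1207)
large_pop = margin_of_error(n=36, N=1_000_000)
# A larger sample from the very large population (still a tiny percentage).
bigger_sample = margin_of_error(n=400, N=1_000_000)

print(f"n=36,  N=1207:      {small_pop:.3f}")
print(f"n=36,  N=1,000,000: {large_pop:.3f}")
print(f"n=400, N=1,000,000: {bigger_sample:.3f}")
```

Under these assumptions the two n=36 results are nearly identical even though the sampling percentages differ enormously, while going from 36 to 400 respondents shrinks the margin considerably.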


Thank you @olamideoshilalu. As I said, I don’t yet have the statistical competence to make valid statements. I was just using my intuition. But the links you shared are helpful.

I guess I’ll just use the dataset as it is without worrying too much about the inconsistencies.