How to make sense of (var1 vs var1) in histograms (in scatter plot matrices)?

In the Guided Project: Visualizing Earnings Based On College Majors, Step 4 asks to plot a couple of scatter matrices. In particular, to plot a scatter matrix for the columns Sample_size and Median. After plotting, we can see two histograms, Sample_size vs Sample_size and Median vs Median. How does one make sense of these two histograms? That is, if I find that the median income of 40,000 on the x axis corresponds to the median income of 110,000 on the y axis, how do I make sense of it?

Understand that a histogram usually displays the frequencies of a given occurrence (with the x-axis being that occurrence).

What a histogram typically represents is the distribution of a set of given values. The y-axis in most histograms is thus usually the count, or frequency.

In your scatter matrices, in the cells where the x-axis and y-axis are the same variable, the y-axis should simply be thought of as the count. Because it doesn’t make much sense to actually plot a variable against itself, those cells are instead used to show the variable’s distribution as a histogram, while every other pairing of axes results in a scatter plot.
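To see this layout for yourself, here is a minimal sketch using `pandas.plotting.scatter_matrix`. The data is made up (random numbers standing in for the project's `Sample_size` and `Median` columns, and the 170 rows are an arbitrary choice), so treat it as an illustration rather than the project's actual result.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Hypothetical stand-in data, NOT the real recent-grads dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Sample_size": rng.integers(2, 4000, size=170),
    "Median": rng.normal(40_000, 10_000, size=170).round(),
})

# Returns a 2x2 grid of Axes: histograms on the diagonal,
# scatter plots everywhere else.
axes = scatter_matrix(df[["Sample_size", "Median"]], figsize=(8, 8))

# The diagonal cell holds histogram bars (rectangle patches)...
diag_bars = len(axes[1, 1].patches)
# ...while an off-diagonal cell holds one scatter point per row.
off_diag_points = axes[0, 1].collections[0].get_offsets().shape[0]
print(diag_bars, off_diag_points)
```

The diagonal cells contain bars and no points, which is why reading them as "Median vs Median" does not work.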

Though I’m only a fellow learner so I’d suggest you take the word of someone more experienced!

As far as I understand, the count would be represented by the number of rows for a certain occurrence, that is, when we plot an occurrence (on the x-axis) versus how many times that occurrence happens. But when it comes to plotting x versus x, I don’t think that we get a count. I’m still not quite sure how such a plot makes sense.

Think what a graph would look like if you plotted x, against x (i.e. plotting it against itself).

It would just be a straight line with a gradient of 1! And it wouldn’t give you any useful info because you’re not seeing its relationship with anything.
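A quick way to convince yourself of this is to fit a straight line to a variable plotted against itself. The sketch below uses made-up values as a stand-in for a `Median` column and NumPy's `polyfit`; the numbers themselves are arbitrary.

```python
import numpy as np

# Hypothetical values standing in for a "Median" salary column.
rng = np.random.default_rng(0)
x = rng.normal(40_000, 10_000, size=170)

# Fit y = slope * x + intercept with y set to x itself.
slope, intercept = np.polyfit(x, x, 1)
print(round(slope, 6), round(intercept, 6))
```

Whatever values you put in, the fitted slope comes out as 1 and the intercept as 0 (up to floating-point error), so the plot carries no information about the data.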

That’s why the y-axis - when it comes to a histogram in a scatter matrix - shouldn’t be taken at face value, and should instead just be thought of as the “count” (you would ignore the numbers on the y-axis for this purpose). Usually in a scatter matrix you aren’t going to see accurately what the precise counts are, but it can be helpful for gauging the distribution of the values nonetheless.
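If you do want the precise counts behind a diagonal histogram, one option is to compute them directly instead of reading them off the plot. This is a small sketch with `numpy.histogram` on made-up data; the 10 bins are an assumption, so set `bins` to whatever your plot actually uses.

```python
import numpy as np

# Hypothetical values standing in for the "Median" column.
rng = np.random.default_rng(0)
median = rng.normal(40_000, 10_000, size=170)

# counts[i] is the number of values falling between edges[i] and edges[i + 1].
counts, edges = np.histogram(median, bins=10)

for count, lo, hi in zip(counts, edges[:-1], edges[1:]):
    print(f"{lo:>9,.0f} to {hi:>9,.0f}: {count} majors")
```

These counts are what the bar heights stand for, regardless of which numbers happen to be printed on the y-axis of that cell.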

Here is a scatter matrix I made in one of the DQ projects:

Let’s look at the histogram right in the middle, which visualizes the distribution of ‘Median’ values (referring to median salary). We see that a very large number of majors fall just under 40,000, in roughly the 35,000-45,000 range. If we tried to include the y-axis (i.e. also ‘Median’) in the interpretation, we would see that the median salary range of 35-40k on the x-axis corresponds to… a 100k median salary on the y-axis? There’s no sense to be made here.

Instead, if we just ignore the original y-axis labels and numbers and think of the y-axis as representing some generic “count” figure, it makes more sense. Sure enough, the very high representation of median salaries in the 35k-40k range is also reflected in the density of points on the scatter plots immediately above and below the histogram!