In the Guided Project: Visualizing Earnings Based On College Majors, Step 4 asks us to plot a couple of scatter matrices, in particular a scatter matrix for the columns Sample_size and Median. After plotting, we can see two histograms: Sample_size vs Sample_size and Median vs Median. How does one make sense of these two histograms? That is, if I find that a median income of 40,000 on the x-axis corresponds to a median income of 110,000 on the y-axis, how do I interpret that?
Keep in mind that a histogram usually displays the frequency of a given occurrence (with the x-axis being that occurrence).
What a histogram typically represents is the distribution of a set of values. The y-axis in most histograms is thus the count, or frequency.
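To make the "y-axis is a count" idea concrete, here is a minimal sketch that bins a handful of values by hand. The salary figures and the bin width are made up for illustration and are not taken from the project dataset; `histogram_counts` is a hypothetical helper, not a library function.

```python
from collections import Counter

# Made-up median salaries (illustrative only, not the project's data)
medians = [32000, 36000, 38000, 38500, 40000, 41000, 45000, 52000, 60000, 110000]

BIN_WIDTH = 10_000  # assumed bin width, chosen only for this example


def histogram_counts(values, bin_width):
    """Return {bin_start: count} for equal-width bins."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


counts = histogram_counts(medians, BIN_WIDTH)

# Text histogram: the x-axis is the salary bin, the bar length is the count
for start, n in counts.items():
    print(f"{start:>7}-{start + BIN_WIDTH - 1:<7} {'#' * n}")
```

Each bar's length is just the number of values that fell into that bin, which is all the y-axis of a histogram is measuring.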
In your scatter matrices, in the cells where the x-axis and y-axis are the same variable, the y-axis should simply be thought of as the count. Because it doesn't make much sense to plot a variable against itself, those cells are instead used to show the variable's distribution as a histogram, while every other pairing of axes results in a scatter plot.
Though I'm only a fellow learner, so I'd suggest you take the word of someone more experienced!
As far as I understand, the count is the number of rows for a certain occurrence; that is, we plot an occurrence (on the x-axis) against how many times that occurrence happens. But when it comes to plotting x versus x, I don't think we get a count. I'm still not quite sure how such a plot makes sense.
Think about what a graph would look like if you plotted x against x (i.e. plotting it against itself).
It would just be a straight line with a gradient of 1! And it wouldn't give you any useful info, because you're not seeing its relationship with anything.
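As a quick sanity check of the "gradient of 1" point, here is a small sketch that fits a least-squares line to a column plotted against itself. The values are invented for illustration, and `least_squares_slope` is a hypothetical helper written for this example.

```python
def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line through the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) * (x - mean_x) for x in xs)
    return covariance / variance


# Made-up median salaries (illustrative only)
medians = [32000, 36000, 38000, 38500, 40000, 41000, 45000, 52000, 60000, 110000]

# Plotting a column against itself: every point lies on the line y = x
slope = least_squares_slope(medians, medians)
print(slope)
```

Whatever values you put in, the slope comes out as 1, so the plot carries no information about the data beyond its range.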
That's why the y-axis, when it comes to a histogram in a scatter matrix, shouldn't be taken at face value and should instead just be thought of as the "count" (you would ignore the numbers on the y-axis for this purpose). Usually in a scatter matrix you aren't going to see the precise counts, but it can still be helpful for gauging the distribution of the values.
Here is a scatter matrix I made in one of the DQ projects:
Let's look at the histogram right in the middle, which visualizes the distribution of "Median" values (referring to median salary). We see that a very large number of majors sit just under 40,000, in roughly the 35,000-45,000 range. If we tried to include the y-axis (i.e. also "Median") in the interpretation, we would see that the median salary range of 35-40k on the x-axis corresponds to… a 100k median salary on the y-axis? There's no sense to be made here.
Instead, if we just ignore the original y-axis labels and numbers and think of the y-axis as representing some generic "count" figure, it makes more sense. Sure enough, the very high representation of median salaries in the 35k-40k range is also reflected in the points on the scatter plots immediately above and below the histogram!
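If you want to check this reading yourself, here is a rough sketch using `pandas.plotting.scatter_matrix` on synthetic data. The DataFrame below is randomly generated as a stand-in (the numbers are not from the project's dataset), and the column names are reused only to mirror the question. It inspects the diagonal cell directly rather than relying on the tick labels.

```python
import random

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import pandas as pd
from pandas.plotting import scatter_matrix

random.seed(0)

# Synthetic stand-in data: invented values, not the project's dataset
n_rows = 200
df = pd.DataFrame({
    "Sample_size": [random.randint(2, 4000) for _ in range(n_rows)],
    "Median": [random.gauss(40000, 10000) for _ in range(n_rows)],
})

# Returns a 2x2 array of matplotlib Axes for two columns
axes = scatter_matrix(df[["Sample_size", "Median"]], figsize=(8, 8))

# Diagonal cell (Median vs Median): a histogram drawn as rectangular bars
median_hist = axes[1, 1]
bar_heights = [bar.get_height() for bar in median_hist.patches]
print(sum(bar_heights))  # bar heights are counts, so they add up to the row count

# Off-diagonal cell (Sample_size vs Median): an ordinary scatter plot
scatter_cell = axes[0, 1]
print(len(scatter_cell.collections))  # the scatter points live in a collection
```

The bar heights in the diagonal cell sum to the number of rows, which is what you would expect if the bars are counts, regardless of what the tick labels on that cell happen to show. To look at the figure itself, you could call `axes[0, 0].figure.savefig(...)` with a filename of your choice.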