Variability, Screen 6

In the last para below:

Notice in the histogram above that prices can vary around the mean much more or much less than $79,873 (recall the value of the standard deviation is $79,873.06). Some outliers around $700,000 are more than $500,000 above the mean and a couple of houses around $30,000 are more than $150,000 below the mean.

In the histogram, it seems that the maximum price is around 600,000 and the minimum is 30,000. Do we also have to look at the outliers to measure the distance above and below the mean? Is this important?

  1. Some below-mean values are negative (outliers). How do these get plotted in the histogram? This gives wrong information. Do outliers come into the picture without reason, or does the ggplot2 library make mistakes?

  2. 10 bins are specified, but only 6 to 7 are used, and only 5 bins contain most of the frequencies. Is the number of bins properly specified for the problem?

Hi @sharathnandalike. Regarding your first point:

In the histogram, it seems that the maximum price is around 600,000 and the minimum is 30,000. Do we also have to look at the outliers to measure the distance above and below the mean? Is this important?

Can you please clarify this question? I’ll take a shot based on my understanding of what you are asking. With this histogram it is difficult to see, but there is at least one observation counted in the most expensive bin range, around $800,000. The observation(s) are difficult to see here because the plot is scaled to fit the bin centered around the mean, which has over 1,500 observations. The outliers are included in the measurement of variability, as stated in the Learn section of this screen:

The standard deviation doesn’t set boundaries for the values in a distribution: The prices can go above and below the mean more than $79,873.
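To make that concrete, here is a minimal sketch in Python (used only for the arithmetic; the lesson itself is in R). The prices are made up for illustration and are not the lesson's dataset. It measures how far each price sits from the mean in units of the standard deviation, and shows that an outlier can land well beyond one standard deviation while still being part of the calculation.

```python
from statistics import mean, pstdev

# Hypothetical sale prices (illustrative only, not the lesson's data).
prices = [30_000, 120_000, 150_000, 160_000, 180_000, 200_000, 210_000, 700_000]

avg = mean(prices)
sd = pstdev(prices)  # population standard deviation; every price contributes

# Signed distance of each price from the mean, in dollars and in SD units.
for price in prices:
    distance = price - avg
    print(f"{price:>9,}  {distance:>+12,.0f}  {distance / sd:+.2f} SD")

# Prices that lie more than one standard deviation away from the mean.
beyond_one_sd = [p for p in prices if abs(p - avg) > sd]
print("Beyond one SD:", beyond_one_sd)
```

The standard deviation summarizes the typical distance from the mean, so individual prices are free to fall closer to or farther from the mean than that value.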

To your second question, this relates to our other conversation about how ggplot2 depicts frequency distributions by default. Also see this conversation. Again, having bins that appear to span ranges wider than the true spread of the data is a surprising behavior of ggplot2, but whether it counts as a mistake is debatable. The negative values on the axis come from the bin ranges, not from the data: no house has a negative price. This could be a good topic of conversation/debate! Data visualization distills datasets into visualizations that help the reader understand what is going on with the data. There is sometimes a tradeoff between accuracy and simplicity.
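Here is a rough sketch of why the first bin can start below the smallest price. It is written in Python purely for the arithmetic, and it rests on an assumption: as I understand ggplot2's behavior when only `bins` is given, the bin width is the data range divided by `bins - 1`, and the first and last bins are centered on the minimum and maximum. The function name `centered_bin_edges` and the minimum and maximum values are hypothetical.

```python
def centered_bin_edges(lo, hi, bins):
    """Return bins + 1 edges for equal-width bins whose first and last
    bins are centered on lo and hi (assumed to mirror ggplot2's default)."""
    width = (hi - lo) / (bins - 1)   # range is split into bins - 1 widths
    first_edge = lo - width / 2      # half a bin hangs below the minimum
    return [first_edge + i * width for i in range(bins + 1)]

# Illustrative minimum and maximum prices, not taken from the dataset.
edges = centered_bin_edges(30_000, 750_000, bins=10)
print(edges[0], edges[-1])  # lowest and highest bin edges
```

With these made-up numbers the lowest edge is negative even though every price is positive, which is the effect you noticed on the x-axis.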

Speaking to your 3rd point, there are, in fact, 10 bins shown in this histogram. This is difficult to see because there are very few observations in the “most expensive” bin, and there do not appear to be any observations in the second-highest price bin. Again, I’d point you to this conversation about selecting the number of bins to show in a plot. The purpose of this histogram is to show that prices can vary around the mean much more or much less than $79,873. Using 10 bins depicts this pattern. But because there are over 2,500 observations in this dataset, the ggplot2 default of 30 bins would probably work fine as well.
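As a rough illustration of how all 10 bins can exist while only a handful are visible, the Python sketch below tallies made-up prices into 10 plain equal-width bins. Both the data and the helper `bin_counts` are hypothetical, and the bins here are not centered the way ggplot2's are; the point is only the shape of the counts.

```python
def bin_counts(values, lo, hi, bins):
    """Count values in `bins` equal-width bins spanning [lo, hi]."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # Clamp so a value equal to hi falls in the last bin.
        i = min(int((v - lo) / width), bins - 1)
        counts[i] += 1
    return counts

# Made-up, right-skewed prices with a single very expensive house.
prices = ([100_000] * 400 + [180_000] * 1500 + [260_000] * 500
          + [340_000] * 90 + [420_000] * 9 + [780_000])

counts = bin_counts(prices, 0, 800_000, bins=10)
print(counts)
print("Tallest bar vs. last bar:", max(counts), "vs.", counts[-1])
```

When the tallest bar holds well over a thousand observations, a bar holding one observation is drawn only a fraction of a pixel high, and empty bins draw nothing at all, so the plot can look like it has fewer bins than were requested.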