Histogram Plot using series.plot(kind="X", x="Y") functions

I’m referring to page 4 of the Challenge (Data Cleaning): https://app.dataquest.io/m/102/challenge%3A-cleaning-data/4/filtering-out-bad-data

I’m trying to build a histogram using Series.plot() or DataFrame.plot() (documented here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.plot.html):
Code:

    import pandas as pd
    import matplotlib.pyplot as plt

    # avengers is the DataFrame already loaded earlier in the challenge
    true_avengers = pd.DataFrame()
    avengers.plot(x="Year", kind="hist")

instead of using avengers["Year"].hist()

I wonder why it is showing me this graph instead?

Hi @chaangallison,

By coincidence, I was looking at the pandas plotting source code (its matplotlib backend) today.
What happens after calling avengers.plot(x="Year", kind="hist") seems mysterious.

image

This one is even more mysterious: it is what you get when you remove the x='Year' parameter! So specifying the column actually makes that column disappear from the plot.

But after some time I finally understood what is going on.

In the pandas source code there is the following helper method for plotting histograms:

    def _plot(
        cls,
        ax,
        y,
        style=None,
        bins=None,
        bottom=0,
        column_num=0,
        stacking_id=None,
        **kwds,
    ):
        if column_num == 0:
            cls._initialize_stacker(ax, stacking_id, len(bins) - 1)
        y = y[~isna(y)]

        base = np.zeros(len(bins) - 1)
        bottom = bottom + cls._get_stacked_values(ax, stacking_id, base, kwds["label"])
        # ignore style
        n, bins, patches = ax.hist(y, bins=bins, bottom=bottom, **kwds)
        cls._update_stacker(ax, stacking_id, n)
        return patches

Most of it is not very interesting. But one thing that seems to be missing is x.

Now you might ask yourself: why is that?

Well, it’s because a histogram takes no separate x column. It is built from a single set of values, which pandas receives as y. So specifying avengers.plot(y='Year', kind='hist') gives the right result!
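Here is a minimal sketch of that, assuming a small made-up DataFrame (toy, with invented Year and Appearances values) in place of the real avengers data. Explicit axes are passed in so the two calls do not draw on top of each other; if I read the behavior correctly, both routes should draw identical bars.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for the avengers DataFrame (values are invented)
toy = pd.DataFrame({
    "Year": [1963, 1963, 1964, 1975, 1990, 2005],
    "Appearances": [1269, 1165, 3068, 612, 197, 35],
})

fig, (ax_df, ax_series) = plt.subplots(1, 2)

# DataFrame route: y= names the column whose values get binned
toy.plot(y="Year", kind="hist", bins=5, ax=ax_df)

# Series route: the column is selected first, so no x or y is needed
toy["Year"].plot(kind="hist", bins=5, ax=ax_series)

# Bar heights (counts per bin) drawn on each axes
heights_df = [patch.get_height() for patch in ax_df.patches]
heights_series = [patch.get_height() for patch in ax_series.patches]
```

Comparing heights_df with heights_series is a quick way to check that the two calls binned the same values.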

And in retrospect everything makes sense: the x-axis of a histogram is the bins! Don’t worry, the later courses on statistics will make this clearer. That’s where visualization started to make real sense to me.
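To make the "x is the bins" idea concrete, here is a small sketch using numpy.histogram on a handful of made-up years (the values are invented, not taken from the avengers data). The only input is one array of values; the x positions come out of the binning.

```python
import numpy as np

# Made-up years, used only for illustration
years = np.array([1963, 1963, 1964, 1975, 1990, 2005])

# One input array goes in; counts (bar heights) and bin edges (x positions) come out
counts, edges = np.histogram(years, bins=3)

print(edges)   # [1963. 1977. 1991. 2005.]
print(counts)  # [4 1 1]
```

The three bars would sit between consecutive edges, with heights 4, 1 and 1.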

By the way, df.hist() also uses all the numeric columns in the DataFrame to make histograms (one subplot per column, whereas df.plot(kind='hist') overlays them on one set of axes). But because you specify x='Year', it’s exactly 'Year' that gets lost!
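As far as I can tell, the reason is that pandas moves the x column into the index before plotting, so it is no longer among the columns to be binned. The sketch below only mimics that step with set_index on a made-up DataFrame (toy and its values are invented); it does not call any pandas internals.

```python
import pandas as pd

# Made-up stand-in for the avengers DataFrame (values are invented)
toy = pd.DataFrame({
    "Year": [1963, 1963, 1964, 1975, 1990, 2005],
    "Appearances": [1269, 1165, 3068, 612, 197, 35],
})

# Mimic what x="Year" appears to do: Year becomes the index...
as_index = toy.set_index("Year")

# ...so only the remaining columns are left to be histogrammed
remaining = list(as_index.columns)

# y="Year" instead selects exactly that column
selected = toy["Year"]

print(remaining)      # ['Appearances']
print(selected.name)  # Year
```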

By the way 2: a very easy way to specify the data is to start from a Series,
like avengers['Year'].hist()

Cheers :champagne:

Hi @chaangallison, I also found an alternative: remove x="Year", since you are already extracting the Year column of the DataFrame in the next line. Note that the filter should be Year > 1959 or Year >= 1960, since the year 1960 itself has to be included.

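    import pandas as pd
    import matplotlib.pyplot as plt

    # avengers is the DataFrame already loaded earlier in the challenge
    true_avengers = pd.DataFrame()

    # Without x=, every numeric column (including Year) goes into the histogram
    avengers.plot(kind='hist')
    # Keep only the rows from 1960 onward
    true_avengers = avengers[avengers["Year"] > 1959]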
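A quick sketch of why the two forms of the filter are interchangeable, assuming Year holds whole numbers. The DataFrame here (toy) is made up for illustration and only borrows the column name.

```python
import pandas as pd

# Made-up years around the cutoff (values are invented)
toy = pd.DataFrame({"Year": [1900, 1959, 1960, 1963, 2005]})

# Both conditions keep 1960 and drop 1959 when the years are integers
after_1959 = toy[toy["Year"] > 1959]
from_1960 = toy[toy["Year"] >= 1960]

print(after_1959["Year"].tolist())   # [1960, 1963, 2005]
print(after_1959.equals(from_1960))  # True
```

Using Year > 1960 instead would silently drop the 1960 rows.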