Pandas: how to get column statistics without outliers?

During the guided project ‘Exploring eBay Car Sales Data’, on several occasions the challenge is to get statistics about the values in a particular column while that column contains outliers. So you want to get statistics without those outliers.

Let’s take this page as an example:
https://app.dataquest.io/m/294/guided-project%3A-exploring-ebay-car-sales-data/5/exploring-the-date-columns

I have done an initial analysis of the column ‘registration_year’ and observed that my statistics are skewed by outliers (or in this case, simply invalid values): values below 1900 and above 2016.

What I want to do now is get the statistics while excluding those values.
I can think of 4 ways to do so.

METHOD 1.

Delete the rows from the dataframe that don’t meet my criteria (my outliers).
I could do that (and know how to code it, as that is what I did earlier for entries with an invalid ‘price’).
However, there is a big reason why I don’t like this: when I delete rows from the dataframe, I also delete the possibly ‘valid’ data in the other columns of those rows.
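For reference, a minimal sketch of this approach, using a small toy stand-in for the project’s `autos` dataframe (the column names match the project; the data here is illustrative):

```python
import pandas as pd

# Toy stand-in for the project's autos DataFrame
autos = pd.DataFrame({
    "registration_year": [1800, 1999, 2005, 2020],
    "price": [500, 1200, 3000, 800],
})

# Keep only rows whose registration_year is in range; the dropped
# rows lose their values in *all* columns, which is the drawback
autos_clean = autos[autos["registration_year"].between(1900, 2016)]
print(autos_clean["registration_year"].describe())
```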

METHOD 2.

Replace the outliers with another value, for example the average of the other values. This is a bit cumbersome and arbitrary. I also tried replacing them with ‘Unknown’, but that doesn’t work: once I do that, I can no longer use .describe(), since I have introduced string values.
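As a sketch, the numeric variant of this could use `Series.mask`, with the average computed over the in-range values only (data and variable names are illustrative):

```python
import pandas as pd

years = pd.Series([1800, 1999, 2005, 2020], name="registration_year")

in_range = years.between(1900, 2016)
# Replace out-of-range years with the mean of the valid ones;
# the result stays numeric, so .describe() keeps working
replaced = years.mask(~in_range, years[in_range].mean())
print(replaced.describe())
```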

METHOD 3.

Isolate the column at hand in a new variable, as a Series. In that Series you can safely drop any outliers and get your statistics from there. Like this:

reg_year_only = autos['registration_year']
reg_year_only = reg_year_only[reg_year_only.between(1900, 2016)]

That worked, as afterwards
reg_year_only.value_counts().sort_index(ascending=True)
and
reg_year_only.describe()
gave me the desired results.

METHOD 4.

What actually sounds most elegant to me is some ‘code chaining’ that would first filter my original dataframe to exclude the outliers (but without deleting the rows), and then get the statistics on that.

Something like this:

#autos[autos["registration_year".between(1900,2016)]].describe()
#autos[autos["registration_year">1900]].value_counts().sort_index(ascending=True)

# Either of these lines of code will give TypeError: '>' not supported between instances of 'str' and 'int'

As indicated, I don’t manage to get this to work, though.

My questions:

  1. Any thoughts on the 4 methods above, and which one I should best use? (Or is it actually a 5th one … which I missed?)
  2. For the 4th one that I describe: what would be the correct code?

Hi @jasperquak

I will do my best to answer your 2 questions.

1. Any thoughts on the 4 methods above, and which one I should best use? (Or is it actually a 5th one … which I missed?)

You could try replacing the outlier values with the median of the column, which can be calculated with the code mentioned in this article.

In your case you may choose to omit the second line, which replaces NaN (Not a Number) values with zeros.
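A minimal sketch of the median idea, with illustrative data (the article’s exact code may differ):

```python
import pandas as pd

years = pd.Series([1800, 1999, 2005, 2020], name="registration_year")

valid = years.between(1900, 2016)
median = years[valid].median()

# Cast to float first so the median assigns cleanly,
# then overwrite the out-of-range years with it
years_fixed = years.astype(float)
years_fixed.loc[~valid] = median
print(years_fixed.describe())
```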

2. For the 4th one that I describe: what would be the correct code?
The error you get for that code is: TypeError: '>' not supported between instances of 'str' and 'int'.

You get this error because registration_year is stored as a string instead of an integer in the CSV file, so you may need to convert the year value to an int to perform the comparison.

Hope this helps~


Hi @masterryan.prof,

Thank you for the reply.
Regarding the replacement by median values: thank you for the link, that’s an option indeed.

Then, for my ‘method #4’ (trying to get statistics on a filtered version of the dataframe). Yes, the error message seems to point in that direction, but that doesn’t seem to be the issue:

[screenshot: the dtype of ‘registration_year’ is float64]

So it is actually a float64. There must be something else wrong with my code… or what I’m trying to do is just not possible (no, I can’t believe that…).

Hi @jasperquak
When you want to access a group of rows and columns by label, you have to specify both the rows you want to work with (in this case a boolean row index: autos["registration_year"].between(1900,2016)) and the column ("registration_year").
You must also use .loc to tell pandas that you want to index the DataFrame with labels instead of positions.

So the correct code could be like this:
autos.loc[autos["registration_year"].between(1900, 2016), "registration_year"].describe()

autos.loc[autos["registration_year"]>1900, "registration_year"].value_counts().sort_index(ascending=True)
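For completeness, chaining without .loc also works here, since the boolean mask selects the rows and the column can be selected afterwards (shown with a toy dataframe; in the project you would use the real `autos`):

```python
import pandas as pd

autos = pd.DataFrame({"registration_year": [1800, 1999, 2005, 2020]})

# Filter the rows first, then pick the column, then summarize
stats = (
    autos[autos["registration_year"].between(1900, 2016)]
    ["registration_year"]
    .describe()
)
print(stats)
```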

So it is actually a float64. There must be something else wrong my code… or what I try to do is just not possible (no, cannot believe that…).

Referring to the last question: the dtype float64 just indicates the data type of the column’s values. Note that .describe() and .value_counts() print different dtypes in their output.
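A quick illustration of that dtype difference (toy data):

```python
import pandas as pd

years = pd.Series([1999, 2001, 2001])

counts = years.value_counts()   # frequencies are integers
stats = years.describe()        # summary statistics are floats

print(counts.dtype)  # int64
print(stats.dtype)   # float64
```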

I’m learning English, so I apologize for any poor syntax.
I hope I’ve been helpful.

Hi @david.y

Thank you!
That works - your code does exactly what I was trying to achieve, in the way I tried to achieve it!

(No worries about your English btw, it is perfectly clear. And certainly your Python seems pretty good :grinning:.)
