During guided project ‘Exploring eBay Car Sales Data’, on several occasions the challenge is to get statistics about the values in a particular column, while that column has outliers. So you want to get statistics without those outliers.
Let’s take this page as an example:
https://app.dataquest.io/m/294/guided-project%3A-exploring-ebay-car-sales-data/5/exploring-the-date-columns
I have done initial analysis for the column ‘registration_year’, and observed that my statistics are impacted by outliers (or in this case, just invalid values) being values below 1900 and above 2016.
What I want to do now, is getting the statistics while excluding those values.
I can think of 4 ways to do so.
METHOD 1.
Delete the rows from the dataframe that don’t meet my criteria. (My outliers.)
I could do that (and know how to code it as that is what I did earlier for entries with an invalid ‘price’.)
However, there is a big reason why I don’t like this: when I delete rows from the dataframe, I also delete also the possibly ‘valid’ data in other columns for those rows that I delete.
METHOD 2.
Replace the outliers with another value. For example the average of the other values. This is a bit cumbersome and arbitrary. I also tried to replace by ‘Unknown’, but that doesn’t work. After doing that, I cannot use the .describe()
anymore since I have now introduced string values.
METHOD 3.
Isolate the column-at-hand in a new variable, being a series. Then in that series you can safely drop any outliers, and get your statistics from there. Like this:
reg_year_only = autos['registration_year']
reg_year_only = reg_year_only[reg_year_only.between(1900, 2016)]
That worked. As afterwards
reg_year_only.value_counts().sort_index(ascending=True)
and
reg_year_only.describe()
gave me the desired results.
METHOD 4.
What sounds actually most elegant to me, is to do some ‘code chaining’ that would first filter my original dataframe to exclude my outliers (but without deleting the rows), and then get the statistics about that.
Something like this.
#autos[autos["registration_year".between(1900,2016)]].describe()
#autos[autos["registration_year">1900]].value_counts().sort_index(ascending=True)
# Either of these lines of code will give TypeError: '>' not supported between instances of 'str' and 'int'
As indicated, I don’t manage to get this to work though.
My questions:
- Any thoughts on the 4 methods above, and which one I should best use? (Or is it actually a 5th one … which I missed?)
- For the 4th one that I describe: what would be the correct code?