Remove outliers from price

Screen Link: https://app.dataquest.io/m/294/guided-project%3A-exploring-ebay-car-sales-data/4/exploring-the-odometer-and-price-columns

My Code:

autos=autos[autos['price'].between(0,99999999)]

What I expected to happen:
I expected to remove 0 and 99999999 in the price column.

Actual: nothing changed.

Hi @candiceliu93

So far you’re doing great, but the values that you need are those that are not outliers. So:

  1. Find the outliers; you can do that with .describe() and .value_counts()
  2. Use autos = autos.loc[autos['price'].between(x, y), :] to keep only the rows between x and y, which removes the outliers
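The two steps above can be sketched roughly as follows. The tiny `autos` frame and the bounds `low` and `high` are made up for illustration; the real thresholds depend on what `describe()` and `value_counts()` show in the actual data.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset, just to show the mechanics.
autos = pd.DataFrame({"price": [0, 500, 1200, 3500, 8000, 15000, 99999999]})

# Step 1: inspect the column to spot suspicious values.
print(autos["price"].describe())
print(autos["price"].value_counts().sort_index().head())

# Step 2: keep only rows whose price lies in a plausible range.
# low and high are illustrative assumptions, not values from the project.
low, high = 1, 350000
autos = autos.loc[autos["price"].between(low, high), :]

print(autos["price"].tolist())
```

Note that the filter keeps the rows inside the range; the outliers are whatever falls outside it.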

Good luck!

Oh, I see!! Thank you! Now I understand (x, y). Thank you so much!!

Hi @candiceliu93,
Actually, after trying the between(x, y) method in practice, I think between includes the x and y values in the result. So, to exclude x and y, I used x+1 and y-1.
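A small check of that behaviour, on made-up data: by default `between` keeps both endpoints, so `between(0, 99999999)` keeps rows priced exactly 0 and 99999999. Strict comparisons exclude them without the x+1 / y-1 workaround. Recent pandas versions also accept `inclusive="neither"`, which is left out of the sketch because older versions expect a boolean there instead.

```python
import pandas as pd

price = pd.Series([0, 500, 1200, 99999999])  # made-up sample values

# Default: both endpoints are included, so nothing is filtered out here.
inclusive_mask = price.between(0, 99999999)

# Strict comparisons drop rows equal to either endpoint.
exclusive_mask = (price > 0) & (price < 99999999)

print(price[inclusive_mask].tolist())
print(price[exclusive_mask].tolist())
```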

Hello, everyone!
I have an issue with removing the outlier.
What pandas function did you use? The ‘df.drop()’ one?

Hi @karevas14

autos = autos.loc[autos['price'].between(x, y), :] is all you need to drop the outliers.

It’s a filter: afterwards the df should only contain the cars that have a price between x and y.

This doesn’t make sense.
If you use .between(0, 99999999), that covers all prices, so nothing gets removed.
There are multiple outliers in the price.
I think to be safe you start with (1000000, 99999999) and get rid of everything between 1 million and 99 million.
Then separately get rid of all rows with 0 in price.
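One way to sketch this two-pass approach, assuming a made-up `autos` frame: `~` negates a boolean mask, so rows that match the range are dropped rather than kept.

```python
import pandas as pd

# Made-up rows, only to show the mechanics.
autos = pd.DataFrame({"price": [0, 0, 900, 4500, 1000000, 99999999]})

# Pass 1: drop the very high prices (between includes both endpoints).
too_high = autos["price"].between(1000000, 99999999)
autos = autos[~too_high]

# Pass 2: drop the rows where the price is 0.
autos = autos[autos["price"] != 0]

print(autos["price"].tolist())
```

The same result could be reached with a single `between` call that keeps the plausible range, so the choice is mostly a matter of readability.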

How can .describe() and value_counts() help in determining outliers?

That’s a nice question!
describe() will show a summary of the column: count, mean, std, min, the quartiles (25%, 50%, 75%) and max.
If you check the min, max and the quartile values you can find the outliers. For example, if the max value is 100 and Q3 (75%) is 25, then you can infer that 100 is probably an outlier.
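A minimal sketch of that check, using made-up numbers chosen so that Q3 is 25 and the max is 100. The ratio at the end is only a rough illustration of how far the max sits from the bulk of the data, not a formal outlier test.

```python
import pandas as pd

values = pd.Series([10, 15, 20, 25, 100])  # made-up numbers

# describe() returns a Series indexed by statistic name.
summary = values.describe()
print(summary[["min", "25%", "50%", "75%", "max"]])

# Compare the max against Q3: a large gap hints at an outlier.
gap_ratio = summary["max"] / summary["75%"]
print(gap_ratio)
```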

Same with value counts: remember that value_counts() returns a Series with the count for each value. An outlier is something that does not happen often, so if you have something like this:
value: count
1:100
2:110
3:120
4:50
20:1

Then 20 is an outlier. Obviously you should consider the context of the data and any additional info that you might be able to find.
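The counts in the example above can be reproduced with a small sketch. The Series is built by hand from those counts, and treating "seen only once" as rare is an assumption made for illustration.

```python
import pandas as pd

# Rebuild the example: 1 appears 100 times, 2 appears 110 times, and so on.
values = pd.Series([1] * 100 + [2] * 110 + [3] * 120 + [4] * 50 + [20])

# sort_index() orders by value rather than by frequency.
counts = values.value_counts().sort_index()
print(counts)

# Values seen only once are candidates to inspect, not automatic outliers.
rare = counts[counts == 1].index.tolist()
print(rare)
```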

For example, if you have shampoo sales data and you find that they sold 20000 units two times, is it really an outlier? Maybe they did sell that many units, so what I would do is ask “Hey, is the company able to sell 20000 units?” If the company says “No, our production is only 5000”, then 20000 is an outlier. But if the company says “Yeah, our production is 5000 and we can have a stock of 16000”, then 20000 is not an outlier.

Hope I made myself clear.
Good luck!

Thanks a lot! It really helped