Guided Project: Exploring Ebay Car Sales Data [Removing outliers from price and odometer columns]

Screen Link: https://app.dataquest.io/m/294/guided-project%3A-exploring-ebay-car-sales-data/4/exploring-the-odometer-and-price-columns

My Code:

autos["price"].describe()
autos['price'].value_counts().sort_index(ascending=False)

What I expected to happen: I expected that both describe and sort_index would give me the same min and max values.

What actually happened:

Output of autos["price"].describe():

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64

Output of autos['price'].value_counts().sort_index(ascending=False):

99999999       1
27322222       1
12345678       3
11111111       2
10000000       1

Is there something I am doing wrong?
As I understand it, the index of the Series returned by value_counts() holds the same values that describe() summarizes, so the min and max should match.

Thanks in advance!
Cheers!

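Both methods give you the same min value: zero.

The max value is only rounded in the displayed output of describe(). The summary is printed in scientific notation with six decimal places, so 99,999,999 (9.9999999e+07) is shown as 1.000000e+08, that is, 1*10^8 = 100,000,000. The actual max is still 99,999,999, as shown in the output of value_counts().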
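
To see the rounding concretely, here is a small sketch in plain Python. It assumes the summary is formatted with six digits after the decimal point in scientific notation, which matches the describe() output quoted above.

```python
# The largest price in the column, as reported by value_counts().
max_price = 99999999

# Scientific notation with six decimal places, as in the describe() output.
as_displayed = f"{float(max_price):.6e}"
print(as_displayed)  # 1.000000e+08

# The stored value itself is unchanged; only the display is rounded.
print(f"{max_price:,}")  # 99,999,999
```

If you want the exact value, autos["price"].max() returns it without the rounded display.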

I removed 0 and 99999999 as price outliers. The next highest value was 27322222. But when I do describe() I get 1.300000e+06 as the maximum for price. Why is this?
To remove the outliers I input:
autos_c = autos_c[autos_c["price_USD"].between(1,2732222)]

[https://app.dataquest.io/jupyter/notebooks/notebook/Ebay%20Car%20Sales-Copy1.ipynb]

The next highest value after 99,999,999 is 27,322,222, but in your code you used 2,732,222, which is missing a digit. The second highest value is about 27 million, but you typed about 2.7 million.

The next highest value after the one you typed is 1,300,000 which is equal to 1.300000e+06.
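
A quick way to check this is to compare the two bounds side by side. The sketch below uses a toy Series built from the extreme values quoted in this thread plus two made-up ordinary prices; it is not the real dataset.

```python
import pandas as pd

# Toy stand-in for the price column: extreme values quoted in this thread
# plus two made-up ordinary prices (not the real dataset).
price = pd.Series([0, 2950, 7200, 1300000, 10000000, 27322222, 99999999])

# Upper bound with a digit missing (about 2.7 million).
typo_max = price[price.between(1, 2732222)].max()

# Intended upper bound (about 27 million); between() includes both endpoints.
intended_max = price[price.between(1, 27322222)].max()

print(typo_max)      # 1300000
print(intended_max)  # 27322222
```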

I see my mistake, thank you!

Hi, can you show me the code to remove the outliers 0 and 99999999? I tried the code shown in the instructions, but it does not work. I'm not sure if I did it correctly.

autos_c = autos_c[autos_c["price_USD"].between(1,27322222)]

The above is the code I used.

Hi,
I'm wondering what the criteria for outliers are. What is an acceptable range for price or odometer?
Also, should we remove the row or convert the value to NaN?
Thanks for any suggestions.

Hi @reza_45,

As far as I know, an outlier for price in this project is anything that would not be acceptable in a real-world situation. How much would someone pay to buy a car through eBay? Would they pay 99999999, 27322222, 12345678, etc.?

What I did was run a Google search to check the current prices of the brand-new cars to which these prices were assigned, and then decide on a maximum price that I thought was acceptable.

Now that was regarding maximum price.

The minimum price can be a bit tricky, since eBay is an auction site and bidding can start at $1. One of our co-learners also found that on German eBay some people list cars for $0! That was new to me. If you take those points into account, there won't be any outliers at the minimum end.

Regarding the odometer, find the minimum and maximum values and give them some thought. While looking at them, also look at the year of the vehicle. If the distance and the age of the vehicle make sense together, you know it is not an outlier.

I hope this helps.
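
For the earlier question about removing rows versus converting to NaN, here is a rough sketch of both options on a toy DataFrame. The column names, the sample values and the 350,000 cap are all assumptions for illustration, not a recommended threshold.

```python
import pandas as pd

# Toy data (made up for illustration, not the real dataset).
autos = pd.DataFrame({
    "price": [0, 500, 2950, 7200, 350000, 12345678, 99999999],
    "odometer_km": [150000, 125000, 90000, 50000, 5000, 150000, 10000],
})

# Look at both ends of the distribution before choosing a cutoff.
counts = autos["price"].value_counts().sort_index()
print(counts.head(3))
print(counts.tail(3))

# Hypothetical cap; pick your own after checking realistic prices.
max_price = 350000
plausible = autos["price"].between(0, max_price)

# Option 1: drop the rows with implausible prices.
dropped = autos[plausible]

# Option 2: keep the rows but replace implausible prices with NaN.
masked = autos.copy()
masked["price"] = masked["price"].where(plausible)

print(len(dropped))                  # 5
print(masked["price"].isna().sum())  # 2
```

Dropping rows also discards the other columns for those listings. Setting NaN keeps the rows, and pandas skips NaN values in describe() and mean().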

Hi there, when I execute the line below, it seems to convert the price column into a date field.

Any help is appreciated

autos["price"] = autos.loc[(autos["price"] > 0) & (autos["price"] > 27322222)]

Hi @bro

I'm not sure how that line could produce a date field. Could you share the notebook file so I can take a closer look?

Also, please have a look at the values that you want to select. Are you planning to select values between 0 and 27322222? In that case, have a look again at the logic used.

Alternatively, you can use the pd.Series.between() method.
You can read more about it here.
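
As a sketch of the logic issue hinted at above: both comparisons in the posted condition use >, so only prices above 27322222 pass, and the filtered DataFrame is assigned to a single column rather than back to autos. The toy values and the name column below are assumptions for illustration.

```python
import pandas as pd

# Toy data (made up for illustration, not the real dataset).
autos = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "price": [0, 2950, 27322222, 99999999],
})

# Both comparisons use ">", so only prices above 27322222 pass.
too_strict = (autos["price"] > 0) & (autos["price"] > 27322222)
print(autos.loc[too_strict, "price"].tolist())  # [99999999]

# Intended range: greater than 0 and at most 27322222.
in_range = (autos["price"] > 0) & (autos["price"] <= 27322222)

# Series.between(1, 27322222) selects the same integer prices.
assert in_range.equals(autos["price"].between(1, 27322222))

# Reassign the filtered DataFrame to autos, not to autos["price"].
autos = autos[in_range]
print(autos["price"].tolist())  # [2950, 27322222]
```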

Hi, I am not sure either, to be fair. I tried both 0 and 1 as the lower bound but got the same output. However, I tried a solution suggested above and got the right output:

autos = autos.loc[autos['price'].between(1, 27322222), :]

Glad you got the output. Also, consider whether 27322222 is too high a price for a second-hand car, then look at the next value below it, and so on.

Hi @jithins123,

Thanks for the information about the $0 asking price on eBay! I agree with you that in this case there is no outlier at the minimum end. It might not be a good idea to remove the $0 entries, since there are quite a few of them: 1420 in total.

How would you handle the outliers then? Only remove the outliers for the maximum value?