Exploring eBay Car Sales Data: Finding Outliers

@jithins123 @otavios.s @Rucha

Please, is there a standard formula for finding outliers?

I am still finding it difficult to convince myself that one range of values contains the low outliers and another the high outliers.

Please help me out. Thanks.


Yes, there is a standard way of identifying outliers:

Any value that lies more than 1.5 times the interquartile range (IQR, the 3rd quartile minus the 1st quartile) above the 3rd quartile (75th percentile) or below the 1st quartile (25th percentile) is considered an outlier.
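A minimal sketch of the 1.5 × IQR rule in pandas may help make it concrete. The helper name `iqr_fences` and the sample values are made up for illustration; only `Series.quantile` is a real pandas call, and it uses linear interpolation by default, the same as `describe()`.

```python
import pandas as pd


def iqr_fences(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) cut-offs of the k * IQR rule (hypothetical helper)."""
    q1 = values.quantile(0.25)  # the "25%" row of describe()
    q3 = values.quantile(0.75)  # the "75%" row of describe()
    iqr = q3 - q1               # interquartile range
    return q1 - k * iqr, q3 + k * iqr


# Made-up sample, not the eBay data.
sample = pd.Series([10, 12, 13, 14, 15, 16, 18, 95])

lower, upper = iqr_fences(sample)
is_outlier = (sample < lower) | (sample > upper)

print(lower, upper)
print(sample[is_outlier].tolist())
```

Values below `lower` are the low outliers and values above `upper` are the high outliers; either group can be empty.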


Thanks @markmanu21, I really appreciate it.
I had used the formula below:

Outliers that are considered low fall below [25% - 1.5 * (75% - 25%)]

While outliers considered high fall above [75% + 1.5 * (75% - 25%)]

When I checked the descriptive statistics of the data after removing the outliers, I got more confused.


Descriptive Statistics before removing outliers from odometer_km

count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

Descriptive statistics after using the above formula to remove outliers

count     41520.000000
mean     141736.030829
std       17102.004255
min       90000.000000
25%      150000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

Am I still on track?

Hi @aniefiokuduakobong

Let’s first understand what an outlier is!

In simplest terms, an outlier is a data point that is very different from the overall data and that warrants some explanation as to why it is so! I said different, not low or high.

Say you are analyzing some countries. You realize that most of these countries are on the African continent. You analyze further and find some countries such as Yemen, Oman and Saudi Arabia. These 3 countries belong to Asia but are very close to the African continent.

Going deeper into the data, you come across Uruguay. Now that’s your outlier! Why?
Because it is neither in Africa nor anywhere close to any of the African countries. The criterion I used to call a country either an inlier or an outlier is its geographical position.

In this case, there is neither a low nor a high. There’s only a given criterion: the data that passes it is an inlier, and the data that doesn’t conform to it becomes the outlier.

Now let’s understand outliers in terms of numerical data points.

Before we discuss the odometer readings, let’s talk about another column from this data set - yearOfRegistration.

image

The minimum value is 1000 and the max value is 9999. However, these are not just numbers; they are years. Let’s see what Wikipedia says about this:

image

The first recorded invention of a car was in the year 1769, and one of the first cars affordable to the masses was manufactured in the year 1908!
So the year 1000 is really an odd case.

The year 9999. If this were 2020 I would believe it; as in, if a car bought in 2012 went for registration in 2020, I would believe it, as this year anything can happen! :scream: :rofl:

This year is so far in the future that it does not make sense as a valid data point. Now, to check whether there are other years like this that are far in the future, we subject this column to a box plot. We get a result like this:

When we look at the value counts and select the last 15 rows, we get the following results:
image

If we consider that the dataset was created in Dec 2019, we can only include registration dates up to the date the dataset was created; anything beyond that is like knowing the future.

So for this column I selected as valid the years from 1908 to 2019 (inclusive). Then I get this boxplot:
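A range check like this can be sketched with `Series.between`, which includes both endpoints by default. The rows below are made up for illustration; only the column name `yearOfRegistration` and the 1908 to 2019 range come from this post.

```python
import pandas as pd

# Made-up rows; only the column name comes from the thread.
autos = pd.DataFrame(
    {"yearOfRegistration": [1000, 1910, 1999, 2005, 2016, 2019, 5000, 9999]}
)

# Boolean mask for years in the closed interval [1908, 2019].
valid = autos["yearOfRegistration"].between(1908, 2019)
autos_valid = autos[valid]

print(autos_valid["yearOfRegistration"].tolist())
print((~valid).sum(), "rows fall outside the accepted range")
```

Counting the rejected rows before dropping them is a quick way to see how much data the validation rule costs.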

In both of the above cases, I didn’t use a formula; I used a reasonable validation.

Now we come to the odometer column. The box plot looks like this:

What does it tell us? The box plot treats readings below about 90,000 km as outliers. Why did the box plot think that way? We look at the value_counts() for this column:
image

More than 60% of the data says 150000, so the 5-point summary is dominated by the value 150,000 km, as can be seen in the output of the describe method:
image

Three of the summary points, Q2 (median / 50%), Q3 (75%) and max, are all 150,000 km.
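To see why one dominant value collapses the summary, here is a small sketch on made-up readings (not the real column) in which 65% of the values sit at 150,000:

```python
import pandas as pd

# Made-up readings: 13 of 20 values (65%) equal 150,000.
readings = pd.Series(
    [5000, 20000, 50000, 80000, 100000, 125000, 125000] + [150000] * 13
)

share_at_cap = (readings == 150000).mean()
summary = readings.describe()

print(share_at_cap)
print(summary[["25%", "50%", "75%", "max"]])
```

Once more than half of the values equal the maximum, the median, Q3 and max must all land on that same value, which is the pattern in the describe output for odometer_km.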

The box plot identifies outliers using the same calculation you did in your previous post. I replicated it as a calculation:

image
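The calculation in the image is not reproduced here, but the same arithmetic can be sketched from the quartiles in the describe() output of odometer_km posted earlier in the thread (25% = 125000, 75% = 150000). The variable names are illustrative.

```python
# Quartiles copied from the describe() output of odometer_km in this thread.
q1 = 125000.0  # 25%
q3 = 150000.0  # 75%

iqr = q3 - q1                 # interquartile range: 25000.0
lower_fence = q1 - 1.5 * iqr  # readings below this count as low outliers
upper_fence = q3 + 1.5 * iqr  # readings above this count as high outliers

print(lower_fence, upper_fence)
```

The maximum reading is 150,000, so nothing can exceed the upper fence and only low readings are flagged. That is also consistent with the minimum of 90,000 in the statistics after removal, since 90,000 lies above the lower fence.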

Plotting the odometer values without the outliers as a boxplot, we get this weird, empty-looking figure:
image

But that’s how your data is now. The dataset comprises so many data points equal to 150,000 km that the box plot draws four of the summary points (Q1, Q2, Q3 and max) as one single red line.
image

Since the data points below 87,500 km are so far away from the median (the central tendency) of the column, they are considered outliers.

Outliers.ipynb (55.4 KB)

So what are your further questions?


@Rucha I am really impressed by your explanation. You went all the way from determining outliers by reasonable validation to validating the formula I put forth. :raised_hands:t2: :clap:t2:

You are really funny. This statement got me laughing even though I was not in the mood :rofl: :joy:

The year 9999. If this were 2020 I would believe it; as in, if a car bought in 2012 went for registration in 2020, I would believe it, as this year anything can happen! :scream: :rofl:

Thanks for the explanation. So I am on track.

So what are your further questions?

Yes. We are told to give our observations on the last_seen column after exploring it.

If the results below were your exploration of the column, what would your observation be?

autos['last_seen'].str[:10].value_counts(normalize=True,dropna=False).sort_index(ascending=True)

2016-03-05    0.001126
2016-03-06    0.004804
2016-03-07    0.006005
2016-03-08    0.008558
2016-03-09    0.010560
2016-03-10    0.011210
2016-03-11    0.013787
2016-03-12    0.025598
2016-03-13    0.009609
2016-03-14    0.013412
2016-03-15    0.016590
2016-03-16    0.017441
2016-03-17    0.029777
2016-03-18    0.007507
2016-03-19    0.016890
2016-03-20    0.022020
2016-03-21    0.021770
2016-03-22    0.022370
2016-03-23    0.019442
2016-03-24    0.020844
2016-03-25    0.020218
2016-03-26    0.017666
2016-03-27    0.016890
2016-03-28    0.022045
2016-03-29    0.023046
2016-03-30    0.025448
2016-03-31    0.024597
2016-04-01    0.024222
2016-04-02    0.025673
2016-04-03    0.025623
2016-04-04    0.025323
2016-04-05    0.118457
2016-04-06    0.209063
2016-04-07    0.122410
Name: last_seen, dtype: float64
autos['last_seen'].describe()
count                   39964
unique                  33465
top       2016-04-07 06:17:27
freq                        8
Name: last_seen, dtype: object
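As an alternative to slicing the strings, the column could be parsed as datetimes and counted per calendar day. This is only a sketch on made-up timestamps in the same "YYYY-MM-DD HH:MM:SS" layout; the names `daily_share` and `recent_share` are illustrative and not part of the project.

```python
import pandas as pd

# Made-up timestamps in the same layout as last_seen.
autos = pd.DataFrame({"last_seen": [
    "2016-03-12 10:00:00",
    "2016-04-05 12:00:01",
    "2016-04-06 08:15:00",
    "2016-04-06 21:30:45",
    "2016-04-07 06:17:27",
]})

# Parse once, drop the time of day, then take the share of rows per day.
last_seen = pd.to_datetime(autos["last_seen"], format="%Y-%m-%d %H:%M:%S")
daily_share = last_seen.dt.normalize().value_counts(normalize=True).sort_index()

# Combined share of the three most recent days.
recent_share = daily_share.tail(3).sum()

print(daily_share)
print(recent_share)
```

Applied to the proportions posted above, the same `tail(3)` sum would add 0.118457, 0.209063 and 0.122410, roughly 45% of all rows, which is one concrete figure an observation about this column could start from.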