How did you deal with the odometer column?

Hi, good afternoon. While working on this project I started checking the odometer data and I feel a bit confused. Almost 35% of the rows in the provided dataset use 150000km as the odometer value, and it's hard for me to understand whether that's a coincidence or whether all of them are outliers. With the price it was much clearer, because there were only a few cars worth over 1 million dollars and they were clear outliers.

How do you think I should deal with the odometer-related rows without having any knowledge of the car business itself? Should I treat them as outliers? Is there a strategy for knowing when to treat values as outliers and when not?

Thanks in advance.
Greetings

Hello @brianrey3, welcome to the community!

The values in this column seem to have been rounded. Another possibility is that they were treated as categorical values. In this case, you can think of them as ranges: 0 to 5,000; 5,001 to 10,000 and so on.

In any case, as the highest value is 150,000 and the lowest is 5,000, I'd say all of them seem to be plausible values. Therefore, I do not see any outliers.
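If you want to check the rounding/bucketing idea yourself, a quick tally of the distinct values helps. This is only a minimal sketch: the `odometer_km` list is a made-up sample, not the real dataset, and `value_shares` is a hypothetical helper name. With pandas, `Series.value_counts(normalize=True)` gives the same kind of tally on the real column.

```python
from collections import Counter

# Hypothetical sample of odometer readings in km (NOT taken from the real dataset).
odometer_km = [150000, 150000, 125000, 150000, 90000,
               5000, 125000, 150000, 70000, 150000]

def value_shares(values):
    """Return {value: share of rows}, ordered by value."""
    counts = Counter(values)
    total = len(values)
    return {value: counts[value] / total for value in sorted(counts)}

shares = value_shares(odometer_km)
for value, share in shares.items():
    print(f"{value:>7} km: {share:.0%}")

# Few distinct values that are all multiples of 5,000 point to rounding
# or bucketing rather than to individual outliers.
distinct_values = len(shares)
all_multiples_of_5000 = all(value % 5000 == 0 for value in shares)
print(distinct_values, all_multiples_of_5000)
```

If the real column also shows only a handful of distinct round values, the large share at 150,000 looks more like the top bucket of a rounded scale than a cluster of outliers.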

Hi @brianrey3,

In machine learning, the best results often come from domain knowledge. So even if you really do not know anything about cars, this is a great opportunity to pick some up.

After three Google searches I found, for example, that the average mileage/odometer count per year is 20,000.
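To show how a figure like that can be used, here is a rough back-of-the-envelope sketch. It assumes the yearly average is in km (the unit was not stated in what I found) and that every car is driven at that constant rate, which is of course a simplification. The constant and function names are just illustrative.

```python
AVG_KM_PER_YEAR = 20_000   # yearly average quoted above; km is an assumption
MAX_RECORDED_KM = 150_000  # highest odometer value in the dataset

def expected_km(age_years, km_per_year=AVG_KM_PER_YEAR):
    """Expected odometer reading for a car of the given age, at a constant rate."""
    return age_years * km_per_year

# How many years of average driving it takes to reach the highest recorded value.
years_to_reach_max = MAX_RECORDED_KM / AVG_KM_PER_YEAR
print(years_to_reach_max)

# Compare a few ages against the highest recorded value.
for age in (2, 5, 10, 15):
    km = expected_km(age)
    print(age, km, km >= MAX_RECORDED_KM)
```

Under these assumptions a car reaches 150,000 after 7.5 years, so in a used-car dataset with many older cars a large share of rows at that value would not be surprising.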

Another example of domain knowledge is the Titanic challenge on Kaggle, where it really matters to know how the ship was built. People post complete pictures showing where the decks were, et cetera :smiley:

There are more friends than Stack Overflow, it seems :wink:

Good luck machine learning!