Question to the Community
I would like to know what people think about my strategy to remove data where:
- price is 500,000 EUR or more
- price is 1 EUR or less but mileage is only 5,000 km
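For anyone who wants to try the same two rules, here is a minimal pandas sketch of how they could be written. The column names `price` and `odometer_km` and the toy rows are assumptions for illustration, not taken from my notebook, and I am reading "only 5,000 km" as "5,000 km or less".

```python
import pandas as pd

# Toy data for illustration; "price" and "odometer_km" are assumed column names.
autos = pd.DataFrame({
    "price": [0, 1, 1, 4500, 499999, 500000, 10000000],
    "odometer_km": [5000, 5000, 150000, 125000, 20000, 5000, 5000],
})

# Rule 1: price is 500,000 EUR or more.
too_expensive = autos["price"] >= 500_000

# Rule 2: price is 1 EUR or less AND mileage is 5,000 km or less
# (an assumed reading of "only 5,000 km").
free_but_nearly_new = (autos["price"] <= 1) & (autos["odometer_km"] <= 5_000)

# Keep the mask around so the dropped rows can be inspected before discarding.
to_drop = too_expensive | free_but_nearly_new
cleaned = autos[~to_drop]

print(f"dropping {to_drop.sum()} of {len(autos)} rows")
print(autos[to_drop])
```

Building the mask first and filtering second makes it easy to eyeball what is about to be removed instead of deleting it blindly.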
This is my first shared project, so please bear with me. Thank you!
In this person’s analysis (Link), I noticed the statement:
"Mercedes Benz vehicles are by far the most expensive out our top brands, on average costing three times more than the second most expensive brand, Audi."
In my analysis, Audi was only slightly more expensive than Mercedes Benz.
Because I speak German, a quick glance at the names of the really high-priced listings showed me that around half of them were actually “Wanted” postings.
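A rough way to flag such postings without reading every name is a keyword check. The sketch below is only an illustration: the keyword list ("suche", "gesucht", "ankauf") is my own guess at typical wording, the example names are made up, and the function name `looks_like_wanted_ad` is hypothetical.

```python
import re

# Hypothetical keyword list: German words that often signal a "Wanted" ad.
# This is a guess for illustration, not an exhaustive or validated list.
WANTED_KEYWORDS = {"suche", "gesucht", "ankauf"}

def looks_like_wanted_ad(name: str) -> bool:
    """Return True if any whole word in the listing name is a wanted-ad keyword."""
    # Split on anything that is not a (German) letter, so spaces,
    # underscores and digits all act as word separators.
    tokens = re.split(r"[^a-zäöüß]+", name.lower())
    return any(token in WANTED_KEYWORDS for token in tokens)

# Made-up listing names for illustration.
names = [
    "Suche Porsche 911",
    "Ankauf_aller_Fahrzeuge",
    "BMW 320i Touring",
    "Besuche uns im Autohaus",  # contains "suche" only inside another word
]
flags = [looks_like_wanted_ad(n) for n in names]

for name, flag in zip(names, flags):
    print(f"{flag!s:5}  {name}")
```

Matching whole words rather than substrings avoids flagging names like "Besuche ...". With pandas, the same function could be applied to the name column via `.apply(looks_like_wanted_ad)`, and the flagged rows reviewed by hand before dropping them.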
My question to the community: how much time should I spend picking apart outliers?
To me it’s pretty obvious that no car on eBay is likely to cost 10 million EUR, and that most cars are below 500,000 EUR. So is it enough to cut out the wildest entries (i.e., those that are orders of magnitude off) and leave the rest?
I could spend half my time just looking at these outliers to confirm they should be discarded … but is it really that bad if I discard a few wild entries that are actually valid? I mean, even if they are valid, don’t they wreak havoc on my dataset anyway?
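To make the "havoc" concrete, here is a small sketch using only the standard library. The prices are invented for illustration; the point is just to compare how the mean and the median react to a single extreme entry.

```python
from statistics import mean, median

# Invented prices in EUR: seven plausible listings plus one wild entry.
prices = [1200, 2500, 3900, 5400, 7800, 9900, 14500, 10_000_000]

# Same data with the 500,000 EUR cutoff applied.
trimmed = [p for p in prices if p < 500_000]

print(f"with outlier:    mean={mean(prices):>12,.0f}  median={median(prices):>8,.0f}")
print(f"without outlier: mean={mean(trimmed):>12,.0f}  median={median(trimmed):>8,.0f}")
```

In this toy data the mean changes by a factor of roughly 200 when the one extreme entry is removed, while the median barely moves. So if a brand comparison relies on averages, a handful of wild entries can decide the result, and reporting the median next to the mean is a cheap way to check how sensitive a conclusion is to the cutoff.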
kwu_ebay.ipynb (73.1 KB)
Thank you very much for any kind feedback. I am still learning formatting and presentation. I realize they are important … but one step at a time!