Exploring Ebay Cars Sales Data

Screen Link: https://app.dataquest.io/m/294/guided-project%3A-exploring-ebay-car-sales-data/4/exploring-the-odometer-and-price-columns

Your Code: autos["price"] = autos[autos["price"].between(1, 3500001)]

What I expected to happen: I tried to filter the column and used the code above to filter (leaving loc out since I only wanted to filter through the column)

The result was that my price column values seem to have been changed to strings, which probably means my values have been swapped with another column’s. Can someone explain why?

The price column is now the same as the 1st column of autos. (You can see it when you use autos.head().) It happened because of assigning the filtered autos dataframe to a single column, and it looks like it just put in the first column that fit and called it a day.

I got it!!!
Thank you

This code ultimately turned my price column into an object with some pretty obscure values. Am I doing something incorrectly?

Hi Chris. The code the original poster entered contained an error that does exactly what you’re seeing. It is an example of what we don’t want to do, because autos[autos['price'].between(1,350000)] returns a dataframe object. When that dataframe is assigned to the series autos['price'], only its first column fits, which in this dataframe holds the date/time of the listing. Instead of assigning the result to autos['price'], assign it to autos.

Hi April - thank you for the quick response! I guess I’m confused on how I exclude values outside of the range of 1-350,000.

When analyzing the data, it is clear that there are some odd high prices however some seem legitimate i.e. the 10,000,000 Ferrari COULD be legitimate and therefore would I really want to exclude that value? So how would I get rid of certain values in that 350,000 - 10,000,000 range? Using the .between method only shows those values but doesn’t omit them…am I making sense :slight_smile:

When we say autos = autos[autos['price'].between(1,350000)], it will assign the result back to the autos dataframe, so that the autos dataframe will then only contain the rows where the price was between 1 and 350000. Is that a bit more clear?
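As a minimal sketch of that assignment, here is a tiny stand-in dataframe. The column values are made up for illustration, and the real autos dataframe has many more columns and rows:

```python
import pandas as pd

# Hypothetical stand-in for the autos dataframe (values invented for illustration).
autos = pd.DataFrame({
    "date_crawled": ["2016-03-26", "2016-04-04", "2016-03-12", "2016-04-01"],
    "price": [0, 5000, 999999, 350000],
})

# Boolean mask: True where price is within [1, 350000]; both ends are inclusive by default.
mask = autos["price"].between(1, 350000)

# Assign the filtered rows back to the whole dataframe, not to a single column.
autos = autos[mask]

print(autos)
```

Only the rows whose price passes the mask remain, and the price column keeps its numeric dtype because no column is overwritten.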

As far as keeping certain values that fall outside that range, it would depend on what you’re doing with it. If you’re trying to find the average cost of the used cars, a number that large would definitely cause the mean to be higher than most of the cars. If we want a more realistic representation of the cars selling on Ebay, it makes sense to get rid of the outliers.

Got it - thank you so much!

Hi @april.g
thanks for your answers, but I see that 350000 is far away from the mean (9840), median (2950), Q1 (1100) and Q3 (7200); even the mode is 0.

I followed the steps in the link below, and what I get is that the data lies between 0 and 19400. Is that right?

thanks in advance

I’m not really sure what you’re asking, Waleed. I’m guessing you’re wondering why we wouldn’t choose the interval (0, 19400) or (1, 19400) instead? That’s probably reasonable as long as it’s an acceptable amount of data loss. One of the reasons for making the cut-off at 350,000 has to do with an observation in the value_counts for the price column, where we suddenly see a price jump from 350,000 to 999,999. (Have a look with autos['price'].value_counts().sort_index(ascending=True).tail(20) to see what I mean.) Just eliminating those highest 14 values brings the mean from 9840 to around 5700, without having to get rid of too many rows.
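To see how that kind of check works, here is a rough sketch on a handful of invented prices (not the real dataset, so the numbers will not match the ones quoted above). It lists the highest distinct prices and compares the mean before and after trimming:

```python
import pandas as pd

# Made-up prices for illustration only; not the real dataset.
prices = pd.Series(
    [500, 1200, 1200, 3000, 7500, 350000, 999999, 10000000], name="price"
)

# Highest distinct prices with their counts, sorted by price.
top = prices.value_counts().sort_index(ascending=True).tail(3)
print(top)

# Mean with and without the values above the 350,000 cut-off.
mean_all = prices.mean()
mean_trimmed = prices[prices.between(1, 350000)].mean()
print(mean_all, mean_trimmed)
```

A visible gap in the sorted index is a hint for where a cut-off might go, and the two means show how much a few extreme values pull the average.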

It’s up to you how you want to proceed. You just have to explain your justification to the reader so they know why you did it. :slight_smile:

You are the best thanks for this concise but yet explicit explanation

Please, is there a possibility of getting a car from eBay at $1? What is the lowest reasonable amount a car can go for on eBay? I feel $1 as the minimum is too low.

I actually don’t know that a car would actually sell for $1 on eBay. Probably not one that runs, anyway. :rofl: I would guess these starting prices are to try to get bidders, but I would imagine that there is a minimum that has to be achieved first. We don’t really have access to that information. You can make a decision on a reasonable range; you just need to explain and justify the reasons to your reader. I can’t remember who it was, but another student who posted a project used their statistics knowledge to determine the range instead, and the results were pretty similar.
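The thread doesn't say which statistical method that student used. One common rule of thumb is the 1.5 × IQR fence, sketched below on invented prices; treat the choice of rule and the lower bound of 1 as assumptions, not as the student's approach:

```python
import pandas as pd

# Made-up prices for illustration only.
prices = pd.Series([0, 500, 1100, 2950, 7200, 9000, 350000, 999999])

q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles; the lower fence is clamped to 1
# so that zero-price listings are dropped (an assumption for this sketch).
lower = max(q1 - 1.5 * iqr, 1)
upper = q3 + 1.5 * iqr

kept = prices[prices.between(lower, upper)]
print(lower, upper)
print(kept.tolist())
```

Whatever rule is used, the resulting range still needs to be explained and justified to the reader.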

I’m getting this message when trying to get rid of the outliers using the exact same line:
autos = autos[autos['price'].between(1,350000)]

TypeError: unorderable types: str() >= int()

Hi @mishpedraza, welcome to the community!
Did you perhaps not convert the price column to int before running this code? That’s the first thing I would check. Let me know how it goes.
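Here is a minimal sketch of that conversion, assuming the raw price values are strings such as "$5,000" (the exact format is an assumption here, so check your own column first):

```python
import pandas as pd

# Hypothetical raw data: prices stored as strings with "$" and "," characters.
autos = pd.DataFrame({"price": ["$5,000", "$0", "$350,000", "$11,111,111"]})

# Strip the non-numeric characters, then convert the column to int.
autos["price"] = (
    autos["price"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(int)
)

# With an integer column, the comparison inside between() no longer mixes str and int.
autos = autos[autos["price"].between(1, 350000)]
print(autos["price"].tolist())
```

If autos['price'].dtype still shows object before the filter line, the conversion has not been applied yet.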

What I get is that the data lies between 0 and 25500.0.

Hi April, I read your answer and understand but what should we do to prevent that from happening? Should we just leave those dates & times as values in the price columns?

No, you definitely don’t want to have dates and times in the price column! It was the result of an error in assigning a dataframe to a single column. You may need to rerun your cells up to this point, and then change the line to say autos = instead of autos['price'] = so that the price column isn’t overwritten with the dates and times.

Thanks @april.g, it worked.

You should not assign your selection to a particular column; instead, save it to autos or to some new variable.