Let's count the number of listings with cars that fall outside the 1900 - 2016 interval

Screen Link: https://app.dataquest.io/m/294/guided-project%3A-exploring-ebay-car-sales-data/6/dealing-with-incorrect-registration-year-data

I tried to count the listings by checking how many ‘datecreated’ values fall outside 1900-2016.

My Code:

autos.loc[autos['year']<1900,:]

What actually happened:

0 rows were returned.

It seems that no data falls outside 1900-2016. By the way, I extracted the year from the ‘datecreated’ column.

I haven’t started this project yet, but I checked the autos CSV file.

The datecreated column only contains dates from 2015 and 2016, so no date in it is earlier than 1900.

Hello @candiceliu93,

To count the number of listings with cars that fall outside the 1900 - 2016 interval, use the line of code below:
(~autos['registration_year'].between(1900,2016)).sum()

I hope this answers your question.

Hi! I think the code you provided counts how many listings fall between 1900 and 2016.

The instructions ask us to count how many fall outside 1900-2016. Have I understood that correctly?

Note that there is a ~ at the beginning of the line of code.

The ~ operator is the logical NOT: it inverts each True/False value in a boolean Series.

Hi @candiceliu93, hope you’re doing alright.

Actually, @doyinsolamiolaoye is right. Remember that pandas uses different symbols for its logical operators:

  • | = OR
  • & = AND
  • ~ = NOT

So in English, doyinsolamiolaoye’s answer says something like: the sum of NOT (the values in the registration_year column of autos that are between 1900 and 2016). That gives the total number of cars outside the 1900-2016 range.
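To make the operators above concrete, here is a small sketch. The seven sample years are made up for illustration; the real autos dataset is much larger, so the count you get there will differ.

```python
import pandas as pd

# Hypothetical sample data, not the real autos dataset.
autos = pd.DataFrame(
    {"registration_year": [1000, 1899, 1900, 2005, 2016, 2017, 9999]}
)

# True where 1900 <= year <= 2016 (both ends included by default).
in_range = autos["registration_year"].between(1900, 2016)

# ~ (NOT) flips every True to False and every False to True.
out_of_range = ~in_range

# The same mask written with | (OR) instead of ~ (NOT).
out_of_range_or = (autos["registration_year"] < 1900) | (
    autos["registration_year"] > 2016
)

# Summing a boolean Series counts its True values.
n_outside = int(out_of_range.sum())
print(n_outside)
```

Both masks mark the same rows, so either form can be used to count the listings outside the range.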

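There’s another way to approach it:
autos = autos[(autos['registration_year'] >= 1800) & (autos['registration_year'] <= 2016)]
Here you get a DataFrame that keeps only the autos with a registration year between 1800 and 2016. I used 1800 because the first cars came out in 1886, but now I think that using 1900 is better. Note that this filters the rows rather than counting them, so to get the number of listings outside the range you would compare the row count before and after filtering.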
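A minimal sketch of the filtering approach, using 1900 as the lower bound and made-up sample years (an assumption for illustration, not the real data). It also shows how to recover the number of out-of-range rows from the filtered result.

```python
import pandas as pd

# Hypothetical sample data, not the real autos dataset.
autos = pd.DataFrame(
    {"registration_year": [1000, 1899, 1900, 2005, 2016, 2017, 9999]}
)

rows_before = len(autos)

# Keep only rows whose registration year is inside 1900-2016.
filtered = autos[
    (autos["registration_year"] >= 1900) & (autos["registration_year"] <= 2016)
]

# Equivalent filter written with Series.between.
filtered_between = autos[autos["registration_year"].between(1900, 2016)]

# Rows dropped by the filter are the ones outside the range.
rows_removed = rows_before - len(filtered)
print(rows_removed)
```

The sketch assigns the result to a new name instead of overwriting autos, so the original row count stays available for comparison.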

Anyway, try both ways with the years you choose and compare the results. Remember that there are different ways to solve the exercises. There are no wrong answers, only more or less efficient ones.

Good luck!

Thank you!! Got it now!

Thank you for explaining it so well. It is so clear!

Hi @doyinsolamiolaoye, could you please clarify why one should use .sum() in the code you sent rather than .count()? The latter would make more sense to me (though I tested it and it doesn’t work), since we’re counting the rows that don’t fall between 1900 and 2016, not actually summing them.

Hope this makes sense, and thanks.

Hi @steven.nagliati

Series.between(left, right, inclusive=True)

Return boolean Series equivalent to left <= series <= right.

This function returns a boolean vector containing True wherever the corresponding Series element is between the boundary values left and right. NA values are treated as False.

The above is from the pandas documentation.

So you can see that the result of Series.between() is a boolean Series. When you apply .sum() to a boolean Series, each True counts as 1 and each False as 0, so it returns the total number of True values. Hence you get the count of rows for which the condition is True.
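Here is a short sketch of the difference between .sum() and .count() on a boolean Series. The four sample years are made up for illustration.

```python
import pandas as pd

# Hypothetical sample years, not the real autos dataset.
years = pd.Series([1000, 1900, 2005, 2017], name="registration_year")

# True where the year is outside 1900-2016.
mask = ~years.between(1900, 2016)

# .sum() adds up the values, with True as 1 and False as 0.
n_true = int(mask.sum())

# .count() counts every non-null entry, whether it is True or False.
n_non_null = int(mask.count())

print(n_true, n_non_null)
```

Because .count() ignores whether a value is True or False, it returns the length of the Series (minus any nulls), which is why it does not give the number of out-of-range rows.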

I hope this clarifies your doubt.
