Ebay Car Sales - step 5 - Exploring date columns

Screen Link: https://app.dataquest.io/m/294/guided-project%3A-exploring-ebay-car-sales-data/5/exploring-the-date-columns

I am unable to understand what we are actually required to do.
Can anybody explain the steps a bit more clearly?

Thanks

Yeah!! I’m also facing the same issue again and again. I have tried several times but the result stays the same. Hope Dataquest solves this issue asap.

Hey folks! Not 100% sure what isn’t clear in the steps of this page — can you let me know what your current level of understanding is so I can try and fill in the gaps?

Best,
Dee

hey @samina.rana and @tusharsingh00

I hope this helps you somewhat.

These are date columns but have data stored as str/object dtype in the format: 26-03-2016 17:47 in the dataset.

This is what the exercise wants us to do:

  • extract only the date-month-year component, i.e. only 26-03-2016
  • get a count of each of these dates using value_counts(), but as a percentage rather than a raw count, so use the normalize parameter as given in the task, for each of the series/columns
  • the dates need to be sorted in ascending order, that is from the oldest/earliest date to the latest/most recent date. After value_counts() the dates are in the index of the result, so use the sort_index() method (sort_values() would sort by the percentages instead)
  • the condition is to use all these methods in a single chain.

So something like
Series.<data_manipulation>.value_counts(normalize=...).sort_index(ascending=True/False)
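
The chain above can be sketched roughly as follows. This is only a minimal sketch: the sample values are made up, and it assumes the timestamps are stored as strings starting with a 10-character `YYYY-MM-DD` date. If your copy shows day-first dates such as 26-03-2016, sorting the strings will not be chronological and you would need to parse them first (for example with `pd.to_datetime`).

```python
import pandas as pd

# Hypothetical sample standing in for a date column such as
# autos["date_crawled"]; the values are invented for illustration.
date_crawled = pd.Series([
    "2016-03-26 17:47:46",
    "2016-03-05 14:06:30",
    "2016-03-26 18:57:24",
    "2016-04-04 16:36:23",
])

distribution = (
    date_crawled
    .str[:10]                                    # keep only the date part
    .value_counts(normalize=True, dropna=False)  # proportions, not counts
    .sort_index()                                # earliest date first
)
print(distribution)
```

The result is the "distribution" the exercise asks about: for each date, the share of rows that fall on it.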

I had the same question! What the heck is a “distribution”? I mean, I can intuit what it is, but the concept was never before mentioned in the lessons :frowning:

Also, what confuses me is that this is a project, yet the instructions seem like the kind you would get in a lesson. In other words, this section seems more like “type this” than “figure out how to do x”. This confused me, because I thought I had to be familiar with the “normalize” parameter already, and so on.

But @Rucha came to the rescue. Thanks!

I so agree with you!!

In this guided project, the more lessons I do, the more I feel it is just “type it”.
New methods are not explained in detail in this project.

Hope Dataquest will fix it!!

Should we be removing the values of registration_year that don’t make sense? After running:

autos["registration_year"].describe()

and

autos["registration_year"].value_counts().sort_index(ascending=False)

you can see that there are ~20 years that don’t make sense.

9999 3
9000 1
8888 1
6200 1
5911 1
5000 4
4800 1
4500 1
4100 1
2800 1

1800 2
1111 1
1001 1
1000 1

Edit: It’s the next mission.

hey @eddiea.barillas

Think about it: were cars invented in the year 1000? And can a car be registered in the future? Even the movie Back to the Future Part II was set in 2015.

Moreover, Tesla has already given us the e-car, so I don’t think these models will get registered in 9999 either! But that’s just me. :slight_smile:
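
If you do decide to drop the implausible years, one way to do it is a boolean filter. This is only a sketch with made-up sample values. The bounds 1900 and 2016 are my assumption (the upper one because the listings were crawled in 2016); pick whatever range you can justify.

```python
import pandas as pd

# Hypothetical sample standing in for the real autos dataframe.
autos = pd.DataFrame(
    {"registration_year": [1000, 1800, 1999, 2005, 2016, 2800, 9999]}
)

# Assumed plausible range; both bounds are inclusive.
plausible = autos["registration_year"].between(1900, 2016)
cleaned = autos[plausible]

print(cleaned["registration_year"].value_counts(normalize=True).sort_index())
```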

Could you please tell me why unique, top and freq are returned as NaN for both registration_year and registration_month when we run the df.describe() function?

hi @jithins123

Have you tried the describe function on the individual series for each of the columns?

The official documentation lists which information is provided for numeric columns, and it doesn’t include the 3 properties you have mentioned.

Hi @Rucha,

Thank you for the clarification. I have gone through the documentation.

For numeric data, the result’s index will include count , mean , std , min , max as well as lower, 50 and upper percentiles. By default the lower percentile is 25 and the upper percentile is 75 . The 50 percentile is the same as the median.

For object data (e.g. strings or timestamps), the result’s index will include count , unique , top , and freq . The top is the most common value. The freq is the most common value’s frequency. Timestamps also include the first and last items.

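So when numeric and object columns are summarised together, unique, top and freq will be NaN for the numeric columns, and similarly the numeric statistics (mean, std, min, max and the percentiles) will be NaN for the categorical columns; count is reported for both. Now I can see the pattern. I assumed there must be something wrong in the registration_year column and didn’t explore other possibilities. Thank you for the quick help.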
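
For anyone else who runs into this, here is a small sketch of the pattern, using a made-up two-column dataframe (the column names and values are just for illustration). It assumes describe is called with `include="all"`, which is what puts numeric and object columns in the same table.

```python
import pandas as pd

# Hypothetical dataframe with one numeric and one object column.
df = pd.DataFrame({
    "registration_year": [1999, 2005, 2005, 2016],
    "brand": ["bmw", "audi", "bmw", "bmw"],
})

# Summarise both kinds of column in one table: each column gets NaN
# in the rows that do not apply to its dtype.
summary = df.describe(include="all")
print(summary)

# Describing a single object column returns only count/unique/top/freq.
brand_summary = df["brand"].describe()
print(brand_summary)
```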