I don't understand step 5

Screen Link: https://app.dataquest.io/m/294/guided-project%3A-exploring-ebay-car-sales-data/5/exploring-the-date-columns

"Use the workflow we just described to calculate the distribution of values in the date_crawled, ad_created, and last_seen columns (all string columns) as percentages.

  • To include missing values in the distribution and to use percentages instead of counts, chain the Series.value_counts(normalize=True, dropna=False) method."

Could someone give the answer to this instruction and explain it?

Hi! @candiceliu93

The workflow is described in the exercise tutorial. A workflow is just an approach to accomplishing a particular task. Here, the task is to get the sorted distribution of values, as percentages, for the date_crawled, ad_created, and last_seen columns.

To explain, I'll work through one column; you can apply the same procedure to the rest.

We know that all three columns hold strings, so to get the distribution of dates in date_crawled:

First, we have to extract only the date from each string in the date_crawled column.
Code:
autos['date_crawled'].str[:10] # The date is the first 10 characters of each string (indices 0 through 9)

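Then, you have to see the distribution/frequency as percentages, which you can achieve with:
.value_counts(normalize=True, dropna=False) # normalize=True returns each value's share of the total (a fraction between 0 and 1; multiply by 100 for a percentage) instead of raw counts. dropna=False keeps missing values in the result.
So, now the code will be: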
autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False)
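
To see what the two arguments do, here is a minimal sketch on a made-up Series (the four values below are hypothetical and are not taken from the eBay data set). It compares raw counts with normalized shares and keeps the missing value in both:

```python
import pandas as pd

# Hypothetical sample: three dates and one missing value.
dates = pd.Series(["2016-03-05", "2016-03-05", "2016-03-06", None])

# Raw counts, with the missing value kept as its own entry.
counts = dates.value_counts(dropna=False)

# Shares of the total (fractions that sum to 1), missing value included.
shares = dates.value_counts(normalize=True, dropna=False)

# The same shares expressed as percentages.
percentages = shares * 100

print(counts)
print(percentages)
```

Note that the output never shows a % sign; normalize=True only changes counts into fractions.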

Next, you have to sort the distribution by date in ascending order, which you can achieve with:
.sort_index() # The dates are the index of the result, so this sorts from the earliest date to the latest.
So, the final code will be:
autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index()

Now you can repeat the same steps on the rest of the columns.
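
As a rough sketch of that repetition, the loop below applies the same chain to all three columns. The small autos DataFrame here is a hypothetical stand-in with made-up timestamps so the snippet runs on its own; in the project, autos is the DataFrame you already loaded.

```python
import pandas as pd

# Hypothetical stand-in for the project's autos DataFrame (made-up values).
autos = pd.DataFrame({
    "date_crawled": ["2016-03-05 14:06:30", "2016-03-05 17:21:10",
                     "2016-03-06 09:00:00", "2016-03-07 11:45:12"],
    "ad_created": ["2016-03-04 00:00:00", "2016-03-05 00:00:00",
                   "2016-03-05 00:00:00", "2016-03-07 00:00:00"],
    "last_seen": ["2016-04-05 12:00:00", "2016-04-06 08:30:00",
                  "2016-04-06 19:15:00", "2016-04-07 03:16:17"],
})

distributions = {}
for column in ["date_crawled", "ad_created", "last_seen"]:
    distributions[column] = (
        autos[column]
        .str[:10]                                    # keep only the date part
        .value_counts(normalize=True, dropna=False)  # shares, missing values kept
        .sort_index()                                # earliest date first
    )
    print(distributions[column])
```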
Hope this helps!


Thank you for explaining it.

That is what I did, but I thought the output would show numbers with a % sign.


Hello @harsh.raizada ,

Like everyone else, I was stuck here, so thank you for your explanation.

But I still have a doubt.

Why are we calculating the frequency/percentage for those columns?

I don't understand the purpose of this step:
"To include missing values in the distribution and to use percentages instead of counts, chain the Series.value_counts(normalize=True, dropna=False) method."


Hi! @hulesameer1149

We need this step to calculate how the values in each column are distributed over the different dates.
For example, in the date_crawled column we want to know when the site was crawled and how often on each day, and whether the crawling was consistent or intermittent over the month.

We include missing values so that the shares are computed over every row, including rows with no date, and percentages give a clearer picture of the distribution than raw counts.
It's just data exploration.

Hope this helps!

Hi @harsh.raizada Thank you for your explanation.

I am sorry if my question is simple, but it is very important for me in analyzing the data. What does "crawling a website" mean?

I found complex explanations on the web, such as that it is software that indexes web pages, and so on. But what does it mean for this eBay cars data set?

Thank you

May I ask about the line autos['date_crawled'].str[:10], where you said the date only goes up to the 9th index? I don't quite understand it. Wasn't the initial count of rows 50,000? Why does the index only go up to 9?


Sorry, I think I got it now. It actually refers to the number of characters in each string: 2016-03-07 consists of 10 characters. Thank you.
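
For anyone with the same confusion, a small sketch of the difference, using two made-up timestamps (not values from the data set): .str[:10] cuts each string down to its first 10 characters, while plain slicing on the Series selects rows.

```python
import pandas as pd

# Hypothetical timestamps in the same format as date_crawled.
timestamps = pd.Series(["2016-03-07 13:45:10", "2016-04-01 08:02:59"])

# .str[:10] slices inside every string: characters at positions 0 through 9.
date_only = timestamps.str[:10]

# Plain slicing works on rows instead: this keeps only the first row.
first_rows = timestamps[:1]

print(date_only.tolist())
print(len(date_only), len(first_rows))
```

The number of rows is unchanged by .str[:10]; only the length of each string changes.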