The exercise tutorial mentions a workflow, which is simply an approach to completing a particular task. Here the task is to get the sorted distribution of values, as percentages, for the date_crawled, ad_created, and last_seen columns.
To explain, I will use only one column; you can apply the same procedure to the rest of the columns.
We know that all of these columns are stored as strings, so to get the distribution of dates in the date_crawled column:
First, extract only the date from each string in the date_crawled column. Code: autos['date_crawled'].str[:10] # the date occupies the first 10 characters of each string (indexes 0 to 9)
Then, look at the distribution/frequency as percentages, which you can get with .value_counts(normalize=True, dropna=False) # normalize=True returns relative frequencies (fractions that sum to 1; multiply by 100 for percentages), while normalize=False returns only counts. So the code is now: autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False)
Next, sort the distribution in ascending order of date, which you can do with .sort_index() # this sorts by the index (the dates), not by the frequencies. So the final code is: autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
Now you can repeat the same steps on the rest of the columns.
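Putting the steps together, here is a minimal sketch. The tiny autos DataFrame and the date_distribution helper are made up for illustration; in the exercise you would use the real autos DataFrame instead.

```python
import pandas as pd

# Hypothetical sample standing in for the real autos DataFrame
autos = pd.DataFrame({
    "date_crawled": [
        "2016-03-26 17:47:46",
        "2016-04-04 13:38:56",
        "2016-03-26 18:57:24",
        "2016-03-12 16:58:10",
    ],
    "last_seen": [
        "2016-04-06 06:45:54",
        "2016-04-06 14:17:17",
        "2016-04-05 12:15:31",
        "2016-04-06 10:01:02",
    ],
})

def date_distribution(frame, column):
    """Relative frequency of each date in a string timestamp column (hypothetical helper)."""
    return (
        frame[column]
        .str[:10]                                     # keep the first 10 characters: YYYY-MM-DD
        .value_counts(normalize=True, dropna=False)   # fractions instead of counts, NaN included
        .sort_index()                                 # order by date, earliest first
    )

crawled_dist = date_distribution(autos, "date_crawled")
print(crawled_dist)

# The same steps applied to another column
print(date_distribution(autos, "last_seen"))
```

Sorting by index works chronologically here only because the dates are in YYYY-MM-DD form, where alphabetical order and date order agree.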
Hope this helps!
Like everyone else, I was stuck here, so thank you for your explanation.
But I still have a doubt.
Why are we calculating the frequency/percentage for those columns?
I don't understand the purpose of this step.
To include missing values in the distribution and to use percentages instead of counts, chain the Series.value_counts(normalize=True, dropna=False) method.
We need this step to see how the values in each column are distributed across the different dates.
For example, in the date_crawled column we want to know when the site was crawled and how often on each day: was the crawling consistent over the month, or intermittent?
We include missing values because we want the full distribution for every day the crawler ran, and percentages give a clearer picture of the distribution than raw counts.
It’s just data exploration.
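To see what the two arguments change, here is a small sketch. The last_seen Series below is invented and includes one missing value on purpose; it is not taken from the real dataset.

```python
import pandas as pd

# Hypothetical column with one missing value
last_seen = pd.Series([
    "2016-04-06 06:45:54",
    "2016-04-06 14:17:17",
    None,
    "2016-04-05 12:00:00",
])
dates = last_seen.str[:10]  # the missing entry stays missing (NaN)

counts = dates.value_counts(dropna=False)                  # raw counts, NaN counted as its own row
shares = dates.value_counts(normalize=True, dropna=False)  # fractions of all 4 rows
shares_no_nan = dates.value_counts(normalize=True)         # fractions of the 3 non-missing rows
percent = (shares * 100).round(2)                          # fractions expressed as percentages

print(counts)
print(percent)
```

With dropna=False the fractions are taken over all rows, so a large share of missing dates would show up in the output instead of silently disappearing.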
May I ask about the part where you wrote autos['date_crawled'].str[:10] # Because the date is only till 9th index? I don't quite understand it. Wasn't the initial count of rows 50,000? Why does the index only go up to 9?
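A short sketch may help with this question, assuming each value looks like "2016-03-26 17:47:46" (the two sample strings below are made up). The "index" in .str[:10] refers to character positions inside each string, not to row numbers, so the number of rows is unaffected.

```python
import pandas as pd

# Hypothetical two-row sample of timestamp strings
stamps = pd.Series(["2016-03-26 17:47:46", "2016-04-04 13:38:56"])

# .str[:10] slices the characters of every string (positions 0 to 9)
first_ten_chars = stamps.str[:10]

# .iloc[:1] slices the rows of the Series instead
first_row_only = stamps.iloc[:1]

print(first_ten_chars)      # still 2 rows, each shortened to the date part
print(len(first_row_only))  # 1 row, with the full string kept
```

So applied to the full column, every one of the rows is kept; only the time part of each string is cut off.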