Guided Project 7: How to verify the years in 'cease_date' and 'dete_start_date'

Hi everyone,

I’m at step 5 (Verify the Data) of the project (Clean and Analyze Employee Exit Surveys), and I’m struggling to understand how we figure out which years in the cease_date and dete_start_date columns make sense:

  • Since the cease_date is the last year of the person’s employment and the dete_start_date is the person’s first year of employment, it wouldn’t make sense to have years after the current date.
  • Given that most people in this field start working in their 20s, it’s also unlikely that the dete_start_date was before the year 1940.

Could anyone please explain what "current date" means in the first point? The second point is entirely beyond my understanding.

Thank you!

Excellent question!

I agree with you that the content text could have been clearer. It’s also important to note what year the data was collected. Based on the data source provided, ~2014 seems like a reasonable assumption to me.

I think the current date refers to our current year. What they are trying to let us know is that one of the ways to verify date-related data is to check the data in relation to when it was collected and our current/today’s date.

For example, if the data was collected in 2014, it’s possible to have start and end dates set in the future (relative to 2014). Someone might be joining later for some reason, or someone might be set to leave in a future year. But a lot of such values (start or end dates in years after 2014) would distort our analysis: we would end up analyzing things that haven’t actually happened, which might not fit what we are trying to analyze.
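To make that check concrete, here is a minimal sketch in plain Python. The sample years are made up for illustration (they are not from the real dataset), the 2014 collection year is my assumption from above, and `out_of_range` is just a hypothetical helper name:

```python
# Assumed bounds: 1940 comes from the project text, 2014 is my guess
# at the collection year. Neither is confirmed by the dataset itself.
EARLIEST_PLAUSIBLE_START = 1940
COLLECTION_YEAR = 2014

def out_of_range(years, low, high):
    """Return the years that fall outside [low, high], in input order."""
    return [y for y in years if not (low <= y <= high)]

# Hypothetical start years, including two deliberately implausible ones.
sample_start_years = [1963, 1999, 2011, 1900, 2020]

suspicious = out_of_range(sample_start_years,
                          EARLIEST_PLAUSIBLE_START,
                          COLLECTION_YEAR)
print(suspicious)  # [1900, 2020]
```

In the project you would run the same kind of check on the year values extracted from dete_start_date and cease_date, then decide whether the flagged rows are few enough to drop.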

If the data was collected in 2014, then 2014 - 1940 = 74 years. So someone whose first year of employment was 1940, and who started in their 20s, would be in their 90s by 2014. That is why it’s unlikely that anyone in the dataset started before 1940.

Honestly, I think 1940 might be too generous a cutoff here. The mid-to-late 1950s might be a better threshold.
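Here is the age arithmetic as a small sketch, so you can try different thresholds yourself. The collection year (2014) and starting age (20) are assumptions on my part, and `implied_age` is a hypothetical helper, not something from the project:

```python
def implied_age(start_year, collection_year=2014, age_at_start=20):
    """Age at collection time for someone who started work at age_at_start.

    Both defaults are assumptions: 2014 is a guess at when the survey
    was collected, and 20 stands in for "started in their 20s".
    """
    return age_at_start + (collection_year - start_year)

for start_year in (1940, 1955, 1990):
    print(start_year, implied_age(start_year))
# 1940 94
# 1955 79
# 1990 44
```

Even a 1955 start implies someone close to 80 at collection time, which is why I suspect a later cutoff than 1940 would still be safe.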

But, the overall idea is to think about these dates more critically within the context provided. It could have been phrased better, though. The above is my understanding of it, but I could be mistaken if the content’s intent was different.


@the_doctor thank you so much! Your explanation really helped me make sense of these points.
