Extract year from 'int64' using string methods or pd.to_datetime()

Hi everyone, I am completing the guided project ‘Clean and Analyze Employee Exit Surveys’ in the Dataquest ‘Data Analyst with Python’ and I’m stuck on a seemingly minor issue.

Screen Link:

I am working with a dataset called ‘dete_resignations’ which contains a column called ‘cease_date’. The column is of type int64 and it contains entries with the format ‘YYYY’ and other entries with the format ‘MM/YYYY’.

My aim is to extract the year from this column.

My first ideas was to identify a regex ‘MM/YYYY’ and then select the year part:

dete_resignations['cease_date'].str.extract(r"[0-1][0-9]/[1-2][0-9]{3}")

If I try to do this, I get an error: "Can only use .str accessor with string values! " I suppose I cannot use .str.extract because my type is ‘int64’.

So I move to my second idea: let’s convert the data to datetime with pd.to_datetime(), then extract the year using dt.year.

pd.to_datetime(dete_resignations['cease_date'], unit = 's').dt.year.astype(float)

If I try to do this, the code runs, but as a result I get the year ‘1970’ for all entries. I looked up on Stack Overflow and they said this issue could be solved by setting the parameter unit = ‘s’ but it does not work for me. By printing intermediate results, I managed to understand that the problem comes not when I extract the year, but when I turn it into float.

At this point, I try to use the proposed solution…which is the following:

so I cut and past the code:

dete_resignations['cease_date'] = dete_resignations['cease_date'].str.split('/').str[-1]
dete_resignations['cease_date'] = dete_resignations['cease_date'].astype("float")

…and again I end up with the error ‘Can only use .str accessor with string values!’ like in my first approach. I checked in the solution file but it doesn’t seem to me like the authors did something with the data, like converting it to a string type, before these lines.

Please help! I’m really struggling! I’d like to know:

  1. Is it possible to use string methods, both ‘str.extract()’ and ‘str.split()’ if my data is int64? How can I do it?
  2. Why do I get ‘1970’ when I try to turn the year I extracted from datetime into float?

Thank you everyone…

1 Like

I can see ‘YYYY’ as an int but ‘MM/YYYY’ shouldn’t be an int.

Have you tried typecasting the series into a string with astype(str)?

If I try to do this, the code runs, but as a result I get the year ‘1970’ for all entries. I looked up on Stack Overflow and they said this issue could be solved by setting the parameter unit = ‘s’ but it does not work for me. By printing intermediate results, I managed to understand that the problem comes not when I extract the year, but when I turn it into float.

I don’t exactly know how pandas work behind the scene, but from my experience, we normally convert a string into a datetime. Other than that, even if an int is typecasted into a string, you probably have to understand how pandas parses the string as a date, so you can be sure that it produces the correct datetime. For example, the string ‘2000’ doesn’t have as much information as ’ 10/11/12’.

It would be helpful if you can share the .value_counts() for the problematic column so we can see the data format.

Hi, first of all, thanks for trying to help!

I think you are right, it was a string from the beginning, indeed if I do

dete_resignations['cease_date'].dtypes

I get:
dtype(‘O’)

I mistakenly wrote it was an int64 because that is what I read at the end of the value_counts, but now I understand it does not refer to the dtype of the column…first lesson learned here!

If I print values for this column I get:

dete_resignations['cease_date'].value_counts()

2012 126
2013 74
01/2014 22
12/2013 17
06/2013 14
09/2013 11
11/2013 9
07/2013 9
10/2013 6
08/2013 4
05/2012 2
05/2013 2
2010 1
07/2012 1
07/2006 1
09/2010 1
Name: cease_date, dtype: int64

Now, if my column is indeed of type string, I should be able to use both str.extract and str.split like I tried.
If I do that, I still get the error ‘Can only use .str accessor with string values!’

After some brainsqueezing, I think I’ve figured out why: the code actually works, my mistake was that I ran it multiple times in Jupyter and of course, after converting it to float the first time, it cannot perform the .str.split() again. I feel very stupid about this, I’ve only been doing this for 10 days and still getting familiar with the tools :cold_sweat: If I do ‘run all above’ before repeating this step, everything works fine.

And unfortunately, I did the same mistake in the other case! I just found out:

Let’s say I want to use pd.to_datetime and then dt.year instead of using str.split or str.extract. So I revert back to the beginning , back to my original column which is a string, and I do:

dete_resignations['cease_date'] = pd.to_datetime(dete_resignations['cease_date']).dt.year
dete_resignations['cease_date'].dtypes

I get: dtype(‘float64’) which I suppose is the year extracted from the datetime object. And:

dete_resignations['cease_date'].value_counts()

I get:
2013.0 146
2012.0 129
2014.0 22
2010.0 2
2006.0 1

which is correct. So where did 1970 came from? It was my fault, because again I ran both lines of code twice, the ‘to_datetime.dt.year’ together with the ‘value_counts’, and of course after extracting the year and turning it into float I could not extract it again from a float value, and that’s where the weirdness came from.

In conclusion:

  • The data was actually of string type, to get the type I need to run ‘dtypes’ and NOT reading the type written at the end of value_counts, which is always int64.
  • The code was ok, and all three options proposed worked just fine
  • What did not work fine was my ability to operate Jupyter notebook: if I ran the cells multiple times, of course I get broken code.

Thank you for your answer, which helped me re-examine everything and understand where the problem was, and I hope this lesson will stick with me and hopefully help other newbies :sweat_smile:

1 Like

Haha yeah it’s easy to be trapped by Jupyter with regards to the order you run your cells. It’s more confusing if you’re used to run everything using an IDE where you have to rerun things all the time, so there’s no persistent state like in Jupyter.

To add to your conclusions, in case of value_counts(), it’s a frequency table so the values of the columns make up the index while the frequency (int64) is the value. It’s easy to get tripped up for sure because it’s not simply a Series out of the dataframe.

And one thing I learned is to try to avoid changing the original dataframe (aka mutating it). It’s tedious, but after any big data transformation, it is worthwhile to create a copy of a dataframe under a different variable.

Anyhow, I’m glad that you got it all figured out. Keep on learning.

You are absolutely right. This is something I need to practice on, because I always tend to work on the original dataframe because ‘if I get it wrong, I can always CTRL+Z my way back to the working status’ :sweat_smile: but building good basic habits goes a long way. This was truly useful to me! Thanks again

1 Like

No worries.

Glad I can be of help.