Guided Project:Clean And Analyze Employee Exit Surveys

Hi there, I got stuck on the guided project for Clean and Analyze Employee Exit Surveys page 9 of 11. specifically on doing extraction for the institute_service column, where the column contains 4 types of string patterns.

NaN 164
Less than 1 year 77
3-4 72
1-2 68
11-20 49
More than 20 years 48
5-6 36
7-10 30
5.0 30
3.0 28

I tried to use multiple patterns regex with ‘|’ (OR symbol) to extract all and obtained 4 columns of data. By having 4 columns extracted, resulted into another challenge of combining them into a column. code used below.

 pat =  r"(?P<one>[0-9])-(?P<two>[0-9][0-9]?)|More than (?P<more>[0-9][0-9]?) years|Less than (?P<less>[0-9]?) year|(?P<single>[0-9])"
 combined_updated_test = combined_updated['institute_service'].astype(str).str.extractall(pat)

The question: is there another way to extract these multiple patterns effectively?

1 Like

Hey @IamTry,
Thanks for sharing your question.
I can share my experience with this specific task. I feel that maybe you might have overcomplicated slightly. The data in the institute_service column either contains year range or a number with a string. For the purposes of the task at hand and to keep it simple I just used the following code to extract the number and that’s about it. I took the first number where there was a range but you could take an average as well. Ultimately the idea is to end up with one column.
years_of_service = combined_updated['institute_service'].astype('str').str.extract(r'(\d+)', expand=False).astype(float)
Hope this helps

1 Like

Hi @kamranmk,

Thanks for responding to the question. You are right! I have overcomplicate the regex pattern. I got confused with the concept as this guided project was done before I learnt about regular expression – apparently taught in the subsequent missions!

Thanks for the code recommendation, it works now! :slight_smile:

Cheers!

You are welcome @IamTry. Happy to help :blush:

I actually had a different solution. I used the vectorized string operators that I already learned about in previous missions. I feel pretty confused about what the “\d+” means in the guided solution, but since you said this is discussed in the later missions perhaps I will learn about this later.

Here was my solution:

#First change the type to str in the institute_service column

combined_updated[‘institute_service’] = combined_updated[‘institute_service’].astype(str)

#Then clean the institute_service column

combined_updated[‘institute_service’] = (combined_updated[‘institute_service’]
.str.split("-").str[-1]
.str.replace(“Less than”,"")
.str.replace(“More than”,"")
.str.replace(“year”, “”)
.str.replace(“s”,"")
.str.strip()
)

#Change the type to float

combined_updated[‘institute_service’] = combined_updated[‘institute_service’].astype(float)

@znarkin This solution would probably work as well. \d+ simply means match any set of numbers. This will always take the first match