Hi there, I got stuck on the guided project for Clean and Analyze Employee Exit Surveys page 9 of 11. specifically on doing extraction for the institute_service column, where the column contains 4 types of string patterns.
Less than 1 year 77
More than 20 years 48
I tried to use multiple patterns regex with ‘|’ (OR symbol) to extract all and obtained 4 columns of data. By having 4 columns extracted, resulted into another challenge of combining them into a column. code used below.
pat = r"(?P<one>[0-9])-(?P<two>[0-9][0-9]?)|More than (?P<more>[0-9][0-9]?) years|Less than (?P<less>[0-9]?) year|(?P<single>[0-9])"
combined_updated_test = combined_updated['institute_service'].astype(str).str.extractall(pat)
The question: is there another way to extract these multiple patterns effectively?
Thanks for sharing your question.
I can share my experience with this specific task. I feel that maybe you might have overcomplicated slightly. The data in the institute_service column either contains year range or a number with a string. For the purposes of the task at hand and to keep it simple I just used the following code to extract the number and that’s about it. I took the first number where there was a range but you could take an average as well. Ultimately the idea is to end up with one column.
years_of_service = combined_updated['institute_service'].astype('str').str.extract(r'(\d+)', expand=False).astype(float)
Hope this helps
Thanks for responding to the question. You are right! I have overcomplicate the regex pattern. I got confused with the concept as this guided project was done before I learnt about regular expression – apparently taught in the subsequent missions!
Thanks for the code recommendation, it works now!
You are welcome @IamTry. Happy to help
I actually had a different solution. I used the vectorized string operators that I already learned about in previous missions. I feel pretty confused about what the “\d+” means in the guided solution, but since you said this is discussed in the later missions perhaps I will learn about this later.
Here was my solution:
#First change the type to str in the institute_service column
combined_updated[‘institute_service’] = combined_updated[‘institute_service’].astype(str)
#Then clean the institute_service column
combined_updated[‘institute_service’] = (combined_updated[‘institute_service’]
#Change the type to float
combined_updated[‘institute_service’] = combined_updated[‘institute_service’].astype(float)
@znarkin This solution would probably work as well. \d+ simply means match any set of numbers. This will always take the first match