I am working on the guided project: Clean and Analyze Employee Exit Surveys
These are examples of the values from which I need to extract a single number of years:
Less than 1 year
1-2
3-4
5-6
11-20
More than 20 years
This is the code from the Dataquest solutions notebook:
combined_updated['institute_service_up'] = combined_updated['institute_service'].astype('str').str.extract(r'(\d+)')
combined_updated['institute_service_up'] = combined_updated['institute_service_up'].astype('float')
The result:
1.0
3.0
5.0
11.0
21.0
So, how exactly did str.extract(r'(\d+)') work in this case?
Next, the actual regex pattern: (\d+). The pair of parentheses is what’s called a capturing group. It tells the .extract() method to pull out the part of the string that matches the expression inside the parentheses. \d is a shorthand character class: it matches any single digit (0-9). + is what’s called a quantifier: it tells the regex engine to match one or more of the preceding token.
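Here is a small sketch of how that plays out on the values from the question. The Series below is built from those sample strings (it is not the real survey column), and expand=False is my own addition so that the result is a Series rather than a one-column DataFrame.

```python
import pandas as pd

# Sample strings from the question, not the real survey data.
service = pd.Series([
    "Less than 1 year", "1-2", "3-4", "5-6", "11-20", "More than 20 years",
])

# For each string, .str.extract() keeps capture group 1 of the first match.
# expand=False returns a Series instead of a one-column DataFrame.
years = service.str.extract(r"(\d+)", expand=False).astype(float)

print(years.tolist())  # [1.0, 1.0, 3.0, 5.0, 11.0, 20.0]
```

Note that "More than 20 years" gives 20.0 here, so a 21.0 in your result would have to come from some other value in the column.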
So, when put together, (\d+) means that we’re capturing a run of one or more consecutive digit characters in the string. Without the +, \d represents just a single character, so each match would be a single digit instead of a whole number.
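For example, if I had a string '123 456 789':
(\d+) would find 3 matches: 123, 456, and 789.
(\d) would find 9 matches: 1, 2, 3, 4, 5, 6, 7, 8, 9.
Keep in mind that .extract() returns only the first match in each string, which is why a value like 1-2 becomes 1 and 11-20 becomes 11.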
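If you want to check this yourself, here is a minimal sketch using Python’s built-in re module instead of pandas. re.findall lists every match, and re.search stops at the first one, which is the behaviour .extract() has for each string. The variable names are just for illustration.

```python
import re

text = "123 456 789"

# With the + quantifier, each match is a whole run of digits.
runs = re.findall(r"(\d+)", text)

# Without it, each match is a single digit.
singles = re.findall(r"(\d)", text)

# re.search stops at the first match, like .extract() does per string.
first = re.search(r"(\d+)", "11-20").group(1)

print(runs)     # ['123', '456', '789']
print(singles)  # ['1', '2', '3', '4', '5', '6', '7', '8', '9']
print(first)    # 11
```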
Does that make sense? Again, it’s covered in a later course, so I wouldn’t worry too much about the syntax just yet.