str.extract(r'(\d+)') - can someone explain how this code works?

I am working on the guided project: Clean and Analyze Employee Exit Surveys
These are examples of the values from which I need to extract the number of years as a single number:

Less than 1 year
More than 20 years

This is the code from dataquest solutions notebook:
combined_updated['institute_service_up'] = combined_updated['institute_service'].astype('str').str.extract(r'(\d+)')
combined_updated['institute_service_up'] = combined_updated['institute_service_up'].astype('float')

The result:

So, how exactly did str.extract(r'(\d+)') work in this case?

It’s best to break this down into pieces:

  • First, the r' ' syntax: this is called raw string notation, and it allows you to write regex patterns without having to escape backslashes. I wouldn’t worry too much about this just yet, as we’ll go over it in a future course, but if you’re interested, you can read more about it here:

  • Next, the actual regex pattern, (\d+): the pair of parentheses is what’s called a capturing group. It tells the .extract() method to pull out the text that matches the expression inside the parentheses. \d is a special character class that matches any single digit (0-9). + is what’s called a quantifier: it tells the method to match one or more of the preceding token.
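To make the two points above concrete, here is a minimal sketch using Python's built-in re module rather than pandas. The sample string comes from the question; the variable names are only for illustration.

```python
import re

# A raw string keeps the backslash as-is, so these two patterns are identical.
raw_pattern = r'(\d+)'
escaped_pattern = '(\\d+)'
same = raw_pattern == escaped_pattern

# re.search finds the first place in the string where the pattern matches.
match = re.search(raw_pattern, 'More than 20 years')

# group(1) is the text captured by the first pair of parentheses.
first_number = match.group(1)

print(same)          # True
print(first_number)  # 20
```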

So, when put together, (\d+) means that we’re capturing a run of one or more consecutive digit characters in the string. Without the +, we’d still be matching digits, but since \d represents just a single character, each match would be a single digit instead.

For example, if I had a string ‘123 456 789’:

(\d+) would find 3 matches: 123, 456, and 789.
(\d) would find 9 matches: 1, 2, 3, 4, 5, 6, 7, 8, 9
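If you want to check this yourself, a quick sketch with re.findall from the standard library (not the pandas method used in the project) lists every match for each pattern:

```python
import re

text = '123 456 789'

# With the + quantifier, each run of digits is one match.
with_plus = re.findall(r'(\d+)', text)

# Without it, every digit is its own match.
without_plus = re.findall(r'(\d)', text)

print(with_plus)     # ['123', '456', '789']
print(without_plus)  # ['1', '2', '3', '4', '5', '6', '7', '8', '9']
```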

Does that make sense? Again, it’s covered in a later course, so I wouldn’t worry too much about the syntax just yet.


Worth mentioning that str.extract() only returns the first match — str.extractall() returns all matches.
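Here is a small sketch of that difference. The Series below is made-up sample data modelled on the values in the question (the '11-20' value is my own assumption, added so that one row contains two numbers), and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical sample values; '11-20' is included to show a row with two numbers.
service = pd.Series(['Less than 1 year', 'More than 20 years', '11-20'])

# str.extract keeps only the first match in each row (as a one-column DataFrame).
first_only = service.str.extract(r'(\d+)')

# str.extractall returns every match, one row per match,
# indexed by (original row, match number).
every_match = service.str.extractall(r'(\d+)')

# Same conversion as in the solutions notebook: strings to floats.
years = first_only[0].astype('float')

print(years.tolist())           # [1.0, 20.0, 11.0]
print(every_match[0].tolist())  # ['1', '20', '11', '20']
```

Note that for '11-20', str.extract keeps 11 and drops 20, which is why the choice between the two methods matters for ranges.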


Yes, this is very helpful, thanks!

Thanks dustinako,
Great and very helpful explanation!