str.extract(r'(\d+)') - can someone explain how this code works?

I am working on the guided project: Clean and Analyze Employee Exit Surveys.
These are examples of the values from which I need to extract a single number of years:

Less than 1 year
1-2
3-4
5-6
11-20
More than 20 years

This is the code from the Dataquest solutions notebook:
combined_updated['institute_service_up'] = combined_updated['institute_service'].astype('str').str.extract(r'(\d+)')
combined_updated['institute_service_up'] = combined_updated['institute_service_up'].astype('float')

The result:
1.0
3.0
5.0
11.0
21.0

So, how exactly did str.extract(r'(\d+)') work in this case?

It’s best to break this down into pieces:

  • First, the r' ' syntax: this is called raw string notation. It tells Python to leave backslashes exactly as written, so you can use a regex like \d without having to double every backslash. I wouldn’t worry too much about this just yet, as we’ll go over it in a future course, but if you’re interested, you can read more about it here: https://stackoverflow.com/questions/12871066/what-exactly-is-a-raw-string-regex-and-how-can-you-use-it

  • Next, the actual regex pattern: (\d+). The pair of parentheses is what’s called a capturing group. It tells the .extract() method to pull out the part of the string that matches the expression inside the parentheses. \d is a character class shorthand that matches any single digit (0-9). + is what’s called a quantifier: it matches one or more of the preceding token.
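To make the raw string point concrete, here is a minimal Python sketch that uses only the standard re module. The variable names are made up for illustration.

```python
import re

# Without the r prefix, each backslash must be doubled so Python keeps it.
escaped_pattern = '\\d+'
# With the r prefix, the backslash is left exactly as typed.
raw_pattern = r'\d+'

# Both spellings produce the same three-character pattern string.
same_pattern = escaped_pattern == raw_pattern

# The difference matters for sequences Python would otherwise interpret:
newline_length = len('\n')   # one character: a newline
raw_length = len(r'\n')      # two characters: a backslash and the letter n

# The pattern finds the first run of digits in one of the survey values.
first_number = re.search(raw_pattern, 'More than 20 years').group()

print(same_pattern, newline_length, raw_length, first_number)
```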

So, when put together, (\d+) means that we’re capturing runs of one or more consecutive digit characters in the string. Without the +, the pattern would still match digits, but since \d represents just a single character, each match would be a single digit rather than a whole number.

For example, if I had the string '123 456 789':

(\d+) would find 3 matches: 123, 456, and 789.
(\d) would find 9 matches: 1, 2, 3, 4, 5, 6, 7, 8, 9.
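If you want to check this yourself, a quick sketch with Python’s built-in re module shows the difference. re.findall is used here only because it lists every match; the variable names are just for illustration.

```python
import re

text = '123 456 789'

# With the + quantifier, each match is a whole run of digits.
whole_numbers = re.findall(r'(\d+)', text)

# Without it, each match is exactly one digit.
single_digits = re.findall(r'(\d)', text)

print(whole_numbers)   # ['123', '456', '789']
print(single_digits)   # ['1', '2', '3', '4', '5', '6', '7', '8', '9']
```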

Does that make sense? Again, it’s covered in a later course, so I wouldn’t worry too much about the syntax just yet.


Worth mentioning that str.extract() only returns the first match in each string; str.extractall() returns all matches.
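Here is a small sketch of that difference, assuming pandas is installed. The Series is made up from the example values in the question, not the real project data, and the variable names are hypothetical.

```python
import pandas as pd

# Example values copied from the question (not the full dataset).
service = pd.Series([
    'Less than 1 year',
    '1-2',
    '3-4',
    '5-6',
    '11-20',
    'More than 20 years',
])

# str.extract keeps only the first run of digits in each string.
# expand=False returns a Series instead of a one-column DataFrame.
first_match = service.str.extract(r'(\d+)', expand=False).astype(float)
print(first_match.tolist())   # [1.0, 1.0, 3.0, 5.0, 11.0, 20.0]

# str.extractall keeps every run of digits, one row per match,
# indexed by (original row, match number).
all_matches = service.str.extractall(r'(\d+)')
print(all_matches.xs(4)[0].tolist())   # ['11', '20'] for the '11-20' row
```

So for a range like 11-20, str.extract gives you the lower bound only.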


Yes, this is very helpful, thanks!

Thanks dustinako,
Great and very helpful explanation!