Why do you use a named capturing group with regexp?

Screen Link:

My Code:

pattern = r"(?P<Years>[1-2][0-9]{3})"
years = merged['IESurvey'].str.extractall(pattern)
value_counts = years['Years'].value_counts()
print(value_counts)

I’m not sure if I understand the explanation right from the exercise:

Using a named capturing group means that we can refer to the group by the specified name instead of just a number

Is this right: because we use ?P<Years> in the regexp, I can call the value_counts method on the column years['Years'] of the object years?

Hi @jeroenstikkelorum,

When you create a named capturing group, in this case Years (because of ?P<Years> in the pattern), extractall creates a column with that name and stores the extracted matches in it. Hence we can access it by the name of the capturing group, like years['Years'].

With years['Years'] you can access all the extracted matches and use value_counts() just like on any other column. Note that the name is case-sensitive, so years['years'] would raise a KeyError. I hope this helps.
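To make the difference concrete, here is a minimal sketch comparing a named and an unnamed group. The surveys Series is made-up sample data standing in for merged['IESurvey'], so the values are an assumption and not the real dataset.

```python
import pandas as pd

# Hypothetical sample data standing in for merged['IESurvey']
surveys = pd.Series([
    "Expenditure survey/budget survey (ES/BS), 2004",
    "Integrated household survey (IHS), 2010",
    "Income survey (IS), 2004",
])

# Same regex, once with a named group and once with a plain group
named = surveys.str.extractall(r"(?P<Years>[1-2][0-9]{3})")
unnamed = surveys.str.extractall(r"([1-2][0-9]{3})")

print(named.columns.tolist())    # the column is labelled with the group name
print(unnamed.columns.tolist())  # the column is labelled with the group number

# The named column can be used like any other column
year_counts = named["Years"].value_counts()
print(year_counts)
```

With the unnamed group you would have to write unnamed[0] instead, which works but says nothing about what the column contains.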


Hi @jithins123

Thanks for your clear explanation.

A few new questions that came to mind:

  • Why would you use a capturing group?
  • When would you use a capturing group?
  • If you use the pattern with another object for example:

pattern = r"(?P<Years>[1-2][0-9]{3})"
visitors = merged['IESurvey'].str.extractall(pattern)

will an extra column Years be added to the object visitors?

A capturing group makes it easier to pull a particular piece of text or a number out of a column: only the part matched inside the parentheses of the regex in the pattern variable is captured. extractall then places the result in a new DataFrame whose column is named after the group, just as was done for the years variable.

Capturing is what you need when you want specific information from a column rather than everything in it. In this case, we want only the year from the IESurvey column of the merged dataset, not the whole value such as “Expenditure survey/budget survey (ES/BS), 2004”. After capturing the years with the regex defined in pattern, we can store them in a new variable, years, and then count how often each year occurs.
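As a rough illustration of capturing only part of a match, here is a sketch using Python's built-in re module instead of pandas. The extra ",\s*" in front of the group is my own addition, there only to show that the text outside the parentheses is matched but not captured.

```python
import re

text = "Expenditure survey/budget survey (ES/BS), 2004"

# The comma and whitespace are matched, but only the year is captured
match = re.search(r",\s*(?P<Years>[1-2][0-9]{3})", text)

whole_match = match.group(0)     # everything the regex matched
by_number = match.group(1)       # the first group, referred to by position
by_name = match.group("Years")   # the same group, referred to by name

print(repr(whole_match))
print(by_number, by_name)
```

Both by_number and by_name give the same captured year; the name is simply easier to read than the position, especially once a pattern has several groups.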

visitors is just another variable, so the result is another DataFrame with a single column, ‘Years’, just like the one you got for years. No column is added to an existing object; visitors and years are separate variables holding separate results.
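You can check this yourself with a small sketch like the one below. Again, the surveys Series is made-up sample data in place of merged['IESurvey'], and the astype(int) step is only there to show that changing one result leaves the other alone.

```python
import pandas as pd

# Hypothetical sample data standing in for merged['IESurvey']
surveys = pd.Series([
    "Expenditure survey/budget survey (ES/BS), 2004",
    "Integrated household survey (IHS), 2010",
])

pattern = r"(?P<Years>[1-2][0-9]{3})"

# Each call builds a new DataFrame; the variable name does not affect the columns
years = surveys.str.extractall(pattern)
visitors = surveys.str.extractall(pattern)

print(visitors.columns.tolist())  # the column name comes from the pattern
print(years is visitors)          # two separate objects

# Changing one result does not touch the other
visitors["Years"] = visitors["Years"].astype(int)
print(years["Years"].iloc[0], visitors["Years"].iloc[0])
```

So the column is called Years in both cases because of the group name in the pattern, not because of the variable you assign the result to.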

Hope this helps.