What is the named capturing group for?

The Working with Strings in pandas mission uses a named capturing group in its regex pattern.
Why doesn’t the code work when I remove the named capturing group and use a normal capturing group, like so:

pattern = r"([1-2][0-9]{3})"
years = merged['IESurvey'].str.extractall(pattern)
value_counts = years['Years'].value_counts()
print(value_counts) 

Link to mission: https://app.dataquest.io/m/346/working-with-strings-in-pandas/9/extracting-all-matches-of-a-pattern-from-a-series


Hello @56anna,

I think the regex docs provide an answer:

Named groups behave exactly like capturing groups, and additionally associate a name with a group. (bolded by kakoori)

Additionally, you can retrieve named groups as a dictionary with groupdict() :

>>> m = re.match(r'(?P<first>\w+) (?P<last>\w+)', 'Jane Doe') 
>>> m.groupdict()
{'first': 'Jane', 'last': 'Doe'}

Named groups are handy because they let you use easily-remembered names, instead of having to remember numbers.

There’s also a cool example:

InternalDate = re.compile(r'INTERNALDATE "'
        r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-'
        r'(?P<year>[0-9][0-9][0-9][0-9])'
        r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])'
        r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])'
        r'"')

So it’s sort of similar to using a pandas DataFrame: you can access a column by specifying its name or its index.
Named groups are just to make this name referencing possible.
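
To make that concrete, here is a minimal sketch using only the standard re module and the 'Jane Doe' string from the docs example above. The variable names are just for illustration. It shows that a named group can be read by name or by number, while an unnamed group can only be read by number.

```python
import re

# Named groups: each group gets a name as well as a position.
m = re.match(r"(?P<first>\w+) (?P<last>\w+)", "Jane Doe")
by_name = m.group("first")   # look the group up by its name
by_number = m.group(1)       # the same group, looked up by position
print(by_name, by_number)    # Jane Jane

# Unnamed groups: only the position is available.
u = re.match(r"(\w+) (\w+)", "Jane Doe")
print(u.group(2))            # Doe
print(u.groupdict())         # {} because no group has a name
```

So removing the name does not change what is matched, only how the match can be referred to afterwards.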

Hope this helps.


Hello! If named groups behave exactly like capturing groups, then why doesn’t my code work?

When I run pattern = r"([1-2][0-9]{3})" I get TypeError: an integer is required

I ran it and got KeyError: 'Years'. Maybe you should post the full code.

I’ve edited my post to show the full code

When you remove the name from the capturing group, the resulting column is labelled with an integer instead of a name.

So in your case, years is a pandas.DataFrame with a column named 0.
That is why Python wants an integer.

In the above code, when

value_counts = years['Years'].value_counts()

is called, Python tries to fetch a column named ‘Years’ but there isn’t any. There is a column named 0, however, so modifying the code slightly:

value_counts = years[0].value_counts()

returns the correct result.
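
Here is a small self-contained sketch of the difference. The surveys Series is a made-up stand-in for merged['IESurvey'], since the mission data is not included in this thread; the point is only the column labels that extractall produces.

```python
import pandas as pd

# Hypothetical stand-in for merged['IESurvey'] (not the mission data).
surveys = pd.Series(["IES survey 2005", "2007 and 2009 surveys", "no year here"])

# Unnamed group: the single output column is labelled with the integer 0.
unnamed = surveys.str.extractall(r"([1-2][0-9]{3})")
# Named group: the output column takes the group's name.
named = surveys.str.extractall(r"(?P<Years>[1-2][0-9]{3})")

print(list(unnamed.columns))  # [0]
print(list(named.columns))    # ['Years']

# Each frame must be indexed with the label it actually has.
counts_unnamed = unnamed[0].value_counts()
counts_named = named["Years"].value_counts()
```

Both counts hold the same values; only the column label used to reach them differs.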


I ran your code and it gave me the same error: KeyError: 'Years'. However, I also saw the error TypeError: an integer is required further up. I think you should scroll down to the very end of the traceback to read the type of error.

It is an indexing error raised when you tried to use years['Years']. The code below creates a DataFrame called years with a column named Years. When you do not use ?P<Years>, the column is named 0, as @kakoori explained.

pattern = r"(?P<Years>[1-2][0-9]{3})"
years = merged['IESurvey'].str.extractall(pattern)


Thank you very much!

This helped a lot! I did not realise that the type of error is given at the bottom of the traceback. Thank you!
