Parentheses needed when extracting substrings from a series

In this lesson, we are asked to replace

pattern = r"(your_answer)"

with a combination of expressions in order to extract years from a particular column in a dataframe. The following code

pattern = r"([1-2][0-9]{3})"

works only if I include the parentheses. If I remove them, I get ValueError: pattern contains no capture groups. The examples listed in the lesson did not show the use of parentheses. Are these always necessary? Is there a tutorial that explains how to extract characters from strings using patterns? All I have been able to find is technical documentation or very basic explanations that don’t answer the question of why the parentheses are necessary.

Edit: The next lesson (a little late IMO) explains why the parentheses were included, stating:

The parentheses indicate that only the character pattern matched should be extracted and returned in a series.

I now understand why to include parentheses but can’t understand why not having them caused a ValueError.

It gives you an error because pandas.Series.str.extract requires capture groups (the bold italics below are mine):

Series.str.extract(pat, flags=0, expand=True)

Extract capture groups in the regex pat as columns in a DataFrame.

For each subject string in the Series, extract groups from the first match of regular expression pat.

Parameters:
    pat : str
        Regular expression pattern with capturing groups.
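Here is a minimal sketch of that behaviour. The sample Series is made up for illustration (it is not the lesson's data), and the exact error text may vary between pandas versions:

```python
import pandas as pd

# Hypothetical sample data; the lesson's actual column is not shown here.
titles = pd.Series(["Released in 1984", "No year here", "Reprinted 2003"])

# Without parentheses there is no capture group, so extract raises ValueError.
try:
    titles.str.extract(r"[1-2][0-9]{3}")
except ValueError as err:
    error_message = str(err)

# With parentheses, the captured group becomes the extracted values.
# expand=False returns a Series (instead of a one-column DataFrame).
years = titles.str.extract(r"([1-2][0-9]{3})", expand=False)

# str.findall does not need a capture group; it returns a list of matches per row.
all_matches = titles.str.findall(r"[1-2][0-9]{3}")
```

So the parentheses are not needed by regular expressions in general, only by methods such as str.extract that are defined in terms of capture groups. Rows with no match come back as NaN.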

You’re using regular expressions (which is its own world) inside pandas (another world) inside Python (yet another world).

To use regular expressions with strings directly (not with Series), you can just use the re module:

>>> import re
>>> s = (
...     "Nineteen Eighty-Four (also published as 1984) is a dystopian "
...     "social science fiction novel and cautionary tale by English "
...     "writer George Orwell. It was published on 8 June 1949 by "
...     "Secker & Warburg as Orwell's ninth and "
...     "final book completed in his lifetime."
... )
>>> pattern = r"\d{4}"
>>> re.findall(pattern, s)
['1984', '1949']
>>> re.match(pattern, s)  # returns None: match only tests the start of the string
>>> re.search(pattern, s)
<re.Match object; span=(40, 44), match='1984'>
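
To tie this back to the parentheses, here is a small sketch of what a capture group does in plain re. The sentence and the "June" prefix are just illustrative choices:

```python
import re

text = "It was published on 8 June 1949 by Secker & Warburg."

# No parentheses: the match has only group 0, the whole matched text.
whole = re.search(r"\d{4}", text)
whole_text = whole.group(0)
whole_groups = whole.groups()  # empty tuple, nothing was captured

# Parentheses capture just the part of the match you want to pull out.
captured = re.search(r"June (\d{4})", text)
full_match = captured.group(0)  # everything the pattern matched
year_only = captured.group(1)   # only what the parentheses enclosed
```

Series.str.extract returns the groups rather than the whole match, which is why it has nothing to return when the pattern contains no parentheses.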