What does the column 'match' in extractall() mean?

The example given by DQ shows that the ‘match’ values “indicate the order the match appeared in the original dataframe”.

What exactly does this mean?

Thank you for the help in advance!

Please also share a link to the screen; it would help us understand your question.

oops sorry

Hi @yijiyap,

I’m assuming you’re referring to the "match" column that has integer values.

The way I understand this, although I may be wrong, is that this example uses the str.extractall method (as opposed to the str.extract method). The documentation for str.extractall can be found here while the documentation for str.extract is here.

The str.extract method is described as follows:

Extract capture groups in the regex pat as columns in a DataFrame.

For each subject string in the Series, extract groups from the first match of regular expression pat.

On the other hand, str.extractall is described as:

Extract capture groups in the regex pat as columns in DataFrame.

For each subject string in the Series, extract groups from all matches of regular expression pat. When each subject string in the Series has exactly one match, extractall(pat).xs(0, level='match') is the same as extract(pat).
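
To make that last sentence concrete, here is a minimal sketch, assuming a made-up Series (one_match_each is my own name, not from the lesson) in which every string contains exactly one number. Selecting match 0 with xs should then give the same values as extract:

```python
import pandas as pd

# Hypothetical data: every string contains exactly one number.
one_match_each = pd.Series(["Our house has 2 bedrooms.", "I own 15 cats."])
pattern = r"(?P<numbers>\d+)"

first_only = one_match_each.str.extract(pattern)      # first match per string
all_matches = one_match_each.str.extractall(pattern)  # every match per string

# Keep only the rows where the "match" index level is 0 and drop that level.
first_from_all = all_matches.xs(0, level="match")

print(first_only["numbers"].tolist())
print(first_from_all["numbers"].tolist())
```

Both lists should hold the same strings, since each row only has a match number 0 to pick from.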

The difference between the two is that str.extractall captures/extracts all instances that match the regex pattern while str.extract only captures/extracts the first instance of the match.

Let’s take this example code:

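import pandas as pd

my_strings = pd.Series(["I own 15 cats and 8 dogs.", "Our house has 2 bedrooms."])
pattern = r"(?P<numbers>\d+)"  # named capture group: one or more digits
extracted = my_strings.str.extract(pattern)  # first match in each string
extractedall = my_strings.str.extractall(pattern)  # every match in each string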

If we examine our two new variables, extracted and extractedall, we will see the following:

For extracted:

numbers
0 15
1 2

For extractedall:

   match  numbers
0  0      15
0  1      8
1  0      2

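You may notice that extractedall has an extra "match" entry since it captures all instances of our regex pattern (\d+) and stores them in a DataFrame. Strictly speaking, "match" is a second level of the index rather than a regular column, although it looks like a column when printed. Its values number the matches within each string, in the order they appear in that string, and the count restarts at 0 for every new string.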

  • 15 was the first instance (match index 0) extracted from the first string (series index 0)
  • 8 was the second instance (match index 1) from the first string (series index 0)
  • 2 was the first instance (match index 0) from the second string (series index 1)
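
If you want to work with the match numbers directly, here is a rough sketch that builds on the example above. The variable names (index_names, flat, and so on) are my own choices for illustration:

```python
import pandas as pd

my_strings = pd.Series(["I own 15 cats and 8 dogs.", "Our house has 2 bedrooms."])
extractedall = my_strings.str.extractall(r"(?P<numbers>\d+)")

# "match" is the name of the second index level; the first level is unnamed.
index_names = list(extractedall.index.names)

# Look up a single value by (series index, match number).
second_number_in_first_string = extractedall.loc[(0, 1), "numbers"]

# Move "match" out of the index so it becomes an ordinary column.
flat = extractedall.reset_index(level="match")

# Count how many matches each original string produced.
matches_per_string = extractedall.groupby(level=0).size()

print(index_names)
print(second_number_in_first_string)
print(flat)
print(matches_per_string)
```

Moving "match" into a column with reset_index is handy if you would rather filter on it with a normal boolean mask than with index-based lookups.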

Note that the index numbers still follow the zero-based indexing convention used in pandas/Python.
Hope this clears it up!

oh wow thank you very much!