The example given by DQ shows that the ‘match’ values “indicate the order the match appeared in the original dataframe”.
What exactly does this mean?
Thank you for the help in advance!
Please also share a link to the screen; it would help us understand your question.
Hi @yijiyap,
I’m assuming you’re referring to the "match" column that has integer values.
The way I understand this, although I may be wrong, is that we’re using the str.extractall method in this example (as opposed to the str.extract method). The documentation for str.extractall can be found here while the documentation for str.extract is here.
The str.extract method is described as follows:
Extract capture groups in the regex pat as columns in a DataFrame.
For each subject string in the Series, extract groups from the first match of regular expression pat.
On the other hand, str.extractall is described as:
Extract capture groups in the regex pat as columns in DataFrame.
For each subject string in the Series, extract groups from all matches of regular expression pat. When each subject string in the Series has exactly one match, extractall(pat).xs(0, level='match') is the same as extract(pat).
The difference between the two is that str.extractall captures/extracts all instances that match the regex pattern while str.extract only captures/extracts the first instance of the match.
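The equivalence quoted from the documentation can be checked with a small sketch. The series one_match below is a made-up example (it is not from the original exercise) in which every string contains exactly one number:

```python
import pandas as pd

# Hypothetical series: each string contains exactly one number.
one_match = pd.Series(["room 12", "floor 3"])
pattern = r"(?P<numbers>\d+)"

first_only = one_match.str.extract(pattern)
all_matches = one_match.str.extractall(pattern)

# Keep only the rows whose "match" level is 0 and drop that level,
# leaving one row per original string.
first_from_all = all_matches.xs(0, level="match")

print(first_only)
print(first_from_all)
```

Both printed frames should show the same values against the same row labels, since there is only one match per string to pick from.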
Let’s take this example code:
import pandas as pd
my_strings = pd.Series(["I own 15 cats and 8 dogs.", "Our house has 2 bedrooms."])
pattern = r"(?P<numbers>\d+)"
extracted = my_strings.str.extract(pattern)
extractedall = my_strings.str.extractall(pattern)
If we examine our two new variables, extracted and extractedall, we will see the following:
For extracted:
| | numbers |
|---|---|
| 0 | 15 |
| 1 | 2 |
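For extractedall:
| | match | numbers |
|---|---|---|
| 0 | 0 | 15 |
| 0 | 1 | 8 |
| 1 | 0 | 2 |
You may notice that extractedall has an extra "match" column (strictly speaking, a second level of the row index) since it captures all instances of our regex pattern (\d+) and stores them in a dataframe. The first column shows which string in the original series a match came from, and the value in the "match" column counts the matches within that string: 0 is the first match found, 1 is the second, and so on. That is why 15 and 8 both belong to row 0 of the series, with "match" values 0 and 1.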
Note that the index numbers still follow the zero-based indexing convention used in pandas/Python.
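To see the "match" numbering directly, here is a small sketch that reuses the same series and pattern. The variable names flat and second_matches are only for illustration:

```python
import pandas as pd

my_strings = pd.Series(["I own 15 cats and 8 dogs.", "Our house has 2 bedrooms."])
pattern = r"(?P<numbers>\d+)"
extractedall = my_strings.str.extractall(pattern)

# Each row is labelled by a pair:
# (position of the string in my_strings, match number within that string).
print(extractedall.index.tolist())

# Turn the "match" index level into an ordinary column.
flat = extractedall.reset_index(level="match")
print(flat)

# Select only the second match (match == 1) found in each string.
second_matches = extractedall.xs(1, level="match")
print(second_matches)
```

Only the first string has a second number, so second_matches should contain a single row for it.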
Hope this clears it up!
oh wow thank you very much!