Capture group -- and capture a single character?

I am having trouble understanding the capture group concept here. Thank you in advance for your help!

When I do

pattern = r"\[\w+\]"
tag_titles = titles[titles.str.contains(pattern)]

it returns every whole title line that contains the pattern. If I add () to the pattern, the result is the same:

pattern = r"(\[\w+\])"
tag_titles = titles[titles.str.contains(pattern)]

So what does () really do? Which group does it capture here?

When I try to extract the pattern from the Series,

pattern = r"\[\w+\]"
tag = titles.str.extract(pattern)

I get an error: ValueError: pattern contains no capture groups

When I try

pattern = r"(\[\w+\])"
tag = titles.str.extract(pattern)

it returns:

tag_freqSeries (<class ‘pandas.core.series.Series’>)
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN

20094 NaN
20095 NaN
20096 NaN
20097 NaN
20098 NaN
Name: title, Length: 20099, dtype: object

When I do

pattern = r"(\[\w+\])"
tag_freq = titles.str.extract(pattern).value_counts()

it returns
title
[pdf] 276
[video] 111
[audio] 3

If I move the () inside, as in pattern = r"\[(\w+)\]", it returns the tag names without the []. To me, this does not look like a capture group doing anything special; it seems that () only affects which part of the match is shown?
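A minimal sketch of the three behaviours described in this question. The `titles` Series below is made up (the real data set is not shown in the post), and `expand=False` is passed so that `str.extract` returns a Series like the output shown above; recent pandas versions return a DataFrame by default.

```python
import warnings

import pandas as pd

# Hypothetical titles; the original data set is not part of this post.
titles = pd.Series(
    ["Intro to regex [pdf]", "Conference talk [video]", "No tag here"],
    name="title",
)

# str.contains only answers "does the pattern match somewhere?", so the
# parentheses make no difference. pandas may warn that the group is unused.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", UserWarning)
    mask_plain = titles.str.contains(r"\[\w+\]")
    mask_group = titles.str.contains(r"(\[\w+\])")

# str.extract returns only the part of the match inside the parentheses.
with_brackets = titles.str.extract(r"(\[\w+\])", expand=False)
without_brackets = titles.str.extract(r"\[(\w+)\]", expand=False)

print(mask_plain.equals(mask_group))
print(with_brackets.tolist())
print(without_brackets.tolist())
```

In both extract patterns the whole `\[\w+\]` has to match; the position of the parentheses only decides which part of that match is handed back.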

hi @candiceliu93

Have you been able to solve/understand this by now?
In any case, apologies for the delayed response. I get :cold_sweat: when it comes to regex, for that matter a lot of other things in Data Science :grimacing:

"()" denotes a capture group, as in, whatever part of the string/text you want to extract goes inside the "()" brackets. Whatever lies outside the () still has to match, but it will not be included in the result.

Your pattern looks for a word which is contained within square brackets. I am not sure if you have understood the difference between "\[ <some text> \]" and "[ <some text> ]". The former matches literal square brackets in the text. The latter gives regex a set of characters, any one of which may match.
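To make the difference between escaped and unescaped brackets concrete, here is a small sketch using Python's built-in `re` module on one of the sample strings used in this thread. The variable names are only for illustration.

```python
import re

text = "sample [text] 2"

# Escaped brackets are literal characters: "[" + word characters + "]".
literal = re.search(r"\[\w+\]", text)

# Unescaped brackets build a character class: exactly ONE character
# that is either a word character or a literal "+".
char_class = re.search(r"[\w+]", text)

print(literal.group())     # [text]
print(char_class.group())  # s
```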

For example: Let’s take this dummy data frame:

import pandas as pd

df = pd.DataFrame(["sample text 1", "sample [text] 2", "sample text three",
                   "[sample] [text] 4", "sample text [5]"], columns=["dummy"])
# Label the rows "row 1" ... "row 5" so they are easy to refer to
df.index = ["row " + str(i + 1) for i in range(len(df["dummy"]))]
df

I have named the index, for ease of discussion. The data frame looks like this:
image

If we just provide the pattern - "(\w+)", it gives the following results:

df["dummy"].str.extract("(\w+)")

image

However, when we put “[” and “]” inside the capture group together with the \w+, the square brackets become part of the text to search for and extract.

df["dummy"].str.extract("(\[\w+\])")

image
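Since the screenshots are not reproduced here, the two results above can be re-created with a short sketch like this one. It rebuilds the dummy data frame from this post and, as an assumption for readability, passes `expand=False` so each result is a plain Series.

```python
import pandas as pd

df = pd.DataFrame(
    ["sample text 1", "sample [text] 2", "sample text three",
     "[sample] [text] 4", "sample text [5]"],
    columns=["dummy"],
)
df.index = [f"row {i + 1}" for i in range(len(df))]

# First run of word characters in each row.
first_word = df["dummy"].str.extract(r"(\w+)", expand=False)

# First bracketed word in each row, brackets included; NaN if there is none.
first_tag = df["dummy"].str.extract(r"(\[\w+\])", expand=False)

print(first_word)
print(first_tag)
```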

Observe in row 4, there are two text parts enclosed in "[]". So we need to provide two capture groups if we want to capture them.
(As I confessed (sheepishly) that even I am far far far away from really grasping the regex universe, there must be a better and optimized code available to do this. Please do let me know as well, if you know or have found one! :slight_smile: )

df["dummy"].str.extract(r"(\[\w+\])\s(\[\w+\])")

However, this only gives us a result for row 4.
image

And when I try this code, it includes row 2 and row 5, but omits row 4.

df["dummy"].str.extract("(\w+)\s(\[\w+\])")
image

The reason is that this pattern means: "a word followed by one space, then [, then a character/word inside, then ]". In row 4 the first word is followed by "]" rather than a space, so nothing matches.
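On the question of a better way to get both bracketed words in row 4: one option, offered as a sketch rather than the definitive answer, is `Series.str.extractall`, which returns every match instead of only the first. `Series.str.findall` is a similar alternative that returns one list per row. The Series name `dummy` simply mirrors the column used above.

```python
import pandas as pd

dummy = pd.Series(
    ["sample text 1", "sample [text] 2", "sample text three",
     "[sample] [text] 4", "sample text [5]"],
    index=[f"row {i + 1}" for i in range(5)],
)

# One output row per match; the second index level ("match") counts
# the matches found within each original row.
all_tags = dummy.str.extractall(r"(\[\w+\])")

# Alternative: one list of matches per original row (empty if none).
tag_lists = dummy.str.findall(r"\[\w+\]")

print(all_tags)
print(tag_lists)
```

With this approach a single capture group is enough, and rows with one, two, or more bracketed words are all handled by the same pattern.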

Do let me know, if I caused more confusion rather than help.

Hi @Rucha
I read your reply twice carefully. Thank you for your detailed explanation. It did solve part of my questions. Thank you!!

I am still not sure that I understand the capture group concept here. ( ) is to capture a group. Does "group" here mean returning multiple words as a group, or multiple rows as a group? If I do pattern="(\w+\s\w+)" in your example, it will return all words from each row because each row matches the pattern. Am I correct?

[ ] means "or", not "and". If I do pattern="[TtSs]ext", it means that as long as one of those letters is followed by "ext", it will be returned, so in your example only "text" in each row will be returned. But I tried pattern="([\w+])" and only "s" is returned for each row. How does that work? I used (). Shouldn't it return words (the capture group) instead of a single letter?

I used your df dataset and tried the code below. I found that row 4 only returns [sample]. Why does it not show [text] as well?
pattern="(\[\w+\])"
df["dummy"].str.extract(pattern)

output:
row 1 NaN
row 2 [text]
row 3 NaN
row 4 [sample]
row 5 [5]

I also tried the code below, but the error says the pattern contains no capture groups. Why can't it work?
pattern="[\w+]"
df["dummy"].str.extract(pattern)

Hi @candiceliu93

Okay… this response contains some homework for you! :stuck_out_tongue:

Yup, "(\w+\s\w+)" returns text wherever the pattern matches exactly: a run of word characters, followed by a single whitespace character, followed by another run of word characters. So texts like “sample text” would match, and texts like " s a " would also match. The extra spaces at the start and end won’t matter, as the regex is able to find and match the "s a" part.
However, if we increase the space between s and a to more than one, the pattern will fail to match. (Try this and let me know)
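The single-space point can be checked with a quick sketch using the built-in `re` module. The last line uses `\s+`, which is not in the pattern discussed above; it is shown only as one possible way to allow several spaces.

```python
import re

one_space = " s a "
two_spaces = " s  a "

# \s matches exactly one whitespace character.
match_one = re.search(r"(\w+\s\w+)", one_space)
match_two = re.search(r"(\w+\s\w+)", two_spaces)

# \s+ matches one or more whitespace characters.
match_two_plus = re.search(r"(\w+\s+\w+)", two_spaces)

print(match_one.group(1))       # s a
print(match_two)                # None
print(match_two_plus.group(1))  # s  a
```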

About "([\w+])": "\w" is a character class that covers upper- and lower-case letters, the digits 0 to 9, and the underscore. The "+" next to "\w", as in '\w+', doesn’t look for the same word/character again, but rather repeats the search for the entire character class, so it matches one or more word characters in a row. This page has a detailed explanation. But the catch here is that the extract method gives us only the first occurrence of the pattern match.
Inside square brackets, however, "+" is no longer a repetition operator but a literal plus sign. So "[\w+]" essentially says “find one character that is either alphanumeric OR a +”. It found something with the very first character and returned just that one character.
Try with this modification pattern="([\w]+)" and observe the results.
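A small sketch of the comparison suggested above, using the built-in `re` module on one of the sample strings. It also hints at the "no capture groups" error: "[\w+]" without parentheses contains a character class but no group, so `str.extract` has nothing to return.

```python
import re

text = "sample text 1"

# "+" inside the class: one single character is captured.
single_char = re.search(r"([\w+])", text).group(1)

# "+" outside the class: the class is repeated, so a whole word is captured.
whole_word = re.search(r"([\w]+)", text).group(1)

# The brackets are not needed around a lone \w; this is equivalent.
same_thing = re.search(r"(\w+)", text).group(1)

# Inside the class, "+" is an ordinary character that can itself be matched.
plus_sign = re.search(r"([\w+])", "+abc").group(1)

print(single_char, whole_word, same_thing, plus_sign)
```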

This again is due to the first successful match. For row 4, it happens with [sample] itself. But observe row 2 and row 5. In row 2, "sample" does not match "[ alphanumeric ]", so the search moves ahead, past the space, to "[text]", which matches and hence is extracted.
In row 5 the search continues until it reaches [5], and hence [5] is extracted as the output.

Add 3 rows to the data frame and try this pattern again:

1. sample[ ]text [n]
2. sample [ ] text n
3. sample [_] text [n] / n 

The [n] / n here means it’s optional, and you can substitute n with the sequence number of the row if you want.
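One possible way to run this exercise, as a sketch: the three rows are put in their own Series, with n replaced by the row numbers 6 to 8 and the optional [n] kept in the third row. Both of those choices are assumptions, since the post leaves them open.

```python
import pandas as pd

extra = pd.Series(
    ["sample[ ]text [6]", "sample [ ] text 7", "sample [_] text [8]"],
    index=["row 6", "row 7", "row 8"],
)

# Same pattern as before: first bracketed run of word characters.
tags = extra.str.extract(r"(\[\w+\])", expand=False)
print(tags)
```

The thing to watch for is which characters count as word characters for `\w`: a space between the brackets does not, while an underscore does.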

I don’t know if I understood this correctly, but the pattern is checked for each row, row by row, and not for all rows in one go. So yes, the entire string content of one row will be checked against the pattern, and the first successfully matched portion contained within () will be extracted.
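The row-by-row behaviour can be imitated by hand, which may make it easier to picture. In this sketch `first_group` is a hypothetical helper written for illustration, not a pandas function; it applies `re.search` to one string at a time and returns the first capture group.

```python
import re

import pandas as pd


def first_group(text, pattern):
    """Hypothetical helper: first capture group of the first match, else None."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


dummy = pd.Series(["sample [text] 2", "[sample] [text] 4", "sample text three"])
pattern = r"(\[\w+\])"

# Check each row on its own, one string at a time.
by_hand = [first_group(text, pattern) for text in dummy]

# pandas does the equivalent for us (it reports a missing match as NaN).
by_pandas = dummy.str.extract(pattern, expand=False)

print(by_hand)
print(by_pandas.tolist())
```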

The page you shared is so useful! It explains everything in such detail.

"However, if we increase the space between s and a by more than 1, the pattern will fail to match. (Try this and let me know)": I put in one more space and it shows NaN, which makes sense.

"Try with this modification pattern="([\w]+)" and observe the results.": I tried this one. As I understand it, [ ] means looking for a character class. If I want to look for a word in each line, I have to use '([\w]+)', putting the + outside the [ ].

And every search returns the first match only. If the first word/character does not match the pattern we are looking for, the search moves on to the next words until it finds a match, and then it stops.

Thank you!!