Advanced Regular Expressions : 6. BackReferences: Using Capture Groups in a RegEx Pattern

For this particular task, the pattern used to identify repeated words is:
pattern = r"\b(\w+)\s\1\b"

Why do we need a word boundary (\b) on both sides of this pattern?
Why can't we use the pattern below to get the same result?
pattern = r"(\w+)\s\1"

The video below explains how to use a backreference to identify repeated words, and the pattern it uses does not include the word boundary (\b). The backreferences topic starts at 13:40.

Please share your thoughts.

When using the pattern r"(\w+)\s\1", here are a few of the titles that came up as a match:

Fundraising Advice for YC Companies
Ask HN: Someone offered to buy my browser extension from me. What now?
Bitbucket: Support GitHub-style pages for repositories and teams
Researchers have trained a machine to spot depression on Instagram
US State Dept Issues Worldwide Travel Alert

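None of these have any repeated words, so what is the pattern matching? If you copy and paste these entries into regexr.com and use the pattern, it highlights the following parts:

"C C" in "YC Companies"
"m m" in "from me"
"r r" in "for repositories"
"on on" in "depression on"
"S S" in "US State"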

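Since the group is \w+, it matches any run of one or more word characters (letters, digits, or underscores), and \1 then requires that same run to appear again after a single whitespace character. Without word boundaries, we get all kinds of matches we don't want whenever the end of one word and the beginning of the next word are identical. The leading \b stops a match from starting in the middle of a word, and the trailing \b stops it from ending in the middle of one, so together they ensure that the pattern matches whole words instead of parts of words.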
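
If you want to check this yourself, here is a rough sketch that runs the pattern with no boundary, with only the leading \b, and with both, using Python's re module. The five titles are the ones quoted above; the `repeated` and `tricky` strings and the `matched_text` helper are made up for illustration and are not part of the course material.

```python
import re

# Titles quoted earlier in this thread; none contains a repeated word.
titles = [
    "Fundraising Advice for YC Companies",
    "Ask HN: Someone offered to buy my browser extension from me. What now?",
    "Bitbucket: Support GitHub-style pages for repositories and teams",
    "Researchers have trained a machine to spot depression on Instagram",
    "US State Dept Issues Worldwide Travel Alert",
]

# Made-up test strings (assumptions for illustration, not from the dataset).
repeated = "Paris in the the spring"  # a genuine repeated word
tricky = "Testing the theory"         # "the" is also the start of "theory"

no_boundary = re.compile(r"(\w+)\s\1")
leading_only = re.compile(r"\b(\w+)\s\1")
both = re.compile(r"\b(\w+)\s\1\b")


def matched_text(pattern, text):
    """Return the full text of the first match, or None if there is none."""
    m = pattern.search(text)
    return m.group(0) if m else None


# Without boundaries, every title produces a false positive.
for title in titles:
    print(repr(matched_text(no_boundary, title)), "|", repr(matched_text(both, title)))

# The leading \b alone still lets the match end inside "theory".
print(matched_text(leading_only, tricky))  # the the
print(matched_text(both, tricky))          # None

# A real repeated word is still found with both boundaries in place.
print(matched_text(both, repeated))        # the the
```

The loop prints the unwanted match from the plain pattern next to the result of the bounded pattern, which comes back as None for all five titles. The "Testing the theory" case is why the trailing \b matters too: with only the leading boundary, the whole word "the" followed by the first three letters of "theory" still counts as a repeat.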

I hope that helps!

Thank you for the explanation.