Problem with understanding backreferences in Advanced RegEx

Hi,
I’m learning about regex and backreferences. The lesson linked below is about finding repeated words in titles.
Screen Link:
https://app.dataquest.io/m/369/advanced-regular-expressions/6/backreferences-using-capture-groups-in-a-regex-pattern
The lesson finds repeated words in a series using these criteria:

  • We’ll define a word as a series of one or more word characters preceded and followed by a boundary anchor.
  • We’ll define repeated words as the same word repeated twice, separated by a single whitespace character.

When I tried to solve it using a backreference (\1), I wrote the code below:

pattern = r"(\b\w+\b)\s\1"
repeated_words = titles[titles.str.contains(pattern)]

But this pattern couldn’t find the repeated words. When I looked at the answer, it used the pattern below:

pattern = r"\b(\w+)\b\s\1\b"
repeated_words = titles[titles.str.contains(pattern)]

That pattern works and picks out the repeated words correctly. At first I thought I shouldn’t have included the boundary anchors in my capture group, but then I tried this pattern:

pattern = r"(\b\w+\b)\s\1\b"
repeated_words = titles[titles.str.contains(pattern)]

And it worked!!
My question is: why do I have to put a boundary anchor after my backreference when the boundary anchor was already included in my capture group?

Hey, Harati.

I admittedly didn’t read your post with 100% focus, but I think I know exactly what piece of knowledge you’re missing and you can find it in this post about this very same screen.
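
As a minimal sketch of what is going on: a backreference such as \1 re-matches only the text the group captured. \b is a zero-width assertion, so it adds no characters to the capture and is not re-checked by \1. Without a trailing \b, "the theory" counts as a repeat because "theory" starts with "the". The titles below are made up for illustration (they are not the course dataset), and plain re.search is used here as a stand-in for the per-title check done by titles.str.contains.

```python
import re

# Hypothetical titles, made up for illustration (not the course dataset).
titles = [
    "Why is the the sky blue",       # a genuine repeated word
    "Notes on the theory of types",  # "the" followed by "theory"
    "Nothing repeated here",
]

# The three patterns from the question.
patterns = {
    "no trailing anchor": r"(\b\w+\b)\s\1",
    "anchors outside group": r"\b(\w+)\b\s\1\b",
    "anchors inside group": r"(\b\w+\b)\s\1\b",
}

# For each pattern, collect the titles in which it matches somewhere.
results = {
    name: [t for t in titles if re.search(pattern, t)]
    for name, pattern in patterns.items()
}

for name, matched in results.items():
    print(f"{name}: {matched}")

# The group holds only the word's characters; the anchors add nothing to it.
captured = re.search(patterns["anchors inside group"], titles[0]).group(1)
print(repr(captured))
```

In this sketch the first pattern also flags the "the theory" title, while the two patterns that end in \b flag only the genuine repeat. That is why the anchor has to appear again after \1, whether or not \b was written inside the parentheses.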

Thank you @bruno, I had the same problem. Can you also answer this? What is the character “/” considered in regex? Is it included in “\w”? The definition of “\w” is really vague; it seems to include everything.
Also, in this pattern:

pattern = r"https?://([\w\-\.]+)"

Why is there a “\” before the dot and slash symbols after “\w”? If we don’t put it before the dot, does the dot match every character except a line break? Is the backslash used to cancel its predefined effect? What about “-”?

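It’s vague because it depends on the regex flavor and flags. It’s supposed to match word characters. With ASCII-only matching it is [a-zA-Z0-9_] (note the underscore). In Python 3, string patterns are Unicode-aware by default, so \w also matches letters such as é and ã; locale settings only come into play for bytes patterns compiled with the re.LOCALE flag. Either way, “/” is not a word character.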
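
A quick way to check what \w covers, assuming Python 3's built-in re module (the sample string and variable names are made up for illustration):

```python
import re

# Letter, underscore, digit, slash, e-acute, hyphen, dot.
sample = "a_9/\u00e9-."

# Default for str patterns in Python 3: \w is Unicode-aware.
unicode_words = re.findall(r"\w", sample)

# With re.ASCII, \w is limited to [a-zA-Z0-9_].
ascii_words = re.findall(r"\w", sample, flags=re.ASCII)

print(unicode_words)
print(ascii_words)

# "/" is punctuation, so \w does not match it under either setting.
slash_is_word_char = re.fullmatch(r"\w", "/") is not None
print(slash_is_word_char)
```

The slash, hyphen, and dot are left out of both lists, so \w does not "include everything"; it covers letters, digits, and the underscore.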

The backslash tells the regex engine to ignore the symbol’s special meaning. You learned that . matches every character except the newline character. What if you want to use . simply as a full stop?
In many cases, you need to “escape” it by preceding it with a backslash.

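In this particular instance, the backslash is not necessary for the dot, because inside a character class (square brackets) . is already literal. See this to learn why. It’s still necessary for -, because between two items in a character class a hyphen has a special meaning (it creates ranges); a hyphen placed first or last in the class doesn’t need escaping.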
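
A small sketch of these escaping rules, assuming Python's re module; the text string and the variable names are only for illustration:

```python
import re

text = "see https://app.dataquest.io/m/369 for the lesson"

# Pattern from the question, with the hyphen and dot escaped.
escaped = re.search(r"https?://([\w\-\.]+)", text).group(1)

# Same class with no escapes: the dot is literal inside [...], and a
# hyphen placed last cannot form a range.
unescaped = re.search(r"https?://([\w.-]+)", text).group(1)
print(escaped, unescaped)

# Outside a character class the dot is a wildcard unless escaped.
wildcard_hits = re.findall(r"a.c", "abc a.c")
literal_hits = re.findall(r"a\.c", "abc a.c")
print(wildcard_hits, literal_hits)

# Between two characters a hyphen makes a range; escaped, it is literal.
range_hits = re.findall(r"[a-c]", "abc-")
hyphen_hits = re.findall(r"[a\-c]", "abc-")
print(range_hits, hyphen_hits)
```

Both URL patterns capture the same host name and stop at the first "/", since a slash is neither a word character, a hyphen, nor a dot.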

Thank you very much for your explanation. I really appreciate it.