369-6 Backreference exercise

Advanced regular expressions

I understand the correct answer. But why was what I put in wrong?

pattern = r'(\b\w+\b)\s\1'
repeated_words = titles.loc[titles.str.contains(pattern)]

I am getting many extra rows in repeated_words, such as:
"Google's self-driving car is the victim in a serious crash"

I don’t understand how it matches the pattern.

Thank you for your help!

I suspect you hold a misconception, which I try to resolve in the single sub-bullet below.

Your regex matches "Google's self-driving car is the victim in a serious crash" at the "s s" portion, which starts right after Google'. It works like this:

  1. The subpattern \b\w+\b matches s (the one right after Google') because it is preceded by ' and followed by a space; both of these are non-word characters.
  2. The capture group captures s, which is the match described above.
    • Capture groups can only capture text; \b is a zero-width word-boundary assertion, not a wildcard, and it contributes no text to the capture.
  3. The space that follows (\b\w+\b) will then match the space that follows Google's.
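  4. And finally \1 will match the s at the start of self-driving, because s is what the first group captured. Nothing in the pattern requires a word boundary after \1, so it does not matter that this s is followed by more word characters.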
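
If it helps, the steps above can be checked with a minimal sketch using Python's built-in re module directly instead of pandas. The variable names are mine, and the second pattern (with \b moved outside the group and another \b after \1) is one common way to require whole-word repeats; I am assuming, not asserting, that it is in the spirit of the course's answer.

```python
import re

title = "Google's self-driving car is the victim in a serious crash"

# Pattern from the question: both word boundaries sit inside the group,
# so nothing constrains what follows the backreference \1.
loose = re.compile(r"(\b\w+\b)\s\1")
match = loose.search(title)
matched_text = match.group(0)   # full text consumed by the pattern
captured = match.group(1)       # text stored by the capture group
print(repr(matched_text), repr(captured), match.span())

# Assumed alternative: a boundary after \1 forces the repeat to be a
# whole word, so the "s" that starts "self" no longer qualifies.
strict = re.compile(r"\b(\w+)\s\1\b")
strict_on_title = strict.search(title)       # no match expected
strict_on_repeat = strict.search("the the cat")
print(strict_on_title, strict_on_repeat.group(0))
```

The capture is just the single character s: the boundaries were checked while matching but were never part of the captured text.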

You can reproduce this behavior with simpler strings and patterns if you find that helpful:

  • Pattern: (\ba\b) \1
  • String: a al

The given pattern will match the given string for the same reasons as the above example.
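
The simpler case can be run the same way. This is only a sketch with Python's re module; the names m and leftover are illustrative.

```python
import re

# Simplified pattern and string from the bullets above.
m = re.search(r"(\ba\b) \1", "a al")

whole = m.group(0)          # text consumed by the entire pattern
group_one = m.group(1)      # text held by the capture group
leftover = "a al"[m.end():]  # whatever the match did not consume

# \b was tested on both sides of the first "a", but the group holds
# only the letter itself, so \1 is satisfied by the "a" in "al".
print(repr(whole), repr(group_one), repr(leftover))
```

The leftover l shows that \1 consumed only the first letter of al and did not care what came next.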
image

Source.

8 Likes

Thank you for this explanation. I kept coming up with the same line of code that OP did and couldn’t figure out why it didn’t work, since I had the \b in the capture group. After reading your answer, I must admit that I was a little frustrated. I went back and skimmed all the lessons on word boundaries and capture groups but didn’t find anything telling me that capture groups capture only text, not word boundaries. That seems like something important to include, especially since I was trying different combinations of this regex for over half an hour because I assumed I was following all the regex syntax rules I had learned.

Thanks again!

2 Likes

Hey, Chris.

Thank you for that feedback. I’ll pass it along to the team.

One thing I’ll say (without full context of what is taught in this course) is that it’s possible this doesn’t need to be stated explicitly, as it can be explained through other means.

The other means I just mentioned are what I explain in this post. I don’t know whether we teach all of this in the course, and your suggestion is noted.

2 Likes

@ChrisMatsuoka I also went through the same agony last night. I even posted the same question on Stack Overflow, only to get my question marked as duplicate. @Bruno’s answer saved my day :slight_smile: .

2 Likes

Isn’t it good to know you’re not alone? I know it’s definitely comforting to me!

1 Like

@Bruno - Hey Bruno. I agree with these folks. Can you have the regex course looked at? The exercises are very confusing, and the explanations aren’t really there in comparison to the remainder of the lessons. If it were up to me, 30% more content pumped into this would make a big difference.

These bumps break your confidence as you self-study and can be a huge waste of time.

For anyone interested… here’s a free course on Regex on udemy.

3 Likes

Thanks for insisting on this, Eu.

I’ll take this directly to my boss.

2 Likes

Hi Bruno, could you please give the name of the site the image came from? I am trying RegExr, but the explanations look better in the image. Thanks.

The source is right below the image in my reply :slight_smile:

1 Like

How can I extend this to capturing repeating phrases?

I’m new to regex. I want to capture the pattern x(x in a text, where x is any food ingredient and may consist of a single word or multiple words. I’m currently using “\b(\w+)\s?\(\s?\1\b” as the regex to capture the pattern; however, this only works for single words, like SUGAR(SUGAR.
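
One possible direction, offered only as a sketch: let the group accept one or more space-separated words instead of a single \w+. This assumes the literal parenthesis in the pattern above is escaped as \(, that words inside an ingredient are separated by a single whitespace character, and that ingredients contain only \w characters. The name phrase_pattern and the sample strings are made up for illustration.

```python
import re

# Group 1 now allows "word" or "word word ...": one \w+ followed by
# zero or more (whitespace + \w+) repetitions in a non-capturing group.
phrase_pattern = re.compile(r"\b(\w+(?:\s\w+)*)\s?\(\s?\1\b")

samples = [
    "SUGAR(SUGAR",                  # single word, as before
    "CANE SUGAR(CANE SUGAR",        # two-word ingredient
    "RAW CANE SUGAR (CANE SUGAR)",  # only the last two words repeat
    "SALT(PEPPER",                  # no repeat, should not match
]

results = []
for text in samples:
    found = phrase_pattern.search(text)
    # Record the captured phrase, or None when nothing repeats.
    results.append(found.group(1) if found else None)

print(results)
```

Because the group is greedy and backtracks, the third sample captures CANE SUGAR, the longest phrase that is actually repeated after the parenthesis. Ingredients with hyphens or other punctuation would need a wider character class than \w.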