Advanced Regular Expression Q6

Screen Link:
https://app.dataquest.io/c/64/m/369/advanced-regular-expressions/6/backreferences-using-capture-groups-in-a-regex-pattern

My Code:

pattern = r"(\b\w+\b)\s\1"

I can’t wrap my head around why my solution (above) doesn’t work whilst the correct one (see below) does in this exercise

pattern = r"\b(\w+)\s\1\b"

Any kind soul who could provide some examples to help me understand?

Many thanks,
Andrea

Hi @moroa, it’s been a while since I’ve used regex but I’ll try to help you as best I can.

Looking at your pattern vs the one provided by DQ, we can see that the main difference is where the \b anchors sit relative to the capture group: in other words, should we include \b in the capture group or not?

I did a bit of research and discovered that \1 refers to the text the group matched, not to the group's regex! (resource) Therefore, your pattern will find “extra” matches like: Google's self driving... Here the s after the apostrophe matches the group (\b\w+\b), and then \1 matches the s in self, because \1 no longer cares about word boundaries. It only cares about the text that matched in the group (namely s).
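
To see this for yourself, here is a minimal sketch using Python's built-in re module. The sample string is just the title fragment quoted above; the variable names are my own.

```python
import re

# The pattern from the question: both \b anchors sit inside the capture group.
pattern = r"(\b\w+\b)\s\1"
title = "Google's self driving"

m = re.search(pattern, title)
print(repr(m.group(1)))  # 's'   -> the text captured by the group
print(repr(m.group(0)))  # 's s' -> the group, the space, then \1 matching the "s" of "self"
```

The \b anchors only constrain where the group itself may match. Once the group has captured s, the backreference \1 just looks for a literal s.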

Here are some other things it will find but shouldn’t:

1. No end in sight as repair work on California's sinking land costs billions
2. Niantic (Pokemon Go) appears to be hosting the entire world on one server
3. Salesforce lost 3.5 hours of customer data in instance NA14 
4. The Theory of Concatenative Combinators
5. Performance Improvements in C Code Using Micro-Optimizations

Reasoning:

  1. this is similar to my example above: the 's is followed by a word that starts with "s"
  2. capture group matches "on" and \1 matches the first two characters of "one"
  3. capture group matches "in" and \1 matches the first two characters of "instance"
  4. capture group matches "The" and \1 matches the first three characters of "Theory"
  5. capture group matches "C" and \1 matches the first character of "Code"
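
The five cases above can be checked with a short loop. This is only a sketch with Python's re module, run on the titles from my list; re.search returns the first match it finds in each title.

```python
import re

pattern = r"(\b\w+\b)\s\1"  # the pattern from the question

titles = [
    "No end in sight as repair work on California's sinking land costs billions",
    "Niantic (Pokemon Go) appears to be hosting the entire world on one server",
    "Salesforce lost 3.5 hours of customer data in instance NA14",
    "The Theory of Concatenative Combinators",
    "Performance Improvements in C Code Using Micro-Optimizations",
]

# Collect the matched text for each title.
matches = [re.search(pattern, t).group(0) for t in titles]
for number, text in enumerate(matches, start=1):
    print(number, repr(text))
# 1 's s'
# 2 'on on'
# 3 'in in'
# 4 'The The'
# 5 'C C'
```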

The DQ solution produces the desired results because the second \b, the one after the reference to \1, ensures the repeated text ends at a word boundary. We therefore get matches on whole words only, not partial ones like we see above.
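
Here is a small sketch of that difference, again with Python's re module. Two of the titles come from my list; the third one, with a genuine repeated word, is made up purely for illustration.

```python
import re

dq_pattern = r"\b(\w+)\s\1\b"  # the DQ solution: \b comes after the backreference

partial_repeats = [
    "The Theory of Concatenative Combinators",
    "Performance Improvements in C Code Using Micro-Optimizations",
]
# Hypothetical title (not from the dataset) containing a real duplicated word.
real_repeat = "Why is the the sky blue"

# The trailing \b rejects "The" + "The(ory)" and "C" + "C(ode)".
partial_results = [re.search(dq_pattern, t) for t in partial_repeats]
print(partial_results)  # [None, None]

# A true whole-word repeat is still found.
real_match = re.search(dq_pattern, real_repeat)
print(repr(real_match.group(0)))  # 'the the'
```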

Regex takes a lot of practice and I’m sorry to say that I have only scratched the surface myself. I find that using sites like regexr helps a lot for visualizing and breaking down what’s happening and why I get the results that I do.

Hope this helps and that I haven’t led you astray!


hi @mathmike314, thank you very much for responding :pray:. I’m gonna read it a few times now to digest it :grin: and let you know if any more questions!


@moroa Here’s the same take in different words, in case that’s helpful: 369-6 Backreference exercise - #3 by Bruno

The post 354-7 Regex - raw strings and special characters? - #3 by Bruno is also helpful further reading.
