Word boundary: Advanced Regular Expressions

Screen Link:
https://app.dataquest.io/m/369/advanced-regular-expressions/4/counting-mentions-of-the-c-language

My Code:

# titles is the pandas Series of story titles loaded earlier in the lesson
def first_10_matches(pattern):
    """
    Return the first 10 story titles that match
    the provided regular expression
    """
    all_matches = titles[titles.str.contains(pattern)]
    first_10 = all_matches.head(10)
    return first_10

first_10_matches(r"\b[Cc]\b")

What I expected to happen:
Only select strings with “C” and not “C++” or “C.”

What actually happened:

It selected strings like:
VW **C.**E.O. Personally Apologized to President Obama in Plea for Mercy
Lisp, **C++**: Sadness in my heart

I don't understand this: r"\b[Cc]\b" should select only a single C or c with a word boundary on each side, but neither ++ nor . is a word boundary. I think I have misunderstood the concept of a word boundary somewhere.

@bhumikagupta100366:

\b matches the start and end of the defined character set you wish to filter. It might be a bit confusing to interpret [^ ... ] as compared with [ ... ] and ^ separately, as they have different outcomes. [^ ... ] negates the character set specified within it (i.e. whatever replaces the ... placeholder), as described here.

You also need to specify which characters to negate, since enclosing the character set in \b will not by itself filter out such exceptions (it will still match a standalone lowercase or uppercase c whenever the character before or after it is a non-word character such as + or .).
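A minimal sketch of that idea, assuming the characters to exclude are . and +. The sample titles are partly made up for illustration, and plain `re` is used here instead of `titles.str.contains` so the snippet runs on its own:

```python
import re

# Sample titles: the first two come from the question, the last two are hypothetical
sample_titles = [
    "VW C.E.O. Personally Apologized to President Obama in Plea for Mercy",
    "Lisp, C++: Sadness in my heart",
    "Learning C the hard way",
    "Why I still write C",
]

# [^.+] requires one more character after the C that is neither "." nor "+"
negated_set = re.compile(r"\b[Cc]\b[^.+]")

matches = [title for title in sample_titles if negated_set.search(title)]
print(matches)
```

Note that the negated set has to consume a character, so a title that ends in a standalone C (the last one above) is not matched. A negative lookahead such as r"\b[Cc]\b(?![.+])" is one way around that, if titles like that matter.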

Hope this helps!

Why do you say so?
Word boundaries are 0-length matches at positions where a word character (\w) is adjacent to a non-word character (\W) or to the start or end of the string.
If you go to https://regex101.com/ and test the \W pattern on . or +, both of them are found to be non-word characters.
This means your r"\b[Cc]\b" has correctly matched the C in "C+" and in "C.".
No need to discuss C++ because the 2nd + is irrelevant here. \b only looks at 2 characters (1 word, 1 non-word), and if we consider C to be the word character, only the + right after it is relevant, not both of them.
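The same check can be run locally with Python's `re` module. This is just a small sketch using one of the titles from the question; the variable names are made up for illustration:

```python
import re

# "+" and "." are non-word characters, while "C" is a word character
assert re.fullmatch(r"\W", "+") and re.fullmatch(r"\W", ".")
assert re.fullmatch(r"\w", "C")

pattern = re.compile(r"\b[Cc]\b")
title = "Lisp, C++: Sadness in my heart"
match = pattern.search(title)

# \b is zero-width, so the match covers the "C" alone and never the "+"
matched_text = match.group()
matched_span = match.span()
print(matched_text, matched_span)

# With a word character right after the C there is no boundary, so no match
no_match = pattern.search("Cat")
print(no_match)
```

So the title is selected because the C on its own matches; the ++ after it is never part of the match.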

I had the same question but:
Upon rereading the definition of word boundary, I see that “it matches the position between a word and a non-word character”. + is a non-word character, so the C in C+ is matched.
I played a bit with it at Pythex (a Python regex tester) and it seems that \b does match at the position between the C and the +.
In any case, your query shows a nice grasp of word boundaries.