What's the core difference between (...) and \b

The more I dive into regex, the more confusing the difference between () and \b becomes to me.

From the Python regex documentation, I quote:

(...)

"Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed(…)"

\b

"Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of word characters. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string."

Both seem similar, and the definitions appear to describe similar behaviour, e.g.:

(example) will match the same as \bexample\b

\b just seems to have more uses: we can use a single \b, but we can’t use a ( without a closing ) in an expression. What’s the core guidance for when to use each of them? What are the limitations of each? I need more reference material to feel that I have really got it.

() is used to capture part of your pattern so you can access the contents of the group in Python code, or have later parts of the regex reference this group.

>>> import re
>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
>>> m.group(0)       # The entire match
'Isaac Newton'
>>> m.group(1)       # The first parenthesized subgroup.
'Isaac'

Without capturing groups you still get the whole match through m.group(0), but you cannot pass the individual parts of the match (here the first and last name) to the Python program for further use.

\b has a very specific meaning: it finds word boundaries. Word boundaries are “zero-length” matches, a counter-intuitive concept for people who are using regex for the first time and have the impression that every match consumes at least one character. They are used to give a stricter specification of where the matches may be found, and have nothing to do with which characters are matched.
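
To make the “zero-length” idea concrete, here is a small sketch (my own example string, not from the docs) that lists every position where \b matches. Each match has a position but an empty span.

```python
import re

text = "Isaac Newton"

# Every \b match is an empty string: it has a position but consumes no characters.
matches = list(re.finditer(r"\b", text))
positions = [m.start() for m in matches]
lengths = [m.end() - m.start() for m in matches]

print(positions)  # [0, 5, 6, 12]
print(lengths)    # [0, 0, 0, 0]
```

The four positions are the start and end of each of the two words.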

() does not match word boundaries at all. The two are independent concepts.
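
A quick sketch of why (example) and \bexample\b are not equivalent. The test string "counterexamples" is just one I picked for illustration:

```python
import re

text = "counterexamples"

# Parentheses only capture; they put no constraint on the neighbouring characters.
with_group = re.search(r"(example)", text)

# \b demands a word boundary on both sides, so "example" inside a longer word is rejected.
with_boundary = re.search(r"\bexample\b", text)

print(with_group.group(1))  # example
print(with_boundary)        # None
```

You can of course combine both, e.g. \b(example)\b, when you want the restriction and the capture.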

First thing:

How does it work that it prints both names, Isaac Newton, instead of just Isaac? What’s the logic behind the curtain?

Second thing:

Can you elaborate on that? I don’t understand this “stricter specification”, and that’s the whole point. When I test Isaac Newton in a regex tester to get closer to understanding it (pseudo-code):

###Pattern###        ###Pseudo output###
\b\w+\b \b\w+\b      one group: `Isaac Newton`
\b\w+ \w+\b          one group: `Isaac Newton`
\b(\w+) (\w+)\b      two groups: `Isaac` and `Newton`
(\w+) (\w+)          two groups: `Isaac` and `Newton`
\b(\w+) (\w+)        two groups: `Isaac` and `Newton`
(\w+) (\w+)\b        two groups: `Isaac` and `Newton`

I don’t see what \b adds. It seems that I can do exactly the same using only ().

That is just how Python designs its matching mechanism: re — Regular expression operations — Python 3.9.5 documentation
Group 0 is the entire substring matched by the whole pattern passed as the first argument to re.match; groups 1, 2, and so on are the substrings matched by each () capturing group inside it.
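
To see the group numbering concretely, here is a short sketch using the same pattern and string as in my earlier example, plus a variant without parentheses for comparison:

```python
import re

m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")

print(m.group(0))  # Isaac Newton  (everything the whole pattern matched)
print(m.group(1))  # Isaac         (first pair of parentheses)
print(m.group(2))  # Newton        (second pair of parentheses)
print(m.groups())  # ('Isaac', 'Newton')

# Same pattern without parentheses: the whole match is still there,
# but there are no sub-groups to pull out.
m2 = re.match(r"\w+ \w+", "Isaac Newton, physicist")
print(m2.group(0))  # Isaac Newton
print(m2.groups())  # ()
```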

All your test cases happen not to show the difference between using \b and leaving it out.
Usually \b is used like \b{insert readable word}\b rather than around a \w metacharacter with a + quantifier. The main problem is that combination, together with the limited test cases you give.

If you have never heard of greedy matching: + is a greedy operator, so it matches as many characters as it can. In your test string, each \w+ consumes every word character up to the end of its word, so it always stops exactly at a word boundary. It therefore doesn’t matter whether you add \b on either end or not: without \b the string matches fine, and with \b it also matches fine, because the start and end of the string count as word boundaries when a word character sits next to them (the “or between \w and the beginning/end of the string” part of the documentation). So the \b did not affect what was captured.
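
Here is a small sketch of that redundancy, using three of the patterns from your table (I use re.search here; that choice is mine, a regex tester behaves similarly):

```python
import re

text = "Isaac Newton"
patterns = [
    r"\b\w+\b \b\w+\b",
    r"\b\w+ \w+\b",
    r"\w+ \w+",
]

# Greedy \w+ already stops at the edge of each word, so adding \b changes nothing here.
results = {p: re.search(p, text).group(0) for p in patterns}

for pattern, matched in results.items():
    print(f"{pattern!r:22} -> {matched!r}")
```

All three lines report 'Isaac Newton', which is why these test cases cannot tell you what \b does.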

My point is: do not use \w+ to test. You are trying to study \b now, not \w+, so keep the rest of the pattern as predictable and limited as possible. Use simpler patterns like Isaac\b (with hardcoded strings for the parts of the pattern that are not \b), instead of throwing in many wild metacharacters that make it mentally more taxing to create test cases and reason about their results. Create the simplest test cases first and understand the principles before trying to cover more ground and expand your hypothesis.

In experimental design this is called OFAT (One factor at a time), meaning to study something, you change only that 1 thing and keep all others constant.

Go to https://regex101.com/, type (Isaac)\b as the regex pattern, type Isaac and IsaacI as test strings, and observe the outputs in the MATCH INFORMATION panel. The \b will fail on the latter input.
Open the debugger and observe the 21 match steps. The engine takes the first element of the pattern and tries to match it at the first index of IsaacI.
Focus on where the blue cursor is on IsaacI: it highlights the currently matched range (which can have length 0). Every time you see a red arrow pointing backwards, the engine is restarting the matching process from the first element of the pattern, starting from a later index in IsaacI.

That’s what I mean by “stricter specification”: the characters around Isaac can make Isaac unmatchable once you add \b. When there is no \b, it doesn’t matter whether there is something around Isaac, or what it is. The pattern will just match, ignoring what’s around it.
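
If you would rather try this in Python than on regex101, a minimal sketch of the same experiment could look like this (variable names are mine):

```python
import re

with_boundary = re.compile(r"(Isaac)\b")
without_boundary = re.compile(r"(Isaac)")

outcomes = {}
for text in ["Isaac", "IsaacI"]:
    # bool(match) is True when the pattern was found, False when search returned None.
    outcomes[text] = (
        bool(with_boundary.search(text)),
        bool(without_boundary.search(text)),
    )
    print(text, outcomes[text])
```

For IsaacI the version with \b finds nothing, because the character after the c is another word character, while the version without \b still matches.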

After this you can add \b on the left side as well, prepare a list of \w and \W characters to generate boundaries in the test string, and create more test cases.
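
One way to generate such test cases, as a rough sketch. The sample characters are an arbitrary selection of mine, and the helper name is hypothetical:

```python
import re

pattern = re.compile(r"\bIsaac\b")

word_chars = ["a", "7", "_"]       # each of these matches \w
non_word_chars = [" ", ",", "-"]   # each of these matches \W


def matches_everywhere(ch):
    """True if Isaac is still found with ch placed before, after, and on both sides."""
    candidates = [ch + "Isaac", "Isaac" + ch, ch + "Isaac" + ch]
    return all(pattern.search(text) for text in candidates)


def matches_nowhere(ch):
    """True if Isaac is never found with ch placed before, after, or on both sides."""
    candidates = [ch + "Isaac", "Isaac" + ch, ch + "Isaac" + ch]
    return not any(pattern.search(text) for text in candidates)


for ch in word_chars:
    print(repr(ch), "blocks the match:", matches_nowhere(ch))
for ch in non_word_chars:
    print(repr(ch), "keeps the match:", matches_everywhere(ch))
```

Word characters next to Isaac remove the boundary and block the match; non-word characters create a boundary and keep it.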

I had heard about it, but I had forgotten. Now I see the whole picture. Thank you for the examples and your help.