Why and when to use a word boundary?

My Code:

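import re
import pandas as pd

email_variations = pd.Series(['email', 'Email', 'e Mail',
                        'e mail', 'E-mail', 'e-mail',
                        'eMail', 'E-Mail', 'EMAIL'])
# regex=True is needed in pandas 2.0+, where str.replace treats the pattern as a literal by default
email_uniform = email_variations.str.replace(r'e[-\s]?mail', 'email', flags=re.I, regex=True)
# titles is assumed to be the Series of post titles loaded earlier in the exercise
titles_clean = titles.str.replace(r'e[-\s]?mail', 'email', flags=re.I, regex=True)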

What I expected to happen:
The pattern matches all the titles that contain a variant of e-mail.

Hi,

My first regex pattern, used for email_uniform, matches all the email variants.
The same pattern, used for titles_clean, doesn’t give me the correct answer.

A few questions:
1. Why?

In the answer, I read that I have to use a word boundary: \be[-\s]?mail
2. When do I need a word boundary?
3. Why do I only need to start with a word boundary? (Is it because it starts with a word character followed by a nonword character?)
4. Can you give an example of when to only use a closing word boundary?

Looking forward to your response,
Jeroen

Hi Jeroen,

The word boundary special sequence \b matches a position rather than a character: the boundary between a word character \w and a non-word character \W, or the start or end of the string when the adjacent character is a word character. Placed next to some characters in a pattern, it requires them to sit at the beginning or at the end of a word.
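Here is a minimal sketch of that behavior with the standard re module. The sample string is made up for illustration and is not from the dataset. It also shows a case where only a closing boundary is used:

```python
import re

# \b matches a position, not a character: the spot between \w and \W,
# or the start/end of the string when a word character sits there.
text = "email voicemail"  # made-up sample string

# Without \b, "mail" is found inside both words.
no_boundary = re.findall(r"mail", text)

# With a leading \b, "mail" must start a word, so nothing matches here.
leading = re.findall(r"\bmail", text)

# With a trailing \b, "mail" must end a word, so both occurrences match.
trailing = re.findall(r"mail\b", text)

print(no_boundary)  # ['mail', 'mail']
print(leading)      # []
print(trailing)     # ['mail', 'mail']
```

So a leading \b is the one to use when the match has to begin a word, and a closing \b when it has to end one.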

Your pattern e[-\s]?mail is almost perfect for the task, but in some rare cases it matches too much. For example, in a word like voicemail it matches the “email” in the middle of the word. It also matches when one word ends with “e” and the next one starts with “mail”, as in “hate mail”. Look:

print(titles[450])
print(titles[4504])
print(titles[9006])
print(titles[11096])
print(titles[11659])
print(titles[12619])
print(titles[13432])
print(titles[17440])

Output:

Mailtrain (the open source Mailchimp clone) is getting automation support
The fine art of literary hate mail endures
N1 The extensible, open source mail client
Ask HN: Why do dev communities still use mailing lists?
Donald Trump’s voicemails hacked by Anonymous
Show HN: Undo send mail for Apple Mail
The Mailbox Lights
More Encryption, More Notifications, More Email Security

Hence, without \b at the beginning of the pattern, we’ll match some wrong titles.
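To check this on a few of the titles printed above, here is a small sketch that uses plain re instead of pandas. The names titles_sample, loose and strict are only for illustration:

```python
import re

# A few of the titles printed above, copied by hand.
titles_sample = [
    "The fine art of literary hate mail endures",
    "Donald Trump’s voicemails hacked by Anonymous",
    "The Mailbox Lights",
    "More Encryption, More Notifications, More Email Security",
]

loose = re.compile(r"e[-\s]?mail", flags=re.I)     # no word boundary
strict = re.compile(r"\be[-\s]?mail", flags=re.I)  # leading word boundary

# True where the pattern is found anywhere in the title.
loose_hits = [bool(loose.search(t)) for t in titles_sample]
strict_hits = [bool(strict.search(t)) for t in titles_sample]

print(loose_hits)   # [True, True, True, True]
print(strict_hits)  # [False, False, False, True]
```

With the leading \b, the “e” has to start a word, so “hate mail”, “voicemails” and “The Mailbox” no longer match, while “Email” still does.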

As for a closing word boundary: in general we can use one, of course, but not in this case. Some titles contain “Emails” instead of “Email”, and a closing \b would make the pattern skip them, which doesn’t seem to be what this task wants:

print(titles[19905])

Output:
Gmail Will Soon Warn Users When Emails Arrive Over Unencrypted Connections
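A quick sketch of what a closing \b would do to that title, again with plain re (the variable names are only illustrative):

```python
import re

title = "Gmail Will Soon Warn Users When Emails Arrive Over Unencrypted Connections"

# Leading boundary only: "Email" inside "Emails" is found.
open_end = re.search(r"\be[-\s]?mail", title, flags=re.I)

# Leading and closing boundary: "Email" is followed by "s",
# which is a word character, so there is no boundary and no match.
closed_end = re.search(r"\be[-\s]?mail\b", title, flags=re.I)

print(open_end.group())  # Email
print(closed_end)        # None
```

Note that “Gmail” matches neither pattern, because there is no “e” before “mail”.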
