Why didn't we use the ending word boundary in this case?

My Code:

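import re
import pandas as pd

email_variations = pd.Series(['email', 'Email', 'e Mail',
                              'e mail', 'E-mail', 'e-mail',
                              'eMail', 'E-Mail', 'EMAIL'])
pattern = r"\be[-\s]?mail\b"  # Why does the solution omit the ending word boundary?
# regex=True is passed explicitly because newer pandas versions default to regex=False
email_uniform = email_variations.str.replace(pattern, "email", flags=re.I, regex=True)
# titles is the Series of titles provided by the exercise
titles_clean = titles.str.replace(pattern, "email", flags=re.I, regex=True)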


What I expected to happen:
Successful Submission.

What actually happened:

The submission failed: titles_clean did not match the expected answer.

2 Likes

Keep in mind that - is not a word character, so in e-mail there is a word boundary on each side of the hyphen, not only at the two ends of the word.
Remember that the word boundary can occur in one of three positions:

1. Before the first character in the string, if the first character is a word character.
2. After the last character in the string, if the last character is a word character.
3. Between two characters in the string, where one is a word character and the other is not a word character.
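
To make the three positions concrete, here is a small sketch that lists every index where \b matches in a string. It uses only the standard re module; the helper name boundary_positions is made up for illustration.

```python
import re

def boundary_positions(text):
    """Return every index in text where the zero-width \\b assertion matches."""
    return [m.start() for m in re.finditer(r"\b", text)]

# "e-mail": start of string, both sides of the hyphen, end of string
print(boundary_positions("e-mail"))  # [0, 1, 2, 6]

# "email": only the start and the end, nothing in between
print(boundary_positions("email"))   # [0, 5]
```

Note that there is no boundary between two word characters, which is why nothing is reported inside "email".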
3 Likes

Does that mean that using word boundaries like r"\be[-\s]?mail\b" is wrong?

I have the same problem understanding this pattern. Why is the beginning word boundary needed but not the ending one?

1 Like

Hi guys,

I think I found the culprit here. The ending word boundary \b works fine for email_variations, but it is not accepted as the right answer for titles_clean. So let’s compare how the results differ between our approach and the accepted solution:

# The correct approach
titles_clean_1 = titles.str.replace(r'\be[-\s]?mail', 'email', flags=re.I, regex=True)

# Our approach
titles_clean_2 = titles.str.replace(r'\be[-\s]?mail\b', 'email', flags=re.I, regex=True)

# Show the titles for which the two approaches give different results
titles[titles_clean_1 != titles_clean_2]


And here’s the output:

161      Computer Specialist Who Deleted Clinton Emails...
261      Emails Show Unqualified Clinton Foundation Don...
1900     Police Emails About Ahmed Mohamed: 'This Is Wh...
2018     Emails from a CEO Who Just Has a Few Changes t...
3967     Russia Is Reportedly Set to Release Clinton's ...
...
15344      Foundation for Emails 2: Making Email Suck Less
15846    Millions of Gmail, Hotmail and Yahoo Emails an...
19905    Gmail Will Soon Warn Users When Emails Arrive ...
Name: title, Length: 20, dtype: object


Notice a pattern? The oddball here is the word Emails. There is an s after the l, and both l and s are word characters, so there is no word boundary between them. With the ending \b our pattern does not match Emails at all and leaves the title unchanged, while the solution pattern replaces the Email part and produces emails.

So when the \b word boundary is applied at both ends, the pattern is stricter than with a boundary on only one end. This is definitely easy to miss. I’m glad this question came up; making a mental note for my future self.
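
For anyone without the course data at hand, the same comparison can be reproduced with the standard re module. This is only a sketch: the titles in sample_titles are made up for illustration and are not taken from the dataset.

```python
import re

# Hypothetical titles, not from the real dataset
sample_titles = ["Emails Show Progress", "Send me an e-mail", "E Mail etiquette"]

OPEN_ENDED = re.compile(r"\be[-\s]?mail", flags=re.I)   # solution pattern
STRICT = re.compile(r"\be[-\s]?mail\b", flags=re.I)     # pattern with ending \b

# Keep only the titles where the two patterns produce different results
differing = [
    t for t in sample_titles
    if OPEN_ENDED.sub("email", t) != STRICT.sub("email", t)
]
print(differing)  # ['Emails Show Progress']
```

Only the plural title differs, because the ending \b cannot match between l and s.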

12 Likes

So I had my pattern set to: r'\be\s?-?mails?\b'

With the s? at the end before the word boundary, shouldn’t that make it work for all cases?

1 Like

@sean.d.workman Welcome to the community!

Your pattern does match all the email variations, but that’s also the problem.
I assume this is your code:

titles_clean = titles.str.replace(r'\be\s?-?mails?\b', 'email', flags=re.I, regex=True)


In the code above, you matched all the email variations, including the ones with an ‘s’ at the end, and replaced every one of them with email. The solution code, by contrast, matches only the email part of the emails variations and replaces only that part, so the trailing s survives.

For example, there’s the title: Computer Specialist Who Deleted Clinton Emails...

What the solution code does is transform it into Computer Specialist Who Deleted Clinton emails...

What your code does is transform it into Computer Specialist Who Deleted Clinton email...
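
Here is a quick way to see the two behaviours side by side with plain re.sub. The title below is a shortened, made-up stand-in for the truncated one above.

```python
import re

title = "Clinton Emails Released"  # hypothetical stand-in title

# Solution pattern: matches only the "Email" part, so the plural s is kept
solution_result = re.sub(r"\be[-\s]?mail", "email", title, flags=re.I)

# Pattern with s? before the ending \b: the s is part of the match and gets replaced too
plural_dropped = re.sub(r"\be\s?-?mails?\b", "email", title, flags=re.I)

print(solution_result)  # Clinton emails Released
print(plural_dropped)   # Clinton email Released
```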

This one is tricky. I will admit it took some thinking to figure out…

Hope this helps!

7 Likes

Ahhhh, thank you, thank you. Silly me. Thanks for clearing that up so quickly.

1 Like

No problem. Glad to be of help!

However, the provided solution is not 100% foolproof.

“EEEE E MAILMAN HAS ARRIVED” would become “EEEE emailMAN HAS ARRIVED”, which is probably an unintended replacement!

A more robust solution would be to run two str.replace() calls, handling the plural first:

titles_clean_0 = titles.str.replace(r'\be[-\s]?mails\b', 'emails', flags=re.I, regex=True)
titles_clean_1 = titles_clean_0.str.replace(r'\be[-\s]?mail\b', 'email', flags=re.I, regex=True)
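
The two-pass idea can probably be collapsed into a single pass by capturing the optional s and putting it back. The sketch below is one possible variant, not the course solution; normalize_email is a made-up helper name, and it uses plain re so it runs without the dataset.

```python
import re

# Optional plural s is captured, and the ending \b rejects longer words like MAILMAN
EMAIL_PATTERN = re.compile(r"\be[-\s]?mail(s?)\b", flags=re.I)

def normalize_email(text):
    """Replace email variants with 'email', keeping a lowercased plural s if present."""
    return EMAIL_PATTERN.sub(lambda m: "email" + m.group(1).lower(), text)

print(normalize_email("EEEE E MAILMAN HAS ARRIVED"))  # EEEE E MAILMAN HAS ARRIVED
print(normalize_email("Clinton Emails"))              # Clinton emails
print(normalize_email("Send an E-Mail"))              # Send an email
```

With pandas, the same callable replacement should be usable through titles.str.replace(EMAIL_PATTERN, ..., regex=True), but I have only checked the re version here.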