import pandas as pd

email_variations = pd.Series(['email', 'Email', 'e Mail',
                              'e mail', 'E-mail', 'e-mail',
                              'eMail', 'E-Mail', 'EMAIL'])
pattern = r"\be[-\s]?mail\b"  # Why doesn't the solution use an ending word boundary?
What I expected to happen:
What actually happened:
My result did not match the expected answer for titles_clean.
Placing the word boundary at the beginning and the end will ignore the - in the word; - is not a word character.
Remember that the word boundary can occur in one of three positions:
- Before the first character in the string, if the first character is a word character.
- After the last character in the string, if the last character is a word character.
- Between two characters in the string, where one is a word character and the other is not a word character.
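The three positions listed above can be checked directly with Python's re module. This is only a small sketch: boundary_positions is a hypothetical helper written for illustration, not part of the exercise.

```python
import re

# \b is zero-width: it matches a position between characters, not a character.
def boundary_positions(text):
    # Return every index in `text` at which \b matches.
    return [m.start() for m in re.finditer(r"\b", text)]

# "e-mail": boundaries before 'e', on both sides of '-', and after the final 'l'.
print(boundary_positions("e-mail"))  # [0, 1, 2, 6]

# "emails": only at the start and the end; there is none between 'l' and 's'.
print(boundary_positions("emails"))  # [0, 6]

# So a trailing \b right after "mail" cannot match inside "emails".
print(re.search(r"mail\b", "emails"))  # None
```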
Does that mean using word boundaries at both ends, as in r"\be[-\s]?mail\b", is wrong?
I have the same problem understanding this pattern. Why is the beginning word boundary needed but not the ending one?
I think I found the culprit here. Having the end word boundary \b is good for email_variations, but it is not accepted as the right answer for titles_clean. So let’s compare the results of our approach and the right answer's approach:
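# regex=True is needed on pandas 2.0+, where str.replace matches literally by default
# The correct approach
titles_clean_1 = titles.str.replace(r'\be[-\s]?mail', 'email', flags=re.I, regex=True)
# Our approach
titles_clean_2 = titles.str.replace(r'\be[-\s]?mail\b', 'email', flags=re.I, regex=True)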
# This returns the different results between two approaches
And here’s the output:
161 Computer Specialist Who Deleted Clinton Emails...
261 Emails Show Unqualified Clinton Foundation Don...
1900 Police Emails About Ahmed Mohamed: 'This Is Wh...
2018 Emails from a CEO Who Just Has a Few Changes t...
3967 Russia Is Reportedly Set to Release Clinton's ...
15344 Foundation for Emails 2: Making Email Suck Less
15607 Improvements to Notification Emails
15846 Millions of Gmail, Hotmail and Yahoo Emails an...
16056 Another Hack: 117M LinkedIn Emails and Passwords
19905 Gmail Will Soon Warn Users When Emails Arrive ...
Name: title, Length: 20, dtype: object
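Finding any patterns? Looks like the oddball here is the word Emails. Yeah… there’s an s there, and our approach with the end word boundary fails to match whenever a word character follows the ‘l’, in this case the plural ‘s’. There is no word boundary between ‘l’ and ‘s’, so those titles are left untouched.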
So when the \b word boundary is applied at both ends, it’s stricter than putting it on only one end. This is definitely something easy to miss. I’m glad this question came up; making a mental note for my future self.
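Here is a minimal, self-contained sketch of that comparison. The three titles are made up for illustration and are not from the course dataset, and the filtering line is only one plausible way to list the rows where the two results differ.

```python
import re
import pandas as pd

# Hypothetical sample titles, not the course's dataset.
titles = pd.Series([
    "Clinton Emails Released",
    "Making E-mail Suck Less",
    "Improvements to Notification emails",
])

# Pattern without the ending word boundary (the accepted answer).
no_end = titles.str.replace(r"\be[-\s]?mail", "email", flags=re.I, regex=True)
# Pattern with word boundaries at both ends.
with_end = titles.str.replace(r"\be[-\s]?mail\b", "email", flags=re.I, regex=True)

# Original titles for which the two approaches give different results.
differs = titles[no_end != with_end]
print(differs.tolist())  # ['Clinton Emails Released']
```

Only the title containing a capitalised plural differs: without the ending \b it becomes "Clinton emails Released", while with it the title is left unchanged.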
So I had my pattern set to: r'\be\s?-?mails?\b'
With the s? at the end before the word boundary, shouldn’t that make it work for all cases?
@sean.d.workman Welcome to the community!
Your pattern does match all the email variations, but that’s also the problem.
I assume this is your code:
titles_clean = titles.str.replace(r'\be\s?-?mails?\b', 'email', flags=re.I, regex=True)
In the code above, you found all the email variations, even the ones with ‘s’ at the end, and replaced them universally with email. But the solution code finds only the email part of the emails variations and replaces just that part, leaving the trailing ‘s’ in place.
For example, there’s the title:
Computer Specialist Who Deleted Clinton Emails...
What the solution code does is transform it into
Computer Specialist Who Deleted Clinton emails...
What your code does is transform it into
Computer Specialist Who Deleted Clinton email...
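The difference can be reproduced with plain re.sub on a shortened version of that title. The third pattern, which captures the optional s and puts it back, is an alternative that was not proposed in this thread; treat it as one possible sketch rather than the expected answer.

```python
import re

# Shortened version of the title quoted above (without the trailing "...").
title = "Computer Specialist Who Deleted Clinton Emails"

# Solution pattern: replaces only the "Email" part, so the "s" survives.
solution = re.sub(r"\be[-\s]?mail", "email", title, flags=re.I)

# Pattern with "s?" inside the match: the plural "s" is replaced away.
with_s = re.sub(r"\be\s?-?mails?\b", "email", title, flags=re.I)

# Assumed alternative: capture the optional "s" and re-insert it with \1.
keep_s = re.sub(r"\be\s?-?mail(s?)\b", r"email\1", title, flags=re.I)

print(solution)  # Computer Specialist Who Deleted Clinton emails
print(with_s)    # Computer Specialist Who Deleted Clinton email
print(keep_s)    # Computer Specialist Who Deleted Clinton emails
```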
This one is tricky, I will admit it did take some thinking to figure it out…
Hope this helps!
Ahhhh, thank you, thank you. Silly me. Thanks for clearing that up so quickly.
No problem. Glad to be of help!
However, the solution provided is not 100% foolproof.
“EEEE E MAILMAN HAS ARRIVED” would become “EEEE emailMAN HAS ARRIVED”, which might be an unintended replacement!
A more robust solution would be to run two str.replace() calls, handling the plural first:
titles_clean_0 = titles.str.replace(r'\be[-\s]?mails\b', 'emails', flags=re.I, regex=True)
titles_clean_1 = titles_clean_0.str.replace(r'\be[-\s]?mail\b', 'email', flags=re.I, regex=True)
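As a quick check of this two-pass idea, the sketch below runs both approaches on a few made-up titles, including the MAILMAN example. The sample Series and the variable names are assumptions for illustration only.

```python
import re
import pandas as pd

# Hypothetical sample titles, including the edge case mentioned above.
titles = pd.Series([
    "EEEE E MAILMAN HAS ARRIVED",
    "Clinton Emails Released",
    "Send me an e-mail",
])

# Single pass without the ending word boundary (the provided solution).
single = titles.str.replace(r"\be[-\s]?mail", "email", flags=re.I, regex=True)

# Two passes: plural forms first, then singular forms, both fully bounded.
plural_first = titles.str.replace(r"\be[-\s]?mails\b", "emails", flags=re.I, regex=True)
two_pass = plural_first.str.replace(r"\be[-\s]?mail\b", "email", flags=re.I, regex=True)

print(single.tolist())
# ['EEEE emailMAN HAS ARRIVED', 'Clinton emails Released', 'Send me an email']
print(two_pass.tolist())
# ['EEEE E MAILMAN HAS ARRIVED', 'Clinton emails Released', 'Send me an email']
```

The two-pass version leaves "E MAILMAN" alone because no word boundary follows "MAIL" there, while still normalising both the plural and the singular forms.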