Mismatch in result

Screen Link:
email replace

My Code:

import re
import pandas as pd

email_variations = pd.Series(['email', 'Email', 'e Mail',
                        'e mail', 'E-mail', 'e-mail',
                        'eMail', 'E-Mail', 'EMAIL'])

pattern = r"\be[\s-]*mail\b"

pattern1 = r"e[\s\-]?mail"

pattern2 = r"\be[\s-]*mail[Ss]*\b"

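# Recent pandas versions treat the pattern as a literal string unless regex=True is passed.
email_uniform = email_variations.str.replace(pattern, "email", flags=re.I, regex=True)

# titles is assumed to be the title column of the lesson's dataset, loaded earlier.
titles_clean = titles.str.replace(pattern, "email", flags=re.IGNORECASE, regex=True)

titles_clean_1 = titles.str.replace(pattern1, "email", flags=re.IGNORECASE, regex=True)

titles_clean_2 = titles.str.replace(pattern2, "email", flags=re.IGNORECASE, regex=True)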



mismatch1 = titles[~titles_clean.eq(titles_clean_1)]
mismatch2 = titles[~titles_clean_2.eq(titles_clean_1)]
print("pattern:\n", titles_clean[[161, 450, 9006]])
print("pattern1:\n", titles_clean_1[[161, 450, 9006]])
print("pattern2:\n", titles_clean_2[[161, 450, 9006]])

What I expected to happen:
I expected that “source Mailchimp” would not be matched, so I wrote pattern2 = r"\be[\s-]*mail[Ss]*\b". When I submitted it, the test case failed.

In the answer section, the pattern r"e[\s-]?mail" is given, but it has some side effects.
It matched the titles below:
source Mailchimp
open source mail client

This is wrong; it should not match those.
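To make the side effect concrete, here is a minimal sketch using plain `re.sub` on fragments of the two titles shown in the output (the variable names are illustrative, and pandas is left out to keep it self-contained). Without `\b`, the `e` that ends "source", followed by a space and "mail", looks just like "e mail".

```python
import re

# Fragments of two titles from the output in this post; names are illustrative.
titles_sample = [
    "Mailtrain (the open source Mailchimp clone)",
    "N1  The extensible, open source mail client",
]

no_boundary = r"e[\s\-]?mail"        # pattern from the answer section
with_boundary = r"\be[\s-]*mail\b"   # same idea, anchored on word boundaries

for title in titles_sample:
    # The unanchored pattern glues "source" and "mail" together.
    print(re.sub(no_boundary, "email", title, flags=re.I))
    # The anchored pattern leaves both titles unchanged.
    print(re.sub(with_boundary, "email", title, flags=re.I))
```

With pandas, `Series.str.replace(..., regex=True)` should apply the same substitution to each element.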

What actually happened:
The given answer matched extra lines.

pattern:
 161     Computer Specialist Who Deleted Clinton Emails...
450     Mailtrain (the open source Mailchimp clone) is...
9006          N1  The extensible, open source mail client
Name: title, dtype: object
pattern1:
 161     Computer Specialist Who Deleted Clinton emails...
450     Mailtrain (the open sourcemailchimp clone) is ...
9006           N1  The extensible, open sourcemail client
Name: title, dtype: object
pattern2:
 161     Computer Specialist Who Deleted Clinton email ...
450     Mailtrain (the open source Mailchimp clone) is...
9006          N1  The extensible, open source mail client


Hi @eashwary:

I think you need to do it individually. You have to use both \- for the hyphen and \s for whitespace characters instead of \s-.

Hope this helps!

I think [ ] is a character class, and characters lose their special meaning inside it, so the trailing hyphen in [\s-] is already literal.
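A quick way to check this is to compare the two spellings of the class directly. This is only a sketch with made-up variable names: it assumes that a hyphen placed last inside `[ ]` is taken literally, which is how Python's `re` module treats it.

```python
import re

variants = ["e-mail", "e mail", "email"]

trailing_hyphen = re.compile(r"e[\s-]?mail")   # hyphen last in the class: literal
escaped_hyphen = re.compile(r"e[\s\-]?mail")   # hyphen escaped: also literal

for text in variants:
    a = bool(trailing_hyphen.fullmatch(text))
    b = bool(escaped_hyphen.fullmatch(text))
    # Both spellings accept exactly the same strings here.
    print(text, a, b)
```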

The problem is with the answer provided in the answer section of Dataquest. It does not use \b, so it matches “source mail”, though it should not, since that is not a variation of email. It also matches “source Mailchimp”, which is wrong.

Advanced Regular Expressions - part 4

369-5 Reg Ex Making my Brain Hurt!

I would say it is ambiguous. The instructions say that we should replace the “matches” in the list below with email.

['email', 'Email', 'e Mail',
 'e mail', 'E-mail', 'e-mail',
'eMail', 'E-Mail', 'EMAIL']

The issue with this is that the list above doesn’t contain “matches” (whatever that means), it contains strings.

It doesn’t specify whether they should match at the beginning of a word, making it ambiguous what is supposed to happen with occurrences like source Mailchimp, SecureMyEmail, hate mail, Gaggle Mail, style mailboxes, source mail, use mailing, SlideMail, Apple Mail, The Mailbox, TRACEMAIL and Source Mailbox.

I think the author’s intent was to have these occurrences remain unchanged (and I also think that’s the correct way to go about it), but alas, the solution modifies these titles. I’ll fix this soon.

However, your solution also doesn’t quite work, as it will turn the plural versions into singular, as you can see in your own output.
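The plural issue can be reproduced with a short sketch on a fragment of title 161 from the output above (plain `re.sub` instead of pandas; the variable names are illustrative). Because `[Ss]*` sits inside the match, the trailing "s" is consumed and replaced along with the rest.

```python
import re

# Fragment of title 161 as shown in the thread's output (the full title is truncated there).
title = "Computer Specialist Who Deleted Clinton Emails"

pattern2 = r"\be[\s-]*mail[Ss]*\b"

# The optional "s" is part of the match, so "Emails" becomes "email".
result = re.sub(pattern2, "email", title, flags=re.I)
print(result)
```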

Instead, I’ll go with either r"\be[\s-]*mail(?=s\b|\b)" or r"\be[-\s]?mail". The first one is more robust, as it allows us to not modify titles like The uber tool EmailSnoop (I just made this up), however, this is unnecessary for this dataset (there aren’t cases like this, I don’t think) and I don’t see that it matters for the main goal, so I’ll probably choose the second option.
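For comparison, here is a small sketch of how the two candidate patterns behave on a few strings. The samples are fragments taken from this thread plus the made-up EmailSnoop title, and plain `re.sub` stands in for the pandas call.

```python
import re

lookahead = r"\be[\s-]*mail(?=s\b|\b)"   # first option: must end the word, or be followed by a final "s"
simple = r"\be[-\s]?mail"                # second option: anchors only the start of the word

samples = [
    "Clinton Emails",
    "open source mail client",
    "The uber tool EmailSnoop",   # made-up title from the post above
    "E-Mail me",
]

for text in samples:
    by_lookahead = re.sub(lookahead, "email", text, flags=re.I)
    by_simple = re.sub(simple, "email", text, flags=re.I)
    print(f"{text!r}: {by_lookahead!r} | {by_simple!r}")
```

Both options keep the plural "s" and leave "source mail" alone; on these samples they differ only on the EmailSnoop case, where the second option rewrites the prefix.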