Question on str.extract() issue while working out exercise 354-11

https://app.dataquest.io/m/354/regular-expression-basics/11/challenge-using-flags-to-modify-regex-patterns

My Code:

import re
import pandas as pd

email_tests = pd.Series(['email', 'Email', 'e Mail', 'e mail', 'E-mail',
              'e-mail', 'eMail', 'E-Mail', 'EMAIL', 'emails', 'Emails',
              'E-Mails'])

pattern_email = r'\be-?\s?mails?\b'
pattern_emailx = r'(e-?\s?mails?)'
email_tests.str.contains(pattern_email,flags=re.I)
email_tests.str.extract(pattern_emailx,flags=re.I)
print(titles.str.extract(pattern_emailx, flags=re.I))
email_mentions = titles.str.contains(pattern_email,flags=re.I).sum()

What I expected to happen:
I had hoped to use str.extract() to get a sample of what I was actually pulling from titles. I would use this feedback to hone the main regex string I was using to eventually solve the question using str.contains()

What actually happened:
What I got instead was a whole column full of NaNs. This extract regex was working fine on email_tests. Why does the same regex fail on titles? I solved the actual test question, but the failure of str.extract() to work consistently troubles me and makes me think I don't really understand how it works. It also detracted from the solution, since I couldn't see what my regex was identifying in titles. What's going on here?

Output
0        NaN
1        NaN
2        NaN
3        NaN
4        NaN
        ... 
20094    NaN
20095    NaN
20096    NaN
20097    NaN
20098    NaN
Name: title, Length: 20099, dtype: object

Hi @ghighcove:

As mentioned in this article:

Square brackets match something that you kind of don’t know about a string you’re looking for. If you are searching for a name in a string but you’re not sure of the exact name you could use instead of that letter a square bracket. Everything you put inside these brackets are alternatives in place of one character.

Thus, the correct regex pattern would be r"\be[\-\s]?mails?\b". We use \- to denote a literal hyphen and \s to denote a whitespace character. Thus [\-\s]? denotes an optional hyphen or whitespace character between e and mail (or all other alternatives–uppercase or lowercase).

In your case, you didn't escape the hyphen. More generally, doing one thing more than once (i.e. putting a ? after both - and \s) is usually not good practice in computer science, as there is often a simpler way, like in this example.

Hope this clarifies!
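As a quick check of the two patterns discussed above, here is a minimal sketch that runs both against the email_tests Series from the question. In Python's re module a hyphen outside square brackets is already a literal character, so both patterns should accept every test string. The names pattern_original and pattern_class are made up for this illustration.

```python
import re
import pandas as pd

email_tests = pd.Series(['email', 'Email', 'e Mail', 'e mail', 'E-mail',
                         'e-mail', 'eMail', 'E-Mail', 'EMAIL', 'emails',
                         'Emails', 'E-Mails'])

# Hypothetical names for the two patterns being compared
pattern_original = r'\be-?\s?mails?\b'   # pattern from the question
pattern_class = r'\be[\-\s]?mails?\b'    # character-class pattern from the reply

original_hits = email_tests.str.contains(pattern_original, flags=re.I)
class_hits = email_tests.str.contains(pattern_class, flags=re.I)

# Both patterns match all twelve test strings
print(original_hits.all(), class_hits.all())

# The patterns are not identical: the original also accepts a hyphen
# followed by a space, while the character class allows only one separator.
print(bool(re.search(pattern_original, 'e- mail', flags=re.I)))
print(bool(re.search(pattern_class, 'e- mail', flags=re.I)))
```

So on this test data the two patterns agree; the character-class version is simply stricter about allowing at most one separator.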

Thanks for the help here, but why does it work if you instead use str.extract() with the same pattern_emailx = r'(e-?\s?mails?)' against the email_tests pd.Series? If it had broken there, I would at least know I was doing something wrong, but the inconsistency of it working completely (or even just in part) in one place vs. not working at all and returning 100% NaNs in the other is confusing. I sincerely appreciate the help refining my query and the best practices tip, but mechanically, what is failing in how I apply str.extract() on this line? A search on Stack Overflow finds that at least one other person had a similar issue where str.extract() was not working consistently.

email_tests.str.extract(pattern_emailx, flags=re.I)  # works consistently, returning the extracted matches
print(titles.str.extract(pattern_emailx, flags=re.I))  # 100% NaNs. Why? titles is also a pd.Series, as above.

Hi @ghighcove

If I’m not wrong, the - caused the pattern not to match any of the strings in the email_tests list. Thus the dataframe is made up of NaNs, meaning that there are no values that matched the specified pattern.

Could you provide the link for this?

Clarifying: str.extract() did work on email_tests, but did not work on titles, hence my confusion. It worked in one place but not the other, with presumably the same data type. So I could not use it effectively as a tool for refining my regex, and couldn't actually see what my regex was grabbing. Given that I have run into some strange code window issues here, where my code and the copied-and-pasted answer, though exactly the same, didn't produce the same results, I wondered if that was also the case here.

Give my code a spin, see what comes out for:
email_tests.str.extract(pattern_emailx,flags=re.I)

I get a real result:
0 email
1 Email
2 e Mail
3 e mail
4 E-mail
…
7 E-Mail
8 EMAIL
9 emails
10 Emails
11 E-Mails
Length: 12, dtype: object

So in this case, I know what my regex query should be grabbing in titles. While it may not get me to the right answer (eventually I got there, but had to peek at the answer given this issue), it would have at least let me use lessons from this mission to solve the challenge.

What isn't working is the same str.extract() against titles. I get that full list of NaNs. Why?
So this:
print(titles.str.extract(pattern_emailx, flags=re.I)) # with or without the print, the print was there because I had other lines uncommented before.
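One likely explanation for an all-NaN printout is that pandas abbreviates long Series when printing, showing only the first and last few rows. If none of the matches fall in those rows, the display looks empty even though matches exist in between. The sketch below illustrates this with a hypothetical stand-in for titles (the real Series comes from the exercise), and passes expand=False on the assumption that a Series result is wanted, as in the output shown in the question.

```python
import re
import pandas as pd

# Hypothetical stand-in for the exercise's `titles` Series: matches exist,
# but none of them sit in the first or last few rows.
titles = pd.Series(['No match here'] * 100, name='title')
titles.iloc[40] = 'Email Apps Suck'
titles.iloc[60] = 'Disposable emails for safe spam free shopping'

pattern_emailx = r'(e-?\s?mails?)'

# expand=False returns a Series when the pattern has a single capture group
extracted = titles.str.extract(pattern_emailx, flags=re.I, expand=False)

# Printing the full result is abbreviated by pandas, so the visible
# head and tail rows here are all NaN.
print(extracted)

# Dropping the NaN rows shows only what the regex actually captured.
matches = extracted.dropna()
print(matches)

# Counting non-null rows is a quick sanity check before inspecting further.
print(extracted.notnull().sum())
```

Calling dropna() (or value_counts()) on the extracted Series is a convenient way to see what a pattern is really grabbing without scrolling through thousands of rows.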

Turns out I actually found the question/issue on GitHub (and not StackOverflow):

Actually - marking this as resolved, I did a check like this and it looks like it was actually working. Sorry for the confusion, and thank you much for both the help and the pointers! Hopefully someone will read this and learn from my error. I did this to check for results right now:
titles[titles.str.extract(pattern_emailx, flags=re.I).notnull()]

Which got this output:
119 Show HN: Send an email from your shell to your…
161 Computer Specialist Who Deleted Clinton Emails…
174 Email Apps Suck
261 Emails Show Unqualified Clinton Foundation Don…
313 Disposable emails for safe spam free shopping
…
19303 Ask HN: Why big email providers don’t sign the…
19395 I used HTML Email when applying for jobs, here…
19446 Tell HN: Secure email provider Riseup will run…
19838 Petition to Open Source Mailbox
19905 Gmail Will Soon Warn Users When Emails Arrive …
Name: title, Length: 151, dtype: object
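One thing worth noticing in the output above: "Petition to Open Source Mailbox" is listed even though it never mentions email. The extract pattern pattern_emailx has no \b word boundaries, so it can match the "e" at the end of "Source", the space, and "Mail". The sketch below compares it with a grouped version of the str.contains pattern on two titles copied from that output; the name pattern_email_grouped is made up for this illustration.

```python
import re
import pandas as pd

# Two titles copied from the output above
sample = pd.Series(['Email Apps Suck', 'Petition to Open Source Mailbox'])

pattern_email = r'\be-?\s?mails?\b'            # pattern used with str.contains
pattern_emailx = r'(e-?\s?mails?)'             # pattern used with str.extract
pattern_email_grouped = r'(\be-?\s?mails?\b)'  # hypothetical: contains pattern plus a capture group

# Without word boundaries, "Source Mailbox" yields the capture 'e Mail'
loose = sample.str.extract(pattern_emailx, flags=re.I, expand=False)
print(loose)

# With word boundaries, the second title no longer matches
strict = sample.str.extract(pattern_email_grouped, flags=re.I, expand=False)
print(strict)

# str.contains with the original bounded pattern agrees with the strict extract
contains = sample.str.contains(pattern_email, flags=re.I)
print(contains)
```

If the goal is to preview what str.contains() will count, wrapping the exact same pattern in parentheses for str.extract() keeps the two in step.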