import re
import pandas as pd

# Spelling variants used to check the patterns
email_tests = pd.Series(['email', 'Email', 'e Mail', 'e mail', 'E-mail', 'e-mail',
                         'eMail', 'E-Mail', 'EMAIL', 'emails', 'Emails', 'E-Mails'])

pattern_email = r'\be-?\s?mails?\b'     # used with str.contains()
pattern_emailx = r'(e-?\s?mails?)'      # same idea with a capture group, for str.extract()

email_tests.str.contains(pattern_email, flags=re.I)
email_tests.str.extract(pattern_emailx, flags=re.I)

print(titles.str.extract(pattern_emailx, flags=re.I))
email_mentions = titles.str.contains(pattern_email, flags=re.I).sum()
What I expected to happen:
I had hoped to use str.extract() to get a sample of what I was actually pulling from titles. I would then use that feedback to hone the main regex string, and eventually solve the question using str.contains().
What actually happened:
What I got instead was a whole column full of NaNs. This extract regex was working fine on email_tests. Why does the same regex fail on titles? I solved the actual test question, but the fact that str.extract() doesn't seem to work consistently troubles me and makes me think I don't really understand how it works. It also got in the way of the solution, because I couldn't see what my regex was actually identifying in titles. What's going on here?
Output
0        NaN
1        NaN
2        NaN
3        NaN
4        NaN
        ...
20094    NaN
20095    NaN
20096    NaN
20097    NaN
20098    NaN
Name: title, Length: 20099, dtype: object
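One possible explanation, offered as a sketch rather than a confirmed diagnosis: str.extract() returns NaN for every row where the pattern does not match, and printing a long Series shows only the first and last five rows. If most titles never mention email, those ten rows can easily all be NaN even though the regex matches elsewhere. The example below uses a small made-up Series, sample_titles, as a stand-in for titles (which isn't shown in this post), and drops the non-matching rows to see what was actually captured.

```python
import re
import pandas as pd

# Hypothetical stand-in for `titles`; the real data is not shown in the post.
sample_titles = pd.Series([
    "Show HN: A new text editor",
    "Why e-mail is still broken",
    "Ask HN: Best laptop for coding?",
    "Encrypted Emails for everyone",
    "The history of Unix",
], name="title")

pattern_emailx = r'(e-?\s?mails?)'

# expand=False returns a Series when the pattern has a single capture group.
extracted = sample_titles.str.extract(pattern_emailx, flags=re.I, expand=False)

# Rows without a match are NaN; dropna() keeps only the captured text.
matches = extracted.dropna()
print(matches)

# Tally the distinct spellings the pattern captured.
print(matches.value_counts())
```

Applying the same .dropna() to the result on titles should show whether the pattern is matching there at all.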