Hi
I just did exercise 11. Challenge: Using Flags to Modify Regex Patterns. My answer was accepted by the DQ platform, but it’s different from the DQ answer. Can someone more advanced in regex check my code? I don’t know what to think about it.

email_mentions = titles.str.contains(r'\be(-?\s?)mails?\b', flags=re.I).sum()


pattern = r"\be[\-\s]?mails?\b"
email_mentions = titles.str.contains(pattern, flags=re.I).sum()


It’s not about whether or not your answer is correct. It clearly is since it was accepted by the platform.

It’s about which edge cases your regex accounts for versus theirs.

As you know, [] matches any one character from the set inside it.

You are not using []. You use () instead, and inside it you put a ? after each character. And as you know, ? means you match 0 or 1 instances of the character before that ?.

Both work, given the kind of inputs we have. Both of them will detect email, e-mail, or e mail.

However, if the input is e- mail, then yours will work but DQ’s won’t. DQ’s pattern matches the - and then expects mail right away. The space will make it assume that it’s not a match because the [] only look for either - or the space.

But since you are checking for 0 or 1 instance of both - and space, yours will detect it.

So, from that input’s perspective, yours is likely the better solution.

Similarly, if the input was e -mail, then neither will work, because in your case the order becomes a factor. Your pattern looks for - first, and then a space. Since the space comes first in the input, the - is still left over once the space has been matched, and there’s no match.
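To make the comparison concrete, here is a minimal sketch that runs both patterns over a few inputs. It uses the standard `re` module instead of `titles.str.contains`, and the `samples` list is a hypothetical stand-in for the course's `titles` Series, not the real data.

```python
import re

# Hypothetical sample inputs standing in for the course's `titles` Series.
samples = ["email", "e-mail", "e mail", "E-Mails", "e- mail", "e -mail"]

group_pattern = r"\be(-?\s?)mails?\b"   # optional hyphen, then optional whitespace
class_pattern = r"\be[\-\s]?mails?\b"   # at most one character: hyphen or whitespace

def matches(pattern, text):
    """Return True if the pattern is found anywhere in text, ignoring case."""
    return re.search(pattern, text, flags=re.I) is not None

# Map each input to (group pattern matched, character-class pattern matched).
results = {s: (matches(group_pattern, s), matches(class_pattern, s)) for s in samples}

for text, (group_hit, class_hit) in results.items():
    print(f"{text!r:10} group={group_hit!s:5} class={class_hit}")
```

Only "e- mail" separates the two patterns: the group version accepts it and the character-class version does not. "e -mail" is rejected by both.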

I thought that [\-\s]? checks every possibility, like:
0 0
0 1
1 0
1 1
But, as you said:

[quote=“the_doctor, post:2, topic:553275”]
The space will make it assume that it’s not a match because the [] only look for either - or the space.[/quote]
I just checked, and it isn’t because the ? is outside the []: the same thing happens with [\-?\s?], so it’s the “nature” of []. It’s mad what happens behind the curtain, or, to be more accurate, there are so many hidden factors in so many places that I have no idea what’s going on. Is it based on any simple logic?

[\-\s] will match exactly one of the two, - or a space. So, e-mail or e mail will match. That’s a 0 1 or a 1 0 situation. It won’t match a 0 0 or a 1 1 situation.

Adding the ?, [\-\s]?, will match [\-\s] between 0 and 1 times.

So, e-mail or e mail or email will match.

That’s a 0 1, 1 0, or 0 0 situation, and not a 1 1 situation.
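The 0/1 cases above can be checked directly. This is a small sketch using `re.fullmatch` on just the separator part; the names `patterns`, `candidates`, and `table` are made up for illustration. It also includes the [\-?\s?] variant from the question, where the ? sits inside the brackets and is therefore treated as a literal question mark rather than a quantifier.

```python
import re

patterns = {
    "one_char": r"[\-\s]",        # exactly one hyphen or whitespace character
    "optional": r"[\-\s]?",       # zero or one such character
    "qmark_inside": r"[\-?\s?]",  # inside [], ? is a literal question mark
}

# "" is 0 0, "-" is 1 0, " " is 0 1, "- " is 1 1; "?" checks the literal case.
candidates = ["", "-", " ", "- ", "?"]

# fullmatch requires the whole string to match, which isolates the separator logic.
table = {
    name: [bool(re.fullmatch(pat, text)) for text in candidates]
    for name, pat in patterns.items()
}

for name, row in table.items():
    print(f"{name:13}", row)
```

None of the three accepts "- ", because a character class always consumes a single character no matter how many alternatives it lists.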

If you wanted 1 1 as well, then something like the following might work:

[\-\s]{0,2}

That will match [\-\s] between 0 and 2 times. So it matches email, e-mail, e mail, e--mail, e -mail, e- mail, and e mail (this last one should have double spaces, but the editor doesn’t display it as such).
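A quick way to confirm that list is to drop the {0,2} quantifier into the full pattern and test each variant. This is only a sketch: the `variants` list is hypothetical, and "e - mail" (three separator characters) is added as a case that should fail.

```python
import re

# Same pattern as before, but allowing up to two separator characters.
pattern = re.compile(r"\be[\-\s]{0,2}mails?\b", flags=re.I)

# Hypothetical inputs; the last one has three separators and should not match.
variants = ["email", "e-mail", "e mail", "e--mail", "e -mail", "e- mail", "e  mail", "e - mail"]

matched = {v: bool(pattern.search(v)) for v in variants}

for text, hit in matched.items():
    print(f"{text!r:11} {hit}")
```

The trade-off is that {0,2} accepts any combination of two separators, so unlikely spellings such as "e--mail" count as matches too.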

I would say it’s just the logic of the syntax. It can take time to get used to. Beyond a certain point, it’s best to rely on tools and experimentation. As far as I know, it’s common for professionals to struggle with regex just as much.

Brief question: why do you put the \ in front of the - ?
[-\s] worked just as fine as [\-\s]. Could you elaborate on that?

Thank you very much in advance!