Confused on solution


My Code:

Hello,
I wonder why it doesn’t work like this. email_mentions has 360 values before I add the s?$, but 0 after that. And why is it necessary to use the word boundary \b? I really didn’t get what it does. Does it search for the specified string at the start/end of a word?

Thanks a lot!


Hello @zavatevlad26,

The titles require a slightly different pattern than the one used for email_tests. Using ^ and $ works for the tests because each whole string starts with the first letter of ‘email’, which is ‘e’, and ends with its last letter, which is either ‘l’ or ‘s’. However, the strings in titles look something like this: “Show HN: Send an email from your shell to yourself without pain”. That string doesn’t start with ‘e’ and doesn’t end with either ‘l’ or ‘s’. The word ‘email’ is now part of a longer string rather than being the only word in the string, as it is in email_tests.
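Here is a minimal sketch of that difference. The exact pattern from the original post isn’t shown, so the fully anchored pattern below is an assumption, and the test strings are just illustrative stand-ins for email_tests.

```python
import re

# Assumed pattern: anchored at both ends, so the WHOLE string
# has to be "email", "Email", "emails" or "Emails".
anchored = r"^[Ee]mails?$"

title = "Show HN: Send an email from your shell to yourself without pain"

print(bool(re.search(anchored, "email")))   # True
print(bool(re.search(anchored, "Emails")))  # True
print(bool(re.search(anchored, title)))     # False
```

The title contains ‘email’, but the anchors demand that nothing comes before or after it, so the match fails.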

The word boundary \b is necessary because without it, any word that contains ‘email’ will be matched. The challenge is that ‘email’ needs to be its own word and not part of another word. For example, \bemail\b will match “My email is this” but it won’t match “My voicemail is so long” or “My emailer is awful”.

The boundary is needed at the start of the pattern, or else something like [Ee]mails?\b will capture this string at index 11659: “Donald Trump’s voicemails hacked by Anonymous”.

On the other hand, the boundary is needed at the end of the pattern, or else something like \b[Ee]mail will capture this string at index 14261: “Emailing SaaS companies to test support time”.

So both boundaries are required; omitting either one lets the pattern match longer words that merely contain ‘email’.
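A quick way to check this is to run the three variants against the titles quoted above. This is only a sketch: the short list below stands in for the real titles data, and the dictionary labels are made up for illustration.

```python
import re

# Stand-in for the real titles data: three titles quoted in this thread.
titles = [
    "Donald Trump's voicemails hacked by Anonymous",
    "Emailing SaaS companies to test support time",
    "Show HN: Send an email from your shell to yourself without pain",
]

patterns = {
    "no leading boundary": r"[Ee]mails?\b",
    "no trailing boundary": r"\b[Ee]mail",
    "both boundaries": r"\b[Ee]mails?\b",
}

# Collect the titles each pattern matches.
results = {
    label: [t for t in titles if re.search(pattern, t)]
    for label, pattern in patterns.items()
}

for label, matched in results.items():
    print(label, "->", matched)
```

Only the pattern with both boundaries keeps the ‘Show HN’ title and rejects the other two.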


Thanks a lot for the clarification. It makes sense now, but I thought it searched for a string containing [Ee], not that [Ee] had to be the first word in the sentence.


Yup, that’s how it works without the ^ and $. The [Ee] can happen anywhere in the string.

If you have ^ but no $ e.g. ‘^[Ee]-mail’, [Ee] will need to be at the beginning of the string. Examples: ‘E-mailing my secret lover’ or ‘e-mail is a no-no’.

In contrast, with $ but no ^, as in your answer, [Ee] can be anywhere as long as it is part of an ‘email’, ‘emails’, etc. that sits at the end of the string. Examples: ‘You have 100 e-mails’ or ‘My head exploded from reading that one e-mail’.

And with the word boundary, [Ee] can appear anywhere and is not restricted to the start or end of the string. The only restriction the word boundary adds is that something like \b[Ee]mail requires [Ee] to be preceded by a non-word character such as white space, or to sit at the very start of the string.
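The three cases can be compared side by side using the example strings above. The patterns are assumptions written for this sketch (the pattern from your answer isn’t shown here), and the variable names are made up.

```python
import re

samples = [
    "E-mailing my secret lover",
    "e-mail is a no-no",
    "You have 100 e-mails",
    "My head exploded from reading that one e-mail",
]

start_only = r"^[Ee]-mail"        # ^ but no $
end_only = r"[Ee]-mails?$"        # $ but no ^ (assumed form)
boundary = r"\b[Ee]-mails?\b"     # word boundaries, no anchors

def matches(pattern):
    """Return the samples that the pattern matches somewhere."""
    return [s for s in samples if re.search(pattern, s)]

print("start_only:", matches(start_only))
print("end_only:  ", matches(end_only))
print("boundary:  ", matches(boundary))
```

Note that the boundary version matches ‘e-mail is a no-no’ even though nothing precedes the ‘e’, because \b also matches at the very start of a string. It rejects ‘E-mailing’ because there is no boundary between ‘l’ and ‘i’.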

I haven’t tested it yet, but maybe it’s possible to use ^ and $ without the word boundaries for the exercise, with something like ‘^.* [Ee]-mails? .*$’. Perhaps it would work, but some slight modifications would be needed.
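As a rough check of that idea, here is a small sketch. The first pattern is the one suggested above; the second is a hypothetical adjustment (not from the exercise) that makes the surrounding text optional, and the sample strings are made up.

```python
import re

# Pattern as suggested above: needs a space on BOTH sides of the word.
suggested = r"^.* [Ee]-mails? .*$"
# Hypothetical adjustment: the text before and after becomes optional.
adjusted = r"^(.* )?[Ee]-mails?( .*)?$"

samples = [
    "You have 100 e-mails waiting",  # word in the middle
    "e-mail is a no-no",             # word at the start
    "You have 100 e-mails",          # word at the end
]

for s in samples:
    print(s, "|", bool(re.search(suggested, s)), bool(re.search(adjusted, s)))
```

The suggested pattern only matches the first sample, because it insists on a space before and after the word. The adjusted one matches all three, but it would still miss something like ‘Too many e-mails, again’ because of the comma, which \b handles without extra work.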

Another thing to consider is that your answer is broad because of the [a-zA-Z], so something like e-book or e-shop will be matched as well. I’m not sure whether it will still work with titles because I haven’t tested it yet.

Regex tends to be confusing to learn, especially considering its somewhat unnatural syntax. So please ask for more clarification if needed.
