354-11 Regular Expressions Basic

For the challenge on page 11, I used the pattern,

my_pattern = r'\be-?\s?mail\b'

But your pattern is the following:-

pattern = r"e[-\s]?mail"

With my pattern, I got all True on the test list, but it yields 108 “email” results on the target list instead of your 151. Can you explain your pattern, especially the first backslash after “e”?

Thank you.

(PS. Although I am including the backslash in your pattern code above, it is not showing up)

\b assert position at a word boundary: (^\w|\w$|\W\w|\w\W)
e matches the character e literally (case sensitive)
-? matches the character - literally (case sensitive)
? Quantifier — Matches between zero and one times, as many times as possible, giving back as needed (greedy)
\s? matches any whitespace character (equal to [\r\n\t\f\v ])
? Quantifier — Matches between zero and one times, as many times as possible, giving back as needed (greedy)
mail matches the characters mail literally (case sensitive)
\b assert position at a word boundary: (^\w|\w$|\W\w|\w\W)
’ matches the character ’ literally (case sensitive)

e matches the character e literally (case sensitive)
Match a single character present in the list below [-\s]?

  • - matches the character - literally (case sensitive)
  • \s matches any whitespace character (equal to [\r\n\t\f\v ])
  • ? Quantifier — Matches between zero and one times, as many times as possible, giving back as needed (greedy)

mail matches the characters mail literally (case sensitive)

You have to escape the character, since \ is used to introduce special sequences in a regular expression.

\\ matches the character \ literally (case sensitive)

You can use the code below to test which strings match a specific pattern.
Simply change the regex string to the pattern you want to try, and put the text to search in test_str.

# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re

regex = r"e[-\s]?mail"

test_str = ""

matches = re.finditer(regex, test_str, re.MULTILINE)

for match_num, match in enumerate(matches, start=1):
    print("Match {match_num} was found at {start}-{end}: {match}".format(match_num=match_num, start=match.start(), end=match.end(), match=match.group()))

    # Group numbers start at 1; group 0 is the whole match.
    for group_num in range(1, len(match.groups()) + 1):
        print("Group {group_num} found at {start}-{end}: {group}".format(group_num=group_num, start=match.start(group_num), end=match.end(group_num), group=match.group(group_num)))

# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.

Sorry that last bit of code is more complex than my level but I will try to run it on my console.

As for my original question, the answer is still unclear. I am referring to the first backslash after “e”, the one before “-” in brackets.

If my question still isn’t clear then never mind.

Thank you though

\ is a backslash, which escapes characters. That is, \ signals that the character after it has a special meaning instead of its ordinary one.

\s means any whitespace character (space, tab, newline and so on).
The \ is what distinguishes a normal “s” character from the “\s” whitespace class.
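Here is a small sketch of that point in Python. The sample strings are made up, and the [\-\s] spelling is only my assumption about how the pattern looked before the forum dropped the backslash. It shows that \ changes what “s” means, while inside square brackets a backslash before the dash changes nothing.

```python
import re

# "s" matches the letter s; "\s" matches one whitespace character.
letter_hits = re.findall(r"s", "is a test")
space_hits = re.findall(r"\s", "is a test")
print(letter_hits)  # ['s', 's']
print(space_hits)   # [' ', ' ']

# Inside [...], an escaped dash and a dash placed first both mean a literal dash.
escaped = re.compile(r"e[\-\s]?mail")    # assumed original spelling
unescaped = re.compile(r"e[-\s]?mail")   # spelling shown in the thread
samples = ["email", "e-mail", "e mail", "e_mail"]
for text in samples:
    print(text, bool(escaped.fullmatch(text)), bool(unescaped.fullmatch(text)))
```

Both patterns agree on every sample, so the backslash before the dash is optional here.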

[-\s]

[ and ] mean that any one of the characters listed between the brackets is allowed.
These allowed characters are:

  • - a literal dash
  • \s any characters that represent a space

? means the preceding item is optional: it may appear zero or one time.

[-\s]?

(the character can be either - or a whitespace character)?
= zero or one character, which is either a dash or a whitespace character.
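A quick way to check this yourself, as a minimal sketch with made-up test strings: fullmatch only succeeds when the whole string fits the pattern, so it shows that at most one separator is accepted.

```python
import re

pattern = re.compile(r"e[-\s]?mail")

# fullmatch requires the entire string to fit the pattern.
candidates = ["email", "e-mail", "e mail", "e--mail", "e- mail"]
results = {text: bool(pattern.fullmatch(text)) for text in candidates}
for text, ok in results.items():
    print(repr(text), ok)
```

The first three strings are accepted (no separator, a dash, a space). The last two are rejected because they contain two separator characters.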

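The 1st pattern is both more restrictive and more permissive than the 2nd pattern.
Restrictive because of the \b: it only matches where a word starts with “e” and ends with “mail”, so “emails” is rejected. Permissive because you broke up [-\s]? into -?\s?, which makes it possible for both a dash and a space to appear together, as in “e- mail”.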
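To make the difference concrete, here is a sketch that runs both patterns over a few invented strings (they are not from the course dataset).

```python
import re

first = re.compile(r"\be-?\s?mail\b")   # pattern from the question
second = re.compile(r"e[-\s]?mail")     # pattern from the answer

samples = ["email", "e-mail", "e- mail", "emails", "remail"]
comparison = {
    text: (bool(first.search(text)), bool(second.search(text)))
    for text in samples
}
for text, (in_first, in_second) in comparison.items():
    print(repr(text), in_first, in_second)
```

Only the 1st pattern matches “e- mail”, and only the 2nd matches “emails” and “remail”, where the word boundaries get in the way.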

Thank you. This was helpful!

I think the answer to the challenge problem should be 143 instead of 151. Even if you use the exact same code that is mentioned in the answer, you will not get 151.

Hi all,
I’ll use this thread instead of making a new post.
There is a difference between my first try and the expected answer (=143) and I think many of us experienced this difference at some point.

I think there is a misunderstanding about the expected answer, since the correct answer also counts some words such as:

  • Emailing
  • Emails
  • email-leak

It was not clear to me whether “Emailing” and “Emails” should be counted as valid or not. Maybe the exercise instructions should clarify this.

Hi
I agree with what fedepereira said.

My initial approach was pattern=r'\b(e[-\s]?mail[s]?)\b'
and it passed the list test, matching all listed items, but in the actual dataset it found 141 matches. When I removed the last \b, the pattern also included the items listed by fedepereira.
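A sketch of the effect of that trailing \b, for anyone who wants to try it. The titles are invented stand-ins for the course dataset, and re.IGNORECASE is my assumption about how the capitalised words get matched.

```python
import re

# Hypothetical stand-in for the dataset titles.
titles = [
    "Emailing is hard",
    "Emails leaked",
    "email-leak story",
    "E-mail me",
    "No match here",
]

with_boundary = re.compile(r"\b(e[-\s]?mail[s]?)\b", re.IGNORECASE)
without_boundary = re.compile(r"\b(e[-\s]?mail[s]?)", re.IGNORECASE)

# Count titles containing at least one match for each pattern.
count_with = sum(bool(with_boundary.search(t)) for t in titles)
count_without = sum(bool(without_boundary.search(t)) for t in titles)
print(count_with, count_without)  # 3 4
```

The only title that changes is “Emailing is hard”: with the trailing \b there is no word boundary between “Email” and “ing”, so it is skipped.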

BTW great course. I have been learning regular expressions from the book Automate the Boring Stuff with Python, but your course is really comprehensive.
