354-7 Regex - raw strings and special characters?

In regular expressions basics, screen 7 (Accessing the Matching Text with Capture Groups), we learn that we should use raw strings, r"string", so that sequences such as \s are not treated as special characters when we want to search for those exact characters in a string, for instance "hello\s".

But then, in the same lesson, the following regex appears:
r"(\[\w+\])"
In this case \w acts as “any digit, uppercase, lowercase or underscore character”. So why didn’t r"string" prevent this from happening?

I don’t understand the point of r"string", then, if the special characters inside it are still interpreted.

Thanks!


Python string escaping and regex special characters both use the backslash, but they are two separate concepts.

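import re
re.match('hi\\\\m', r'hi\m')  # pattern as a normal string: every backslash is doubled twice
re.match(r'hi\\m', r'hi\m')   # pattern as a raw string: doubled once, for the regex engine only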

Both give the same match.
I’m not entirely familiar with all of this escaping either. Just bear in mind that there are two layers of backslash parsing: the Python interpreter first transforms the string literal into something else, and only then does the regex engine see it. If you use r'rawstring', the Python interpreter’s transformation step is skipped.
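To make the two layers concrete, here is a small sketch using only the standard re module. The variable names are mine, not from the lesson. It checks which string the regex engine actually receives in each case.

```python
import re

# Layer 1: Python turns each literal into a string object.
plain = 'hi\\\\m'   # non-raw: each \\ in the source collapses to one backslash
raw = r'hi\\m'      # raw: backslashes are kept exactly as typed

print(plain == raw)   # both are the 5 characters h, i, \, \, m
print(len(raw))

# Layer 2: the regex engine reads \\ as "one literal backslash".
subject = 'hi\\m'     # the 4 characters h, i, \, m
print(re.match(raw, subject).group())
```

Since both literals produce the same string, the regex engine cannot tell which one was written in the source.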


Han pointed in the right direction. Let’s explore it a little more.

The first layer

Let’s begin by printing some stuff and seeing what happens:

>>> print(r"\n", "\n")
\n 

>>> print(r"\w", "\w")
\w \w

Notice how in the first print call, it first printed \n and then it printed an actual new line. In the second print call, we got the same result both times.

What we’re seeing here is Python’s interpreter parsing the inputs and modifying them according to certain rules.

For Python’s interpreter, \n is a special character. When we “raw it” we’re telling Python to interpret it literally instead of with its special meaning. That’s why we get the behavior we saw above and I repeat below.

>>> print(r"\n")
\n
>>> print("\n")

This is the first layer that Han mentioned.
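One way to see the first layer without relying on what the terminal displays is to look at lengths and repr. This is just a sketch, and the variable names are illustrative.

```python
raw_n = r"\n"    # two characters: a backslash and the letter n
plain_n = "\n"   # one character: a newline

print(len(raw_n), len(plain_n))
print(repr(raw_n), repr(plain_n))

# \w is not a Python escape, so the backslash survives even without the r prefix.
# "\\w" is written with an explicitly escaped backslash here to avoid the
# invalid-escape warning that newer Python versions emit for "\w".
print(r"\w" == "\\w")
```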

The second layer

When we enter the regex world, there is an additional layer, that of regular expressions. You can think of regular expressions as a language within a language — more specifically, as a language within Python.

This language, just like the Python interpreter, reads strings and interprets them. And just like Python, some symbols in regex have special meanings: \w, \b and \d to name a few.

Let’s explore the example you asked about. We’ll limit our experimentations to re.findall.

>>> from re import findall
>>> s = "Waddup?\nNothing much. What about you?"

We’ll use s to experiment with the pattern [\w+] and variations of it.

>>> findall(r"[\w+]", s)
['W', 'a', 'd', 'd', 'u', 'p', 'N', 'o', 't', 'h', 'i', 'n', 'g', 'm', 'u', 'c', 'h', 'W', 'h', 'a', 't', 'a', 'b', 'o', 'u', 't', 'y', 'o', 'u']

It matched every letter, and it did so because of \w. As you observed, \w didn’t lose its special meaning despite the use of a raw string. If you were to run findall("[\w+]", s), you’d get the same result.

This happens because the regex engine (the second layer mentioned above) receives the same input whether you use r"[\w+]" or "[\w+]". That’s because after the Python interpreter is done with these strings, they are indeed the same:

>>> print(r"[\w+]", "[\w+]")
[\w+] [\w+]
>>> print(r"[\w+]" == "[\w+]")
True

To repeat the point: the regex engine sees the same thing.

What you’re doing when you use raw strings is telling the first layer, the Python interpreter, not to use the special meaning of the characters.

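As it happens, \w doesn’t have a special meaning to the Python interpreter (it does to the regex engine), so when you use \w and tell Python to ignore its special meaning, there is nothing for Python to ignore. (Recent Python versions do warn about "\w" in a non-raw string as an invalid escape sequence, which is one more reason to write patterns as raw strings.)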

The same isn’t true for \n, which has a special meaning for the Python interpreter. The regex engine also happens to read the two characters \n as a newline, so with \n both layers end up agreeing.

The duality explored above is the reason why \b is so instructive: it has a special meaning for both interpreters, and the two meanings differ:

  • For Python, \b is the backspace character. Let’s see it in action:

    >>> print(r"13\b 5") #I'll refer to this line in the next bullet
    13\b 5
    >>> print("13\b 5")
    1 5
    

    We see that when using a raw string, we tell Python not to use \b's special meaning and it prints it literally. When we do not use a raw string, it prints the backspace character: the terminal steps back over the 3, and the space that follows overwrites it.

  • Let’s now explore \b in conjunction with the regex engine.

    >>> s = "13 57"
    >>> findall(r"13\b 5", s)
    ['13 5']
    
    

    We found a match despite the fact that we used raw strings and so \b's special meaning should have been abandoned.

    If you understood what I’ve explained above, this should come as no surprise. This happened because raw strings just told Python to suppress the special meaning (we saw this in the bullet point above, in the commented line), it didn’t tell the regex engine to suppress the special meaning.

    What happened was that Python got a raw string and suppressed \b's special meaning, so the value remained 13\b 5, and this is what was passed to the regex engine. In turn, the regex engine got the pattern “the digit 1, followed by the digit 3, right next to a word boundary, followed by a space, followed by the digit 5” and it found a match in 13 57, as it should.
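The two bullets above can be put side by side in a short sketch. It uses the same string and pattern as above; only the variable names are new. Without the r prefix, the pattern that reaches the regex engine contains a backspace character instead of a word boundary, so it no longer matches.

```python
import re

s = "13 57"

raw_pattern = r"13\b 5"    # backslash + b reaches the regex engine: word boundary
plain_pattern = "13\b 5"   # Python has already turned \b into a backspace character

print(len(raw_pattern), len(plain_pattern))
print(re.findall(raw_pattern, s))
print(re.findall(plain_pattern, s))
```

The second findall comes back empty because s contains no backspace character for the pattern to match.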

I hope this helps.


Now I tried the permutations with and without +, and with and without [], and got confused by []. Why are these three equal? (s = "Waddup?\nNothing much. What about you?")
findall(r"[\w+]", s) == findall(r"[\w]",s) == findall(r"\w",s)

Can you clarify what it is that you find confusing? You’re making two comparisons here (three by transitivity).

If you find something odd, there is at least one comparison that looks wrong to you. In this comparison, does any of the two results stand out as weird?

I don’t understand how a + goes with []. (Google isn’t showing anything.) My understanding is that [] is usually populated with individual characters, like [abc], and can match individual characters only. What happens when it holds a character class, as in [\w]?

My second point of confusion (the first equality above) is: shouldn’t the + be doing something? Why are the outputs all single letters? I was expecting findall(r"[\w+]", s) to give the same as findall(r"\w+", s), that is ['Waddup', 'Nothing', 'much', 'What', 'about', 'you'].

Got it. Basically, inside square brackets, special symbols (like + and .) lose their special meaning, while character classes such as \w still work.

More details and the relevant documentation follow from this post.

So the regex engine sees the pattern [\w+] as a set containing the character class \w and the literal symbol +.
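A quick sketch of the difference, reusing s from above plus a short test string of my own ("a+b") that actually contains a plus sign:

```python
import re

s = "Waddup?\nNothing much. What about you?"
words = ['Waddup', 'Nothing', 'much', 'What', 'about', 'you']

# Quantifier outside the brackets: one or more word characters.
print(re.findall(r"\w+", s) == words)
# Same result with the class wrapped in brackets and the + kept outside.
print(re.findall(r"[\w]+", s) == words)

# Quantifier inside the brackets: + is just one more character to match.
print(re.findall(r"[\w+]", "a+b"))
print(re.findall(r"\w", "a+b"))
```

On s itself the three patterns from the question agree only because s contains no + characters.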


import re
s = """abc
"""

print(re.match("abc\n", s))
<_sre.SRE_Match object; span=(0, 4), match='abc\n'>

print(re.match(r"abc\n", s))
<_sre.SRE_Match object; span=(0, 4), match='abc\n'>

Why do both return the same result? I thought that to use \n in a regex we need to write "\n" to escape the newline character.
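A possible explanation, sketched with the standard re module: the regex engine itself understands the two-character sequence \n as a newline. So the pattern from the raw string (backslash plus n) and the pattern from the plain string (an actual newline character) are different strings, yet both match a newline. The variable names below are illustrative.

```python
import re

s = "abc\n"

plain = "abc\n"   # 4 characters; the last one is a real newline
raw = r"abc\n"    # 5 characters; ends with a backslash and the letter n

print(len(plain), len(raw))
print(re.match(plain, s).span())
print(re.match(raw, s).span())

# To match a literal backslash followed by n, escape the backslash for the regex engine.
print(re.match(r"abc\\n", s))
print(re.match(r"abc\\n", r"abc\n").group())
```

This is the one case in the thread where the Python layer and the regex layer give \n the same meaning, which is why the r prefix makes no visible difference here.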