Han pointed in the right direction. Let’s explore it a little more.
The first layer
Let’s begin by printing some stuff and seeing what happens:
>>> print(r"\n", "\n")
>>> print(r"\w", "\w")
Notice how the first print call printed \n literally and then an actual newline. In the second print call, we got the same result both times.
What we’re seeing here is Python’s interpreter parsing the inputs and modifying them according to certain rules.
For Python’s interpreter, \n is a special character. When we “raw it”, we’re telling Python to interpret it literally instead of with its special meaning. That’s why we get the behavior we saw above.
This is the first layer that Han mentioned.
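A quick way to confirm what the first layer does is to inspect the strings themselves rather than their printed form. This is only a small sketch using built-ins; the variable names are made up for illustration.

```python
# Compare what Python actually stores for a raw and a regular string literal.
raw_n = r"\n"      # two characters: a backslash and the letter n
plain_n = "\n"     # one character: a newline

print(len(raw_n), len(plain_n))   # 2 1
print(raw_n == "\\n")             # True: r"\n" is the same string as "\\n"
print(repr(plain_n))              # '\n'
```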
The second layer
When we enter the regex world, there is an additional layer, that of regular expressions. You can think of regular expressions as a language within a language — more specifically, as a language within Python.
This language, just like the Python interpreter, reads strings and interprets them. And just like Python, some symbols in regex have special meanings: \w, \b and \d, to name a few.
Let’s explore the example you asked about. We’ll limit our experiments to findall and the string s defined below, and try the pattern [\w+] and variations of it.
>>> from re import findall
>>> s = "Waddup?\nNothing much. What about you?"
>>> findall(r"[\w+]", s)
['W', 'a', 'd', 'd', 'u', 'p', 'N', 'o', 't', 'h', 'i', 'n', 'g', 'm', 'u', 'c', 'h', 'W', 'h', 'a', 't', 'a', 'b', 'o', 'u', 't', 'y', 'o', 'u']
It matched every letter, and it did so because of \w. As you observed, \w didn’t lose its special meaning despite the use of a raw string. If you were to run findall("[\w+]", s), you’d find the same result.
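One variation that may be worth trying is moving the + outside the brackets. As far as I can tell, a + inside a character class is just a literal plus sign, which is why the result above is a list of single characters. The sample string below is hypothetical, chosen only to make the difference visible.

```python
from re import findall

text = "a+b cd"   # hypothetical sample string, not the s used above

in_class = findall(r"[\w+]", text)   # + inside [] is a literal plus sign
quantified = findall(r"\w+", text)   # + outside [] means "one or more"

print(in_class)     # ['a', '+', 'b', 'c', 'd']
print(quantified)   # ['a', 'b', 'cd']
```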
This happens because the regex engine, the second layer mentioned above, receives the same input whether you use r"[\w+]" or "[\w+]". That’s because after the Python interpreter is done with these strings, they are indeed the same:
>>> print(r"[\w+]", "[\w+]")
>>> print(r"[\w+]" == "[\w+]")
To repeat the point: the regex engine sees the same thing.
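The same check can be made without relying on print. In this sketch the non-raw pattern is written with a doubled backslash, because recent Python versions warn about unrecognized escapes such as "\w" in regular strings; the resulting string is the same. The variable names are only illustrative.

```python
from re import findall

s = "Waddup?\nNothing much. What about you?"

raw_pattern = r"[\w+]"
escaped_pattern = "[\\w+]"   # explicit backslash, no raw prefix

# Both literals produce the same five characters: [ \ w + ]
same_string = raw_pattern == escaped_pattern
# So the regex engine receives identical input either way.
same_matches = findall(raw_pattern, s) == findall(escaped_pattern, s)

print(same_string, len(raw_pattern))   # True 5
print(same_matches)                    # True
```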
What you’re doing when you use raw strings is telling the first layer, the Python interpreter, not to use the special meaning of the characters.
As it happens, \w doesn’t have a special meaning to the Python interpreter (it does to the regex engine), so when you use \w and tell Python to ignore its special meaning, Python has nothing to ignore: the backslash stays in the string either way.
The same isn’t true for \n, which does have a special meaning for the Python interpreter. (The regex engine happens to read the two characters \n as a newline too, so on that one both layers agree.)
The duality explored above is the reason why \b is so instructive: it has a special meaning in both interpreters, and the two meanings differ. For Python, \b is the backspace character; for the regex engine, it is a word boundary. Let’s see it in action:
>>> print(r"13\b 5")  # I'll refer to this line again further down
>>> print("13\b 5")
We see that when using a raw string, we tell Python not to use \b's special meaning, and it prints it literally. When we do not use a raw string, Python prints the backspace character itself. On a typical terminal this shows up as the character right before it (3) being erased.
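Since the visual effect of a backspace depends on the terminal, it can help to look at the two strings without printing them to the screen as-is. A minimal sketch, with illustrative variable names:

```python
raw_b = r"13\b 5"     # backslash and b are kept as two separate characters
plain_b = "13\b 5"    # \b becomes a single backspace character (code point 8)

print(len(raw_b), len(plain_b))   # 6 5
print(repr(plain_b))              # '13\x08 5'
print(ord(plain_b[2]))            # 8
```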
Let’s now explore \b in conjunction with the regex engine.
>>> s = "13 57"
>>> findall(r"13\b 5", s)
We found a match, even though we used a raw string and so \b's special meaning should supposedly have been abandoned.
If you understood what I’ve explained above, this should come as no surprise. The raw string only told Python to suppress the special meaning (we saw this above, in the commented line); it didn’t tell the regex engine to suppress its own special meaning.
What happened was that Python got a raw string and suppressed \b's special meaning, so the value remained equal to 13\b 5, and this is what was passed to the regex engine. In turn, the regex engine got the pattern “the digit 1, followed by the digit 3, right next to a word boundary, followed by a space, followed by the digit 5”, and it found a match in 13 57, as it should.
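To make the contrast concrete, here is a small sketch that runs the same pattern with and without the raw prefix. The result names are made up for illustration.

```python
from re import findall

s = "13 57"

# Raw string: the regex engine receives backslash + b, i.e. a word boundary.
raw_result = findall(r"13\b 5", s)

# Regular string: Python has already turned \b into a backspace character,
# which does not occur in s, so the regex engine finds nothing.
plain_result = findall("13\b 5", s)

print(raw_result)     # ['13 5']
print(plain_result)   # []
```

Without the raw prefix the first layer consumes \b before the second layer ever sees it, which is why raw strings are the usual choice for regex patterns.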
I hope this helps.