
Why use a backslash with square brackets?

Screen Link:
https://app.dataquest.io/m/354/regular-expression-basics/7/accessing-the-matching-text-with-capture-groups

My Code:

pattern = r"\[(\w+)\]"
titles_tags=titles.str.extract(pattern)
tag_freq=titles_tags.value_counts()

What is the point of using a backslash before the square brackets? Don’t we need the square brackets to match text like [pdf] or [video]? Kindly help.


Square brackets [] have a special meaning in regular expressions, so to match them literally we escape them with the escape character \. This is explained here: https://app.dataquest.io/m/354/regular-expression-basics/6/character-classes

  • We can use a backslash to escape characters that have special meaning in regular expressions (e.g. \[ will match an open bracket character).

Hello,
I already read that part, but it keeps confusing me.
Does this mean that we are only searching for the strings between the square brackets, or are we also including the square brackets in our extraction?
For example,
if we use pattern=r"[(\w+)]", will this search for [pdf], [video] or for pdf, video, etc.?

You’re close but not quite there yet…

The thing we need to keep in mind is that when working with regular expressions, some characters have special meaning. So, sometimes they mean one thing, and sometimes they mean something else. What we need to decide when writing expressions is whether we want the “special meaning” or the “literal meaning” of the character. If you want the literal meaning (as in, we LITERALLY want to match a “special character”) we need to “escape its special meaning” by using the escape character: \.

In your pattern above, you haven’t put the escape character \ in front of the square brackets (which are special characters). So when you run this code, Python uses the “special meaning” of [ and ] and won’t match these characters literally.
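Here is a minimal sketch of that difference, using Python's built-in re module rather than the pandas methods from the exercise. The title string is made up for illustration.

```python
import re

# Hypothetical title, not taken from the exercise dataset.
title = "Learn regex fast [video]"

# Unescaped: the brackets define a character class that matches ONE
# character out of "(", any word character, "+", or ")".
unescaped = r"[(\w+)]"

# Escaped: literal "[", then a captured run of word characters, then literal "]".
escaped = r"\[(\w+)\]"

first_unescaped = re.search(unescaped, title).group()  # just the first word character
first_escaped = re.search(escaped, title).group()      # the whole tag, brackets included
captured = re.search(escaped, title).group(1)          # only the capture group

print(first_unescaped, first_escaped, captured)
```

The unescaped version stops after a single character because a character class never matches more than one character on its own.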

Then why does the pattern r"[(\w+)]" stop working when I change it to r"(\w+)"? Why do I get a wrong answer when I remove the square brackets completely? I mean, if we don’t want the set behaviour of the brackets and so we escape them, why does removing them give the wrong output?

It will search for strings between square brackets, but extract only the string, and not the brackets because the capture group is what is between parentheses.

So this means the square brackets must be in the pattern when we search, but instead of matching individual characters we capture the whole group of characters, such as pdf rather than p or d or f. So we escape the square brackets to turn off the individual-character matching.

We want to escape these characters because we want to match them in our search. If we remove them completely, then we are looking to match something else that doesn’t contain the characters [] in the string.

The pattern r"(\w+)" says: “our capture group is any word.” It says nothing about finding [ and ] as actual characters in our titles, which is an integral part of finding tags (i.e. things that look like [sometextwithoutspacesorlinebreaks]).
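To make that concrete, here is a small sketch with two made-up titles, again using re.search as a stand-in for Series.str.extract. The helper first_group is a hypothetical name introduced only for this example.

```python
import re

# Hypothetical titles for illustration only.
titles = ["Learn regex fast [video]", "Intro to pandas"]

no_brackets = r"(\w+)"        # captures the first word it finds, tag or not
with_brackets = r"\[(\w+)\]"  # captures a word only when it sits inside [ ]

def first_group(pattern, text):
    """Return the first capture group, or None when there is no match."""
    match = re.search(pattern, text)
    return match.group(1) if match else None

words = [first_group(no_brackets, t) for t in titles]
tags = [first_group(with_brackets, t) for t in titles]

print(words)  # the first word of each title
print(tags)   # the tag, or None when the title has no tag
```

Without the brackets, every title "has a tag", namely its first word, which is why the counts come out wrong.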

The reason we need to escape the square brackets is because we don’t want square brackets to be seen as a character class (eg [a-z] is an example of a character class) we want the square brackets (in this particular situation) to be seen as a literal square bracket character (ie [ and ] as “normal” characters). The only way to tell python that we don’t want to use the square brackets as defining a character class is to put an escape character in front of it: \[ and \]
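As a rough illustration of character class versus literal brackets, assuming a made-up sentence and the standard re module:

```python
import re

# Hypothetical text for illustration only.
text = "a [pdf] document"

# Unescaped brackets: a character class matching any single p, d, or f.
class_matches = re.findall(r"[pdf]", text)

# Escaped brackets: the literal five-character string "[pdf]".
literal_matches = re.findall(r"\[pdf\]", text)

print(class_matches)    # individual letters, including the "d" in "document"
print(literal_matches)  # the whole tag
```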

2 Likes

So, in this case sq. brackets won’t be considered as set brackets?

As long as you place an escape character in front of them, like so: r'\[(\w+)\]', the square brackets will not be considered as defining a set.

Breakdown:

  • first two characters \[ are saying "escape the special meaning of ["…in other words, it is saying to treat [ as a normal character, no special meaning
  • next character ( is signaling the start of a capture group. This is a special character and is being used as a special character because it has not been escaped.
  • next two characters \w are special in that they are a shortcut for a character class, roughly [a-zA-Z0-9_] (in Python 3, \w on text strings also matches Unicode letters and digits). Interestingly, you can think of this as "escaping the normal meaning of w" so that you don’t have to type such a big ugly character class every time you want to match a word character.
  • the + character means to match the token just before it (here \w) one or more times
  • the ) character signals the end of the capture group
  • the last two characters \] are again saying "escape the special meaning of ]" and just treat ] here as a normal character because we do want to match it in our search
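The same breakdown can be written into the pattern itself. This is only a sketch: it uses the re.VERBOSE flag from the standard library, which ignores whitespace and allows comments inside the pattern, and the sample title is made up.

```python
import re

# Same pattern as r"\[(\w+)\]", spread out with one comment per piece.
tag_pattern = re.compile(
    r"""
    \[      # literal opening square bracket (escaped, so not a set)
    (       # start of the capture group
    \w+     # one or more word characters
    )       # end of the capture group
    \]      # literal closing square bracket
    """,
    re.VERBOSE,
)

# Hypothetical title for illustration only.
match = tag_pattern.search("Show HN: my project [pdf]")
tag = match.group(1)
print(tag)
```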

We put a backslash before the first opening square bracket so that it won’t be treated as the start of a set. Is that right?

Yes, that is correct. The same logic follows for the last square bracket as well since it too has a backslash in front of it.

Then why use the square brackets with a backslash at all? Why can’t we just remove them?
Also,

\[\w+\]
and
\[(\w+)\]
give the same results on the screen linked below:
https://app.dataquest.io/m/354/regular-expression-basics/10/matching-at-the-start-and-end-of-strings
My code:

pattern1=r"^\[\w+\]"
pattern2=r"\[(\w+)\]$"
beginning_count=titles.str.contains(pattern1).sum()
ending_count=titles.str.contains(pattern2).sum()

We can’t just remove it because then it would not find things that look like: [pdf] or [video] because these strings contain the characters [].

Without a \ in front of special characters like [ and ], these characters would be treated in a special way…this is why we call them special characters: they do special things! But sometimes, like in these exercises, we do not want special characters like [] to behave specially in any way…sometimes we just want a [ or a ] to be treated like a normal character that we can search for/match on.

So how do we do that? How do we search for a character that is special? Answer: put a backslash in front of it!

One way that is helping me understand how regular expressions work with backslashes is to think of the backslash and the following character as being one character not two.

For example:
[ --> signals the start of a set
\[ --> this is a literal square bracket, this has no special meaning…it is just one character: [
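If you want to check this "one character" view yourself, here is a small sketch. It uses re.escape from the standard library, which adds the backslashes for you; the tag string is just an example.

```python
import re

# Two characters in the pattern, but they match exactly one character of text.
one_bracket = re.fullmatch(r"\[", "[")

# re.escape puts a backslash in front of every regex special character.
literal_tag = "[pdf]"
escaped_tag = re.escape(literal_tag)

# The escaped pattern matches the original text literally.
round_trip = re.fullmatch(escaped_tag, literal_tag)

print(escaped_tag)
```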


Thanks, got most of it now. But,
https://app.dataquest.io/m/354/regular-expression-basics/10/matching-at-the-start-and-end-of-strings
On the screen linked above we get the same results with and without capture groups. Why is that?
Is it because Series.str.contains also accepts a regex without capture groups?

I’m not sure I understand what you are asking/saying exactly without seeing the code that you’re using but just off the top of my head, I would say that capture groups go better with Series.str.extract() rather than Series.str.contains() since the former method returns a capture group whereas the latter returns a boolean series.

Therefore, it would make sense that defining a capture group doesn’t change your results when using Series.str.contains()…because it will return a series full of boolean values (essentially answering the question: “does the series contain the regex expression? T or F?”) and so defining a capture group doesn’t really help answer that question.
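A rough sketch of the idea, with re.search standing in for both pandas methods so it runs without pandas. The titles are made up, and the True/False list plays the role of the boolean series.

```python
import re

# Hypothetical titles for illustration only.
titles = ["[pdf] Regex cheat sheet", "Regex talk [video]", "No tag here"]

with_group = r"^\[(\w+)\]"    # tag at the start, with a capture group
without_group = r"^\[\w+\]"   # same match, no capture group

# "contains"-style question: is there a match at all? The group is irrelevant.
has_tag_grouped = [bool(re.search(with_group, t)) for t in titles]
has_tag_plain = [bool(re.search(without_group, t)) for t in titles]

# "extract"-style question: what did the group capture? Here the group matters.
extracted = []
for t in titles:
    match = re.search(with_group, t)
    extracted.append(match.group(1) if match else None)

print(has_tag_grouped == has_tag_plain)
print(extracted)
```

As far as I know, pandas even emits a warning when a pattern passed to Series.str.contains() has capture groups, suggesting str.extract instead.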


This confused me a lot the first time I read it. Thanks so much for the clear explanation :)

It is my pleasure.

Happy coding!


hi guys,

resuscitating this thread as I have a question on this very point. In principle I understand why we are using \ before the brackets. However, I’m just reading up on raw strings, and I understand that the point of adding the r character before a string is exactly to avoid adding extra \ that would make the code harder to read. But then why is it that in this exercise we have both the r character and the \?

Many thanks,
Andrea
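
For anyone landing here with the same question, a minimal sketch of the two layers involved: the r prefix only stops Python's string parser from interpreting backslashes, while the backslash in \[ is an instruction to the regex engine. Without the r prefix, each of those regex backslashes would have to be doubled.

```python
import re

raw = r"\[(\w+)\]"        # raw string: backslashes reach the regex engine as typed
plain = "\\[(\\w+)\\]"    # normal string: each backslash must be doubled

# Both spellings produce exactly the same pattern text.
same_pattern = raw == plain

# Either way, the regex engine sees \[ and treats it as a literal bracket.
tag = re.search(raw, "Regex talk [video]").group(1)

print(same_pattern, tag)
```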