I do not know how this regex works

Hello,
I would like to split http://www.interactivedynamicvideo.com/ into 3 parts, protocol, domain, and page path by using regex.
I made this regex pattern: r'(\w+):\/\/([\w.-]+)/?(.+)'.

Then I expected to get:

  • http for group 1
  • www.interactivedynamicvideo.com for group 2
  • and nothing for group 3

However, in the result, group 3 is /. I do not know why I get / in group 3. Could someone explain why?

Hey.

To examine this, let’s look at a very similar problem, only with a simpler pattern, ([a-z])0?(0), and a simpler string, b0.

We’ll use re.findall to explore.

>>> import re
>>> re.findall("([a-z])0?(0)", "b0")
[('b', '0')]

In the example above your question is transformed into asking why the result isn’t [('b', '')].

Before we answer this, it’s important to be aware that capture groups capture text. That text can be the empty string. When a pattern has no groups, re.findall returns the whole match instead, so an empty pattern shows what an empty result looks like:

>>> re.findall("", "")
['']

This is different from not matching anything. Here’s an example in which nothing matches:

>>> re.findall("123", "abc")
[]

Notice the difference. In the first example we got a list containing the empty string (it matched “nothing”). In the second example we got an empty list (nothing was matched).
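To see the same distinction with an actual capture group, here is a small sketch. The pattern (a)(b?) and the test strings are made up for illustration and are not from the question.

```python
import re

# Hypothetical pattern for illustration: group 2 is optional, so it can
# capture the empty string while the overall match still succeeds.
empty_capture = re.findall(r"(a)(b?)", "a")

# No match at all: the result is an empty list, not an empty string.
no_match = re.findall(r"(a)(b?)", "xyz")

print(empty_capture)  # [('a', '')]
print(no_match)       # []
```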

So your suggestion of group 3 capturing nothing doesn’t really work here: (.+) requires at least one character, just as (0) in the simpler pattern requires a 0. Still, let’s dive into what’s happening here.

From regular-expressions.info:

The question mark gives the regex engine two choices: try to match the part the question mark applies to, or do not try to match it. The engine always tries to match that part. Only if this causes the entire regular expression to fail, will the engine try ignoring the part the question mark applies to.

So, in re.findall("([a-z])0?(0)", "b0"), the regex engine will parse b0 and do the following:

  • [a-z] matches b
  • 0? matches 0
  • The second 0 in the pattern makes the whole thing fail because there’s nothing left to match in the string b0; the only 0 was already consumed.

At this point the regex engine backtracks: it takes the other choice for ? and skips the optional 0, resulting in the following behavior:

  • [a-z] matches b
  • 0? matches nothing
  • The second 0 in the pattern matches 0.
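One way to check this trace is to wrap the optional 0? in its own group so its match becomes visible, and then apply the same reasoning to the pattern from the question. This is only a sketch: the extra parentheses around 0? and the (.*) variant are additions for illustration, not part of the original patterns.

```python
import re

# Same simple pattern, with 0? wrapped in a group (an addition for
# illustration) so we can see that it ends up matching nothing.
simple = re.match(r"([a-z])(0?)(0)", "b0").groups()
print(simple)  # ('b', '', '0')

url = "http://www.interactivedynamicvideo.com/"

# Pattern from the question: (.+) needs at least one character, so /?
# gives the slash back and group 3 captures it.
original = re.match(r"(\w+):\/\/([\w.-]+)/?(.+)", url).groups()
print(original)  # ('http', 'www.interactivedynamicvideo.com', '/')

# Assumed variant: (.*) may match the empty string, so /? keeps the slash.
variant = re.match(r"(\w+):\/\/([\w.-]+)/?(.*)", url).groups()
print(variant)  # ('http', 'www.interactivedynamicvideo.com', '')
```

If an empty group 3 is what you want for a URL with no path, replacing (.+) with (.*) is one option, since it lets the final group match zero characters.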

I hope this helps.

Thank you so much for your answer! It now makes sense to me!
