369-9 Advanced Regular Expressions

Clicking on the image will take you to the relevant screen in the app.

[image: Screen Shot 2020-01-15 at 10.26.00 AM]

Can anyone please explain why it does not work like that?


Hey, Sherif.

Please read the following guidelines on how to ask a good question:

Asking a good question will help people get you an answer faster, and it will help the rest of the community by making it more accessible to them in many ways. I’ve done some minimal fixes to the topic, but much could still be done to improve it; it would be great if you could do it.


The short answer is that your code is correct while Dataquest’s isn’t.

To understand what is going on, we’ll focus on one of the cases where the patterns differ:

>>> import re
>>> s=("https://translate.google.com/translate"
... "?hl=en&ie=UTF8&prev=_t&sl=de&tl=en"
... "&u=http://www.golem.de/news/"
... "em-drive-der-warp-antrieb-muss-noch-warten-1606-121641.html")
>>> print(s)
https://translate.google.com/translate?hl=en&ie=UTF8&prev=_t&sl=de&tl=en&u=http://www.golem.de/news/em-drive-der-warp-antrieb-muss-noch-warten-1606-121641.html

Above I split the string into multiple lines using string literal concatenation so that the code lines wouldn’t be too long.

Now let’s create both patterns.

>>> sherifPattern=r"(https?)://([\w\.\-]+)/?(.*)"
>>> dqPattern=r"(.+)://([\w\.\-]+)/?(.*)"

We’ll use re.findall to test how each of these patterns behaves with s. We start with your pattern.

>>> sherif = re.findall(sherifPattern, s)[0]
>>> sherif
('https', 'translate.google.com', 'translate?hl=en&ie=UTF8&prev=_t&sl=de&tl=en&u=http://www.golem.de/news/em-drive-der-warp-antrieb-muss-noch-warten-1606-121641.html')

And now Dataquest’s:

>>> dq = re.findall(dqPattern, s)[0]
>>> dq
('https://translate.google.com/translate?hl=en&ie=UTF8&prev=_t&sl=de&tl=en&u=http', 'www.golem.de', 'news/em-drive-der-warp-antrieb-muss-noch-warten-1606-121641.html')

Let’s take a look at each of the entries in the result one at a time:

>>> for pair in zip(sherif, dq):
...     print(*pair, sep="\n")
...     print("-"*79)
... 
https
https://translate.google.com/translate?hl=en&ie=UTF8&prev=_t&sl=de&tl=en&u=http
-------------------------------------------------------------------------------
translate.google.com
www.golem.de
-------------------------------------------------------------------------------
translate?hl=en&ie=UTF8&prev=_t&sl=de&tl=en&u=http://www.golem.de/news/em-drive-der-warp-antrieb-muss-noch-warten-1606-121641.html
news/em-drive-der-warp-antrieb-muss-noch-warten-1606-121641.html
-------------------------------------------------------------------------------

We see that, for the string s, your pattern captures the protocol as https, which is correct, whereas Dataquest’s pattern captures something much, much longer. So what happened here?

Basically, in Dataquest’s pattern, (.+)://([\w\.\-]+)/?(.*), the (.+) part is greedy: it matched as much as it could while still leaving the rest of the pattern something to match. It therefore ran all the way to the last :// in the string rather than stopping at the first one.

Since s is essentially two consecutive links, the first group of Dataquest’s pattern captured everything up to and including the protocol (http) of the second link.
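One way to check this, as a quick sketch rather than anything official: the first group should end exactly where the last :// in the string begins. The variable names below are only for illustration; the string is the same s as above.

```python
import re

s = ("https://translate.google.com/translate"
     "?hl=en&ie=UTF8&prev=_t&sl=de&tl=en"
     "&u=http://www.golem.de/news/"
     "em-drive-der-warp-antrieb-muss-noch-warten-1606-121641.html")

dq_pattern = r"(.+)://([\w\.\-]+)/?(.*)"
match = re.search(dq_pattern, s)

# Greedy (.+) first grabs the whole string, then backtracks until "://"
# (and the rest of the pattern) can match. Backtracking from the right
# means the first "://" it reaches is the LAST one in the string.
greedy_end = match.end(1)        # index where group 1 stops
last_separator = s.rfind("://")  # index of the last "://"
first_separator = s.find("://")  # index of the first "://"

print(greedy_end == last_separator)  # True
print(first_separator)               # 5, right after "https"
```

So the group does not stop at index 5, where a human reader would expect the protocol to end.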

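This happened specifically because of (.+). The Python documentation for the re module explains what is going on: the *, + and ? quantifiers are all greedy, meaning they match as much text as possible, and adding ? after a quantifier makes it match as little text as possible instead.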

Another way of fixing Dataquest’s pattern is to include a ? to make + non-greedy, like so: (.+?)://([\w\.\-]+)/?(.*).
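Here is a small sketch comparing the non-greedy version with your pattern on the same string. The names lazy_pattern and sherif_pattern are just mine for this example.

```python
import re

s = ("https://translate.google.com/translate"
     "?hl=en&ie=UTF8&prev=_t&sl=de&tl=en"
     "&u=http://www.golem.de/news/"
     "em-drive-der-warp-antrieb-muss-noch-warten-1606-121641.html")

sherif_pattern = r"(https?)://([\w\.\-]+)/?(.*)"
lazy_pattern = r"(.+?)://([\w\.\-]+)/?(.*)"  # +? is the non-greedy form of +

sherif = re.findall(sherif_pattern, s)[0]
lazy = re.findall(lazy_pattern, s)[0]

# The non-greedy group stops at the first "://" it can use.
print(lazy[0])         # https
print(lazy[1])         # translate.google.com
print(sherif == lazy)  # True
```

The two patterns agree on this string, but they are not equivalent in general: (.+?) accepts any protocol, such as ftp, while (https?) only accepts http and https.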

I hope this helps.


That’s a very in-depth answer. Thank you, Bruno.

I have another question. I came to a similar answer as the OP:
r"(https?)://([\w\-\.]+)\/?(.+)?"

I understand that + looks for “one or more of” while * looks for “zero or more of”, so + is the more specific of the two. However, using (.+)? fills the empty rows in the page path column with NaN. That seems like it would be potentially more useful than an empty cell.

Is there a reason it’s doing this, and is there any instance where it might be preferable? Thanks.

I can’t think of any serious advantages of one over the other. One is NaN, the other is the empty string. Each of these has its own advantages and disadvantages.
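As for why it happens, here is a minimal sketch using plain re. With (.+)?, the whole group is optional, so when there is no page path the group does not take part in the match at all and comes back as None. With (.*), the group always takes part and matches the empty string. As far as I know, pandas reports a group that did not take part as NaN, which would explain what you see in the column. The URL below is a made-up example.

```python
import re

star_pattern = r"(https?)://([\w\-\.]+)\/?(.*)"
plus_pattern = r"(https?)://([\w\-\.]+)\/?(.+)?"

url = "http://example.com"  # hypothetical URL with no page path

star_groups = re.search(star_pattern, url).groups()
plus_groups = re.search(plus_pattern, url).groups()

# (.*) matches zero characters, so the group is the empty string.
print(star_groups)  # ('http', 'example.com', '')

# (.+)? skips the group entirely, so the group is None.
print(plus_groups)  # ('http', 'example.com', None)
```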


That makes good sense. I suppose I could always fill empty strings with NaN if needed. Thanks.