369-8 Advanced regular expressions - Mission 8 Extracting Domains - Regex Question

Screen page: https://app.dataquest.io/m/369/advanced-regular-expressions/8/extracting-domains-from-urls

Here is my code:

import pandas as pd

test_urls = pd.Series([
 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
 'http://www.interactivedynamicvideo.com/',
 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
 'http://evonomics.com/advertising-cannot-maintain-internet-heres-solution/',
 'HTTPS://github.com/keppel/pinn',
 'Http://phys.org/news/2015-09-scale-solar-youve.html',
 'https://iot.seeed.cc',
 'http://www.bfilipek.com/2016/04/custom-deleters-for-c-smart-pointers.html',
 'http://beta.crowdfireapp.com/?beta=agnipath',
 'https://www.valid.ly?param',
 'http://css-cursor.techstream.org'
])

pattern = r'(?<=//)(.*\.\w{2,4})'

test_urls_clean = test_urls.str.extract(pattern)

print(test_urls_clean)

My pattern picks up all of the URLs correctly except for the 3rd and 7th URLs. What I would like to do is add something like (?!/) or [^/]. Basically, is there a way to write this using my syntax above such that the last part \w{2,4} DOES NOT contain “/”?

Thanks for the help!
David

Hey, David.

Did you mean something else when you referenced \w{2,4} in the last paragraph? Because that pattern certainly doesn’t match /.

I think what you’re looking for is (?<=//)([^/]*\.\w{2,4}) but I’m not sure because I don’t fully understand your question. This won’t pass the screen as is, though; I’ll let you work on that.

Let me know if you need more help.

Thank you so much for getting back to me. Yes, what you put in there is correct. However, when I run it I get:

0                                      www.amazon.com
1                     www.interactivedynamicvideo.com
2      www.nytimes.com/2007/11/07/movies/07stein.html
3                                       evonomics.com
4                                          github.com
5        phys.org/news/2015-09-scale-solar-youve.html
6                                        iot.seeed.cc
7   www.bfilipek.com/2016/04/custom-deleters-for-c...
8                               beta.crowdfireapp.com
9                                        www.valid.ly
10                          css-cursor.techstream.org

Indices 2, 5 and 7 have the entire URL. I can’t figure out why they do when the end of my regex is \w{2,4}. That is why I was trying to figure out some way to say that the character could not be a /.

Could you point me in the right direction? Am I headed down a road with this regex that I just can’t solve?

Thanks!
David

Hey, David.

Continuing from my suggestion, the regex is very likely salvageable. Did you try solving this screen again? What issues did you run into?

The end of your regex pattern says that the match should end with two, three or four word characters (letters, digits or underscores). And they do! The matches at indices 2, 5 and 7 all end with html. You may want to ask something else, but I’m not sure what it is.
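To make this point concrete, here is a small sketch that runs the original pattern from the first post on the three URLs in question. It uses the standard re module instead of pandas purely to keep the example self-contained; the variable names are only illustrative.

```python
import re

# The original pattern from the first post: a greedy `.*`, then a literal
# dot, then two to four word characters.
pattern = r'(?<=//)(.*\.\w{2,4})'

urls = [
    'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
    'Http://phys.org/news/2015-09-scale-solar-youve.html',
    'http://www.bfilipek.com/2016/04/custom-deleters-for-c-smart-pointers.html',
]

matches = [re.search(pattern, url).group(1) for url in urls]

for m in matches:
    # Print the captured text and the word characters that follow its last dot.
    print(m, '->', m.rsplit('.', 1)[1])
```

Each capture ends in a dot followed by the four word characters html, so the \w{2,4} part is satisfied even though the capture runs well past the domain.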

In any case, regular expressions 101 can be a useful resource. Do let me know if you have further questions.

Thank you so much again for getting back to me. Maybe I can ask my question a different way:

My assumption is that my regex says this (written out in english):

The capture group will be preceded with two forward slashes. Start capture group. Select any set of characters that are NOT a forward slash that are followed immediately by a period. Directly after the period there should be between 2 - 4 word characters and then the capture group ends.

Is my description of the regex accurate?

It sounds like my understanding of what I wrote may be incorrect and the regex is skipping the first period in favor of the last period in the phrase. If so, why does it do that?

Thanks again!
David

Note that part of this is actually meaningless if you think about it. What does it mean to select “any set of characters that are not a forward slash”? Specifically, what do you mean by “any”? Is it one character? Two? You’re selecting something, so it needs to be properly specified.

Other than the issue above, your understanding is correct.
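For reference, the English description can be lined up against the pattern piece by piece. The sketch below assumes the description refers to the [^/]* version suggested earlier in the thread, and writes it with re.VERBOSE so each part can carry a comment; the name domain_pattern is just for illustration.

```python
import re

domain_pattern = re.compile(r"""
    (?<=//)      # lookbehind: the match must be preceded by two forward slashes
    (            # start of the capture group
      [^/]*      # zero or more characters that are not a forward slash (greedy)
      \.         # a literal period
      \w{2,4}    # two to four word characters
    )            # end of the capture group
""", re.VERBOSE)

match = domain_pattern.search(
    'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0'
)
print(match.group(1))  # www.nytimes.com
```

The only part the description leaves open is how many non-slash characters to take, which is exactly what the * quantifier decides.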

Why the regex engine favors one of multiple ways of matching is very much related to what I commented above. There are in fact multiple ways of matching, as you implied with the use of the word “any”. Why should it do what you want? :)

The * symbol is what we call greedy: it tries to match as much as possible, just as long as the rest of the pattern still provides a match. That’s why it ends up matching up to the last . it finds that still makes things work.
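A short sketch of how the quantifier changes the result on one of the test URLs, again using the standard re module. It compares the greedy .*, its lazy counterpart .*? (not used elsewhere in this thread, included only for contrast), and the negated character class [^/]*.

```python
import re

url = 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0'

# Greedy: `.*` grabs as much as it can, then backs off to the LAST dot
# that is followed by two to four word characters.
greedy = re.search(r'(?<=//)(.*\.\w{2,4})', url).group(1)

# Lazy: `.*?` grabs as little as it can, so it stops at the FIRST dot
# that is followed by two to four word characters.
lazy = re.search(r'(?<=//)(.*?\.\w{2,4})', url).group(1)

# Negated class: still greedy, but it can never cross a forward slash.
no_slash = re.search(r'(?<=//)([^/]*\.\w{2,4})', url).group(1)

print(greedy)    # www.nytimes.com/2007/11/07/movies/07stein.html
print(lazy)      # www.nyti
print(no_slash)  # www.nytimes.com
```

Note that simply making the quantifier lazy overshoots in the other direction here; restricting which characters the quantifier may consume is what keeps the match inside the domain.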

The Python documentation does a good job of explaining this, so I’ll just paste the relevant part here:

[image: screenshot of the relevant passage from the Python documentation]

This post should also be very helpful in understanding some of the nuances of what we’ve been discussing.