Advanced RegEx - 8. Extract URL - How can it cut right before the path?

Screen Link:

Correct answer from DQ:

pattern = r"https?://([\w\-\.]+)"

I still don’t understand how the capturing group can “cut off” the domain right before the “/” and the path.

By my logic, the pattern above should capture EVERYTHING after the protocol, so my wrong answer actually included something like [com, org] at the very end of the pattern.

Please help me understand this.

Thanks a lot in advance…!

Hi @bhw2690 and welcome back to the community!

This is a very good question and shows me that you’re almost there…you just need to follow this question up with this one: “What does my capture group actually capture and why isn’t it capturing the forward slash and everything after it?”

So let’s break down what your capture group captures! The set inside it, [\w\-\.], matches one of three things: \w or \- or \. Let’s skip the first one and just examine the second two. \- and \. are escaped characters, meaning the set will match either a literal hyphen (-) or a literal period (.).

Now what about that first one? \w matches on what exactly? I think if you can answer this question and combine it with the above observations, you’ll have your answer! If not, let me know and we can try something else to help answer your question.


Oh…! \w only covers letters and digits (and the underscore), not everything…!
I must have memorized the definition wrong.

Thank you @mathmike314 !

Yes, exactly! You’re welcome and I’m glad you figured it out. To be honest, I thought the same thing when I did these exercises and had the same “lightbulb” moment you did regarding \w (I thought it covered ALL characters too).
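For anyone landing on this thread later, here is a minimal sketch in Python of why the match stops at the slash. The sample URLs are made up for illustration and are not taken from the exercise data.

```python
import re

# The pattern from the accepted answer
pattern = r"https?://([\w\-\.]+)"

# Hypothetical sample URLs (assumptions, not from the exercise dataset)
urls = [
    "https://www.example.com/path/to/page",
    "http://sub-domain.example.org/index.html?x=1",
]

# group(1) is whatever the capture group consumed after the protocol
domains = [re.search(pattern, u).group(1) for u in urls]
print(domains)

# "/" is not a word character, a hyphen, or a period,
# so the set cannot consume it and the match ends there
slash_matches = re.fullmatch(r"[\w\-\.]", "/") is not None
print(slash_matches)
```

The set never "cuts" anything out on purpose. It simply keeps consuming characters until it reaches one it is not allowed to match, which in a URL with a path is the first "/".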


Hello Mike, I was also curious: what is the + at the end for? I’ve seen this used with sets. Does it just mean more than one for the entire set of \w, -, and .? Sorry if that is confusing. I need to utilize the community more.

@drewlujan33, I’ll try to answer your question on Mike’s behalf.

Yes, the + applies to the entire set: it means one or more characters in a row, each of which can be a word character (a-zA-Z0-9_, as mentioned here), a hyphen, or a period.
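A small sketch of the difference the + makes, assuming a made-up URL. Without the quantifier the set matches exactly one character; with it, the set repeats for as long as it can.

```python
import re

# Hypothetical URL used only for illustration
text = "https://my-site.example.com/about"

# No quantifier: the set matches exactly one character
one_char = re.search(r"https?://([\w\-\.])", text).group(1)

# With +: the set repeats one or more times, mixing letters, hyphens, and periods
one_or_more = re.search(r"https?://([\w\-\.]+)", text).group(1)

print(one_char)
print(one_or_more)
```

Each repetition can pick a different member of the set, which is why letters, a hyphen, and periods can all appear in the same captured string.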

Sets as in those denoted with curly brackets, i.e. {...}?

Honestly, I couldn’t have come up with this solution.

I tried to match each word before the \. and tried many approaches, but my pattern couldn’t handle both cases.
I’d like to ask how you arrived at this solution. I want to learn your thought process so I can apply it to other cases.

My apologies for the late response and I’m afraid I do not have a “perfect answer” for you here. That said, I think it’s important to think about RegEx like it’s a language; when you first learn the basics, it’s difficult to have a detailed in-depth conversation. But, with more and more practice it becomes easier and easier. Have you tried using a tool like:

These kinds of tools give excellent real-time feedback which can help you learn the logic that RegEx uses. My best advice: practice a lot and just keep going! :sunglasses:
