Advanced Regular Expressions: Extracting Domains from URLs

My understanding is that :// matches those characters literally in the pattern r"https?://([\w\-\.]+)", which means the pattern should select the colon and the two forward slashes of the URL, but it doesn’t. See the output below.

This is what my understanding of the pattern is:

  • https? means it’s optional, matches the preceding character zero or many times.

  • :// matches the characters literally so should be part of the output like this ://

  • ([\w\-\.]+) is a capturing group; it captures whatever is inside the parentheses.

Please help me understand this, correct me if I misunderstood the whole pattern entirely :grinning:.

Thanks a lot.

Your understanding is close but perhaps just needs a bit of tweaking…

The ? means to match the previous character zero times or once. So https? will match http and https but, strictly speaking, doesn’t match httpss. (Note: on httpss it will match the first five characters, https, but not that last s.)
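If it helps to see this in action, here is a minimal sketch using Python’s standard re module. The three sample strings are just made up for illustration:

```python
import re

# `s?` makes only the preceding "s" optional: zero or one occurrence.
pattern = re.compile(r"https?")

samples = ["http", "https", "httpss"]  # illustrative inputs, not from the exercise

# match.group(0) is the exact text the pattern consumed from the start of the string
matched = [pattern.match(text).group(0) for text in samples]

for text, hit in zip(samples, matched):
    print(text, "->", hit)

# Requiring the WHOLE string to fit the pattern rejects the extra "s"
whole_string_match = re.fullmatch(r"https?", "httpss")
print(whole_string_match)
```

The last check should show that httpss as a whole does not fit the pattern, even though its first five characters do.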

You are correct that :// matches those characters literally. However, they sit outside your capture group, so they will not be displayed with the name of the website.
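Here is a small sketch of the difference between what the pattern matches and what the group captures. The URL is a hypothetical example, and I’m assuming plain re calls rather than whatever the exercise screen uses:

```python
import re

pattern = r"https?://([\w\-\.]+)"
url = "https://www.example.com/path/page.html"  # hypothetical URL for illustration

match = re.search(pattern, url)

whole = match.group(0)   # everything the pattern matched, including "https://"
domain = match.group(1)  # only the text inside the parentheses

print(whole)
print(domain)

# With exactly one capture group, re.findall returns just the captured text
found = re.findall(pattern, url)
print(found)
```

So the :// is consumed by the match, but only the contents of the parentheses come back when you ask for the group.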

Correct! The big question is: what will this capturing group match on?

Hey @mathmike314

Thanks a lot for clearing up my mental block. I really appreciate you taking the time to dissect it for me.

I think I am good for this one, ([\w\-\.]+):

  • () Capturing group

  • [] Set

  • \w Matches any word character [a-zA-Z0-9_]

  • \. Matches the character literally

  • \- Matches the character literally

  • + Matches the previous b/w one and many times.
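To double-check the breakdown above, here is a rough sketch that runs just the set on a few strings. The sample strings are invented for illustration and are not from the exercise data:

```python
import re

# The set from the capture group: word characters, hyphens, and dots, one or more times
domain_chars = re.compile(r"[\w\-\.]+")

samples = [
    "www.my-site.com/index.html",   # should stop at the first "/"
    "sub_domain.example.org:8080",  # should stop at ":"
    "example.com?q=1",              # should stop at "?"
]

results = [domain_chars.match(s).group(0) for s in samples]

for s, r in zip(samples, results):
    print(s, "->", r)

# "+" needs at least one character from the set, so a leading "/" gives no match
no_match = domain_chars.match("/path")
print(no_match)
```

The match keeps going while it sees characters from the set and stops at the first character that is not in it.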

Thank you, my pleasure to have been some help to you! Your understanding looks solid, well done. Although, I’m not sure what “b/w” refers to…

I’m not sure if you’ve seen this site yet or not but it can be a really helpful tool.