Extracting domains from URL using RegEx

I used the following pattern to extract the domain of a url:

pattern = r'https?://([^/?]+)'

Basically, it captures everything after the scheme up to the first ‘/’ or ‘?’.

The pattern used in the solution:

pattern_og = r"https?://([\w\-\.]+)"

It captures nothing but word characters (letters, digits and underscore), ‘-’ and ‘.’.
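To show what I mean, here is a small sketch comparing the two patterns on a few ordinary URLs. The URLs in `sample_urls` and the helper `first_group` are made up for illustration (they are not from the exercise), and on inputs like these both patterns seem to give the same result:

```python
import re

pattern = r'https?://([^/?]+)'        # my pattern: stop at the first '/' or '?'
pattern_og = r"https?://([\w\-\.]+)"  # solution: only word chars, '-' and '.'

# Hypothetical URLs, chosen only to illustrate the common case.
sample_urls = [
    "https://www.example.com/path/page.html",
    "http://sub-domain.example.org",
    "https://example.com?query=1",
]

def first_group(pat, url):
    """Return the first capture group of pat in url, or None if no match."""
    match = re.search(pat, url)
    return match.group(1) if match else None

mine = [first_group(pattern, u) for u in sample_urls]
theirs = [first_group(pattern_og, u) for u in sample_urls]

for url, a, b in zip(sample_urls, mine, theirs):
    print(url, "|", a, "|", b)
```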

It is hard for me to identify where my pattern might fail, so any insight would be really helpful. I suspect my pattern captures characters that the solution’s pattern excludes, but I don’t understand why that would cause a problem. Thank you in advance.

Hi @kaushalanshul29,

While your pattern works for all the links in test_urls and for the majority of the links in hn['url'], it doesn’t exclude some symbols that should not be part of a domain name, in particular : and #. Very few rows contain these symbols, but they do exist.

print(hn['url'].iloc[336])
print(hn['url'].iloc[6759])
print(hn['url'].iloc[7298])
print(hn['url'].iloc[13669])

Output:

http://readthisthing.com#
http://ftp://ftp.tcl.tk/pub/incoming/p15/RichardHipp/microoptimization/paper.html
http://dbweb.cs.uvic.ca:8080/servlet/MMPServlet?filename=quizscript.mmp
http://paradise.xxiivv.com:3000/
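As a rough sketch of the difference, the snippet below runs both patterns on those four URLs, copied in as plain strings so it works without the hn DataFrame. The helper `extract` and the third pattern, `pattern_fixed`, are my own additions rather than part of the exercise solution; `pattern_fixed` is just one possible way to tighten your pattern by also excluding ‘#’ and ‘:’.

```python
import re

pattern = r'https?://([^/?]+)'        # the pattern from the question
pattern_og = r"https?://([\w\-\.]+)"  # the pattern from the solution
pattern_fixed = r'https?://([^/?#:]+)'  # assumption: one possible tightening

# The four rows printed above, copied as literal strings.
urls = [
    "http://readthisthing.com#",
    "http://ftp://ftp.tcl.tk/pub/incoming/p15/RichardHipp/microoptimization/paper.html",
    "http://dbweb.cs.uvic.ca:8080/servlet/MMPServlet?filename=quizscript.mmp",
    "http://paradise.xxiivv.com:3000/",
]

def extract(pat, url):
    """Return the first capture group of pat in url, or None if no match."""
    match = re.search(pat, url)
    return match.group(1) if match else None

question = [extract(pattern, u) for u in urls]
solution = [extract(pattern_og, u) for u in urls]
fixed = [extract(pattern_fixed, u) for u in urls]

for q, s, f in zip(question, solution, fixed):
    print(f"{q!r:30} {s!r:25} {f!r}")
```

With the question's pattern, the trailing ‘#’ and the port numbers (‘:8080’, ‘:3000’) end up inside the captured group, while the solution's pattern stops before them. Note that for the malformed second URL every pattern here captures some form of "ftp", so no regex of this shape recovers the real host from that row.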

Thank you so much. This definitely helps.
