I used the following pattern to extract the domain from a URL:
pattern = r'https?://([^/?]+)'
Basically, it excludes ‘/’ and ‘?’ and captures the text before them.
The pattern used in the solution:
pattern_og = r"https?://([\w\-\.]+)"
It includes nothing but word characters, ‘-’ and ‘.’.
It is hard for me to identify where my pattern might fail, so any insight would be really helpful. I think my pattern might capture other characters that the solution’s pattern excludes, but I don’t understand why that would cause a problem. Thank you in advance.
While your pattern works perfectly for all the links in test_urls and for the majority of the links in hn['url'], it still fails to exclude some symbols that should not be part of a domain name, in particular symbols such as #. Very few rows contain these symbols, but they do exist.
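As a rough illustration, here is a small sketch comparing the two patterns. The sample URLs and the helper extract_domain are made up for this example and are not taken from test_urls or hn['url']:

```python
import re

pattern = r'https?://([^/?]+)'         # pattern from the question
pattern_og = r"https?://([\w\-\.]+)"   # pattern from the solution

def extract_domain(pat, url):
    """Hypothetical helper: return the first captured group, or None."""
    match = re.search(pat, url)
    return match.group(1) if match else None

# Made-up URLs, chosen only to show where the two patterns differ.
sample_urls = [
    "https://www.example.com/path?x=1",  # both patterns agree here
    "http://example.com#section",        # '#' directly after the domain
    "https://example.com:8080/index",    # ':' directly after the domain
]

for url in sample_urls:
    print(url)
    print("  question pattern:", extract_domain(pattern, url))
    print("  solution pattern:", extract_domain(pattern_og, url))
```

For the second URL the question's pattern captures example.com#section, because # is neither / nor ?, while the solution's pattern stops at example.com.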
Thank you so much. This definitely helps.