Extract Domains using regex

Screen Link:

My Code:

test_urls = pd.Series([
pattern = r"//(\w+\-?\w+?\.?\w+\.?\w+\.?\w+)" 
test_urls_clean = test_urls.str.extract(pattern, flags=re.I)

domains = hn['url'].str.extract(pattern,flags=re.I)
top_domains = domains.value_counts()

What I expected to happen:
github.com 1008
medium.com 825
www.nytimes.com 525
www.theguardian.com 248
techcrunch.com 245

What actually happened:

top_domains is longer than we expected.

I'm not sure what is going wrong. As far as I can tell, the regex pattern I created, r"//(\w+\-?\w+?\.?\w+\.?\w+\.?\w+)", gives the same output as the pattern given in the answer, r"https?://([\w\-\.]+)". Even so, the console says that top_domains is longer than expected, and it also shows an error for the domains variable.
Please help me understand the difference between the two regex patterns.
I have attached a screenshot for reference.


Here’s a hint.

Take a look at index 20011 in hn: the URL is https://www.linux-toys.com/?p=374.

What do you think the solution should capture? What does your solution capture?
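To make the comparison concrete, here is a minimal sketch that runs both patterns from this thread against that one URL using Python's built-in re module. The variable names (user_pattern, reference_pattern and so on) are only illustrative. pandas' str.extract relies on the same regex engine, so the captured text should be the same there.

```python
import re

# Pattern from the question: the optional hyphen is only allowed
# inside the first chunk of word characters.
user_pattern = r"//(\w+\-?\w+?\.?\w+\.?\w+\.?\w+)"

# Pattern from the provided answer: any run of word characters,
# hyphens and dots after the protocol.
reference_pattern = r"https?://([\w\-\.]+)"

url = "https://www.linux-toys.com/?p=374"

user_domain = re.search(user_pattern, url, flags=re.I).group(1)
reference_domain = re.search(reference_pattern, url, flags=re.I).group(1)

print(user_domain)       # www.linux
print(reference_domain)  # www.linux-toys.com
```

The first pattern only permits a hyphen near the start of the domain, so it stops at the hyphen in linux-toys. The character class in the second pattern accepts a hyphen or dot at any position.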


Hi Bruno,
Thank you for looking into it. The URL you shared for reference is not available, but I will check my regex pattern again as per your guidance.


What do you mean it’s not available? I see it:


If it isn’t there for you, there’s something else going on.


Thank you for getting back to me.
I thought you were referring me to a URL to browse, so I said it was not available because I could not open it.
Now I get it: you were referring to the URL in the dataset at the given index.
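
A side note on the "longer than expected" message, offered as a guess rather than a confirmed diagnosis. The expected output in the question lists only five rows, while value_counts() returns one row per distinct domain. Also, str.extract returns a DataFrame by default, even for a single capture group. The sketch below uses a few made-up URLs in place of hn['url'], which is not shown in this thread, to illustrate both behaviours.

```python
import re
import pandas as pd

# Made-up URLs standing in for hn['url']; not the real dataset.
urls = pd.Series([
    "https://github.com/a/b",
    "https://github.com/c/d",
    "http://medium.com/x",
    "https://www.linux-toys.com/?p=374",
    None,
])

pattern = r"https?://([\w\-\.]+)"

# Default expand=True: a DataFrame with one column per capture group.
as_frame = urls.str.extract(pattern, flags=re.I)

# expand=False: a Series when the pattern has a single capture group.
as_series = urls.str.extract(pattern, flags=re.I, expand=False)

# value_counts() keeps every distinct domain (missing values are dropped).
counts = as_series.value_counts()

# head(n) trims the result to the n most frequent domains.
top_two = counts.head(2)
print(top_two)
```

If the exercise expects exactly the five rows shown in the question, adding .head(5) after value_counts() and passing expand=False to str.extract may be worth trying.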