Regex Solution not being accepted as correct answer

Screen Link:
https://app.dataquest.io/m/369/advanced-regular-expressions/8/extracting-domains-from-urls

My Code:

import re

# pattern = r'https?://([\w-]+\.[\w-]+\.?[\w-]?)'  (please ignore this)
# Edit:
pattern = r'https?://([\w-]+\.[\w-]+\.?[\w-]*)'
test_urls_clean = test_urls.str.extract(pattern, flags=re.I)
domains = hn['url'].str.extract(pattern, flags=re.I)
top_domains = domains.value_counts().head(5)

What I expected to happen:
The solution above gives the correct results and should be accepted.

What actually happened:


No error message is shown, but the exercise cannot be submitted and passed unless the regex pattern matches the one in the solution answer.

Your code shouldn’t be giving you those results.

Here’s what I see on my end:

Hi Bruno,
I am sorry, I posted the wrong pattern. This is the one I was trying:
pattern = r’https{0,1}://([\w-]{1,}.[\w-]{1,}.{0,1}[\w-]{0,})’
It is the same as r’https?://([\w-]+.[\w-]+.?[\w-]*)’. In the topic post I mistakenly wrote ? instead of * at the end.

The pattern https{0,1}://([\w-]{1,}.[\w-]{1,}.{0,1}[\w-]{0,}) doesn’t yield those results either.


Hi @Bruno, posting the reply somehow removed the backslashes before the ‘.’ characters in the pattern, because the code was not between backticks. Apologies for the confusion. I have edited the pattern in the main post. Here it is again:

pattern = r'https?://([\w-]+\.[\w-]+\.?[\w-]*)'

Here’s a sample of the differences between Dataquest’s solution and yours:

Dataquest                Sohamdey
www.cam.ac.uk            www.cam.ac
www.independent.co.uk    www.independent.co
mint.lc.intuit.com       mint.lc.intuit
www.bbc.co.uk            www.bbc.co
www.eecs.berkeley.edu    www.eecs.berkeley
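
The truncation in the sample above can be reproduced with plain `re`, without the exercise's data. This is only a minimal sketch: the full URLs (scheme and path) are hypothetical, built around three of the domains listed above.

```python
import re

# Pattern from the post: it captures at most three dot-separated labels.
pattern = r'https?://([\w-]+\.[\w-]+\.?[\w-]*)'

# Hypothetical URLs built around domains from the sample above.
urls = [
    "http://www.cam.ac.uk/research",
    "https://mint.lc.intuit.com/",
    "http://www.eecs.berkeley.edu/Pubs",
]

# Each domain has four labels, so the last one is cut off.
extracted = [re.search(pattern, u, flags=re.I).group(1) for u in urls]
print(extracted)
```

Since the pattern spells out each label by hand, any domain with more labels than the pattern anticipates loses its tail.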

Hi Bruno, thanks a lot for pointing it out. So, if I add another ‘\.?[\w-]*’ at the end, i.e.

https?://([\w-]+\.[\w-]+\.?[\w-]*\.?[\w-]*)

it would handle these cases, but it would make the pattern more complex.
Is it better to avoid this approach completely?

OK, I found that this one still will not be matched in full: www.keith.seas.harvard.edu
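
One way around adding an optional piece for every extra label is to let a group repeat. The sketch below is one possible alternative, not necessarily Dataquest's solution; the name `extract_domain` and the URL paths are hypothetical.

```python
import re

# One label, followed by one or more ".label" repetitions.
# The non-capturing group (?:...) keeps a single capture group overall.
FLEXIBLE = re.compile(r'https?://([\w-]+(?:\.[\w-]+)+)', flags=re.I)

def extract_domain(url):
    """Return the captured domain, or None if the URL does not match."""
    match = FLEXIBLE.search(url)
    return match.group(1) if match else None

print(extract_domain("http://www.keith.seas.harvard.edu/"))
print(extract_domain("https://www.bbc.co.uk/news"))
```

Because the repetition is unbounded, the number of labels in the domain no longer has to be guessed in advance.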

Yes, I don’t think this is going in the right direction. Take a look at what happens with http://ftp://ftp.tcl.tk/pub/incoming/p15/RichardHipp/microoptimization/paper.html. Your technique will never match this.

To be fair, however, Dataquest’s solution also fails as it says the domain is ftp, when it seems it should be ftp:.
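
The two behaviours described above can be checked directly. In this sketch, `char_class` is an assumed character-class pattern of the kind that would report `ftp`; it is not quoted from Dataquest's solution.

```python
import re

url = ("http://ftp://ftp.tcl.tk/pub/incoming/p15/RichardHipp/"
       "microoptimization/paper.html")

# Pattern from the post: requires a literal dot after the first label.
fixed_labels = r'https?://([\w-]+\.[\w-]+\.?[\w-]*)'

# Assumed stand-in for a character-class approach (not Dataquest's exact pattern).
char_class = r'https?://([\w.-]+)'

m1 = re.search(fixed_labels, url, flags=re.I)
m2 = re.search(char_class, url, flags=re.I)

print(m1)           # no match: "ftp" is followed by ":" rather than "."
print(m2.group(1))  # stops at the first character outside [\w.-]
```

The first search finds nothing at all, while the second stops at the colon and captures only `ftp`.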
