369-8 Advanced regular expressions - Mission 8 Extracting Domains

Mission Link: Learn data science with Python and R projects

When the regex pattern I used did not produce the expected results, I had a look at the pattern provided in the answer. Now I am confused: the provided pattern does not account for dots, so it also returns www as a domain, which then results in the top 5 domains looking like this:

www 7239
github 1008
medium 829
blog 661
techcrunch 245

Using my pattern (r"https?://([-\w.\?]+)") would (at least in my opinion) produce a more fitting top 5 of domains extracted from the hn dataset:

github.com 1008
medium.com 825
www.nytimes.com 525
www.theguardian.com 248
techcrunch.com 245

Additionally, using my pattern it is possible to correctly extract all domains from test_urls, while using the pattern provided in the answer again gives results which (in my opinion) seem rather strange:

0 www
1 www
2 www
3 evonomics
4 github
5 phys
6 iot
7 www
8 beta
9 www
10 css-cursor

However, as I am fairly new to Python I might have got it all wrong :wink:

Any ideas?



Hi Christoph, welcome to the forums! I see what you mean! It looks like the pattern in the solution has a typo, a - character instead of a . character. I’ll tag @Sahil so the team can have a look. Thanks for letting us know!

I ran into the same problem as you.
However, I used the pattern r'https?://([^/?]+)'.
My results look like yours. I’m so confused that I can’t complete this mission.
Could the team please fix this quickly?


I encountered the same problem too

Hi @Christoph, @witsavachit.sub, @zulfaromadhoni1123,

Yes, the pattern used in the Dataquest solution is incorrect. I have logged this issue. Thank you for bringing this up.


The solution should be: r'https?://([\w.-]+)'
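
To see why the dot matters, here is a minimal sketch. The sample URLs are made up for illustration (they are not from the hn dataset), and `no_dot` is only an assumed stand-in for a pattern whose character class lacks the dot, not the actual Dataquest solution:

```python
import re

# Hypothetical sample URLs for illustration (not taken from the hn dataset).
sample_urls = [
    "https://www.example.com/some/path",
    "http://github.com/user/repo",
    "https://blog.example.co.uk:8080/post?id=1",
]

# Assumed stand-in: a character class without the dot.
no_dot = r"https?://([\w-]+)"
# Pattern suggested in this thread: the dot is part of the character class.
with_dot = r"https?://([\w.-]+)"

def extract(pattern, urls):
    """Return the first capture group for each URL, or None if nothing matches."""
    results = []
    for url in urls:
        match = re.search(pattern, url)
        results.append(match.group(1) if match else None)
    return results

print(extract(no_dot, sample_urls))    # ['www', 'github', 'blog']
print(extract(with_dot, sample_urls))  # ['www.example.com', 'github.com', 'blog.example.co.uk']
```

Without the dot in the class, matching stops at the first dot, which would explain results like www and github.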

OK. Since I am also a newbie facing the same issue as in this post, I racked my brain and came up with the following pattern:

pattern = r"(?<=/)\b([^\d]\w+[.-]?\w+[.]\w+)"

I cannot understand how the Dataquest solution is supposed to match the required format. RegExr does not match the format with the Dataquest pattern either.
Can someone answer the following for me:

  1. Why is “/” not escaped in the Dataquest pattern?
  2. Is my pattern correct? RegExr says it’s correct, and I get the domains as expected.

Thank You
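
On question 1, a quick check suggests the slash does not need escaping in Python, because Python's re module takes the pattern as a plain string and has no /.../ delimiters. Tools that use JavaScript-style regex literals may still ask for it to be escaped. A small sketch with a made-up URL:

```python
import re

url = "https://www.example.com/some/path"  # hypothetical URL for illustration

# The same pattern, once with plain slashes and once with escaped slashes.
unescaped = re.search(r"https?://([\w.-]+)", url)
escaped = re.search(r"https?:\/\/([\w.-]+)", url)

# Both variants capture the same domain in Python.
same_result = unescaped.group(1) == escaped.group(1)
print(unescaped.group(1), escaped.group(1), same_result)
```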


I also cannot pass step “8. Extracting Domains from URLs” because my domains Series doesn’t match.

Worse, I cannot identify the issue:


My pattern = r"https?://(.[^/]+.\w+)\b"

Please advise.
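
One possible cause, sketched below with a made-up URL: outside a character class, an unescaped `.` matches any character, including `/`, so the capture can run past the domain into the path. The comparison uses the `[\w.-]+` pattern suggested earlier in this thread:

```python
import re

url = "https://www.example.com/some-path/page"  # hypothetical URL for illustration

# Pattern from the post above: both dots are unescaped wildcards.
wildcard_pattern = r"https?://(.[^/]+.\w+)\b"
# Pattern suggested earlier in the thread, for comparison.
class_pattern = r"https?://([\w.-]+)"

captured_wildcard = re.search(wildcard_pattern, url).group(1)
captured_class = re.search(class_pattern, url).group(1)

print(captured_wildcard)  # www.example.com/some
print(captured_class)     # www.example.com
```

Here `[^/]+` consumes everything up to the first slash, the second `.` then matches that slash, and `\w+` picks up the first word of the path.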