369-8 Advanced regular expressions - Mission 8 Extracting Domains

Mission Link: https://app.dataquest.io/m/369/advanced-regular-expressions/8/extracting-domains-from-urls

Hi,
When my regex pattern did not produce the expected results, I had a look at the pattern provided in the answer. Now I am confused: the provided pattern does not allow for dots, so it returns www as a domain as well, which then results in the top 5 domains looking like this:

www 7239
github 1008
medium 829
blog 661
techcrunch 245

Using my pattern (r"https?://([-\w.\?]+)") would (at least in my opinion) produce a more fitting top 5 of domains extracted from the hn dataset:

github.com 1008
medium.com 825
www.nytimes.com 525
www.theguardian.com 248
techcrunch.com 245
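
For anyone who wants to reproduce this kind of tally outside the mission, here is a minimal sketch using only the standard library. The URLs in `sample_urls` are made up for illustration and are not taken from the hn dataset, so the counts will not match the figures above.

```python
import re
from collections import Counter

# Pattern from the post above: after the scheme, capture a run of word
# characters, hyphens, dots, or question marks.
pattern = r"https?://([-\w.\?]+)"

# Hypothetical URLs, for illustration only (not from the hn dataset).
sample_urls = [
    "https://github.com/user/repo",
    "http://github.com/other/project",
    "https://www.nytimes.com/2016/01/01/story.html",
    "https://medium.com/@someone/a-post",
]

domains = []
for url in sample_urls:
    match = re.search(pattern, url)
    if match:
        domains.append(match.group(1))

# Tally the captured domains and keep the most frequent ones.
top_domains = Counter(domains).most_common(2)
print(top_domains)
```

Because the dot is part of the character class, the capture runs through the whole host name instead of stopping at the first label.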

Additionally, using my pattern it is possible to correctly extract all domains from test_urls, while using the pattern provided in the answer again gives results which (in my opinion) seem rather strange:

0 www
1 www
2 www
3 evonomics
4 github
5 phys
6 iot
7 www
8 beta
9 www
10 css-cursor

However, as I am fairly new to Python I might have got it all wrong :wink:

Any ideas?

Cheers!

Hi Christoph, welcome to the forums! I see what you mean! It looks like the pattern in the solution has a typo, a - character instead of a . character. I’ll tag @Sahil so the team can have a look. Thanks for letting us know!
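
To illustrate what a missing dot does, here is a small sketch. The solution's exact pattern is not quoted in this thread, so `hyphen_only` is only an assumption about what a character class with a hyphen but no dot might look like, and the URLs are made up.

```python
import re

# ASSUMPTION: a class with a hyphen but no dot, guessed from the description
# of the typo. The real solution pattern is not quoted in this thread.
hyphen_only = r"https?://([\w\-]+)"
# The same class with the dot added.
with_dot = r"https?://([\w\-.]+)"

# Hypothetical URLs, for illustration only.
urls = [
    "https://www.example.com/page",
    "http://css-cursor.example.org/demo",
]

def first_group(pattern, url):
    """Return the first capture group, or None if the pattern does not match."""
    m = re.search(pattern, url)
    return m.group(1) if m else None

for url in urls:
    print(first_group(hyphen_only, url), "|", first_group(with_dot, url))
```

Without the dot in the class, the match stops at the first dot of the host name, which would explain captures such as www and css-cursor.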

I ran into the same problem.
I used the pattern r'https?://([^/?]+)' instead.
My results look like yours. I’m so confused that I can’t complete this mission.
Could the team please fix this quickly?

I encountered the same problem too

Hi @Christoph, @witsavachit.sub, @zulfaromadhoni1123,

Yes, the pattern used in the Dataquest solution is incorrect. I have logged this issue. Thank you for bringing this up.

Best,
Sahil

The solution should be: r'https?://([\w.-]+)'
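
The three patterns posted so far agree on plain URLs but can differ on edge cases. Here is a rough sketch comparing them. The two URLs (one with a port, one with a query string right after the host) are made up, and the labels in `patterns` are just names chosen for this example.

```python
import re

# The three patterns posted in this thread, under made-up labels.
patterns = {
    "first post": r"https?://([-\w.\?]+)",
    "negated class": r"https?://([^/?]+)",
    "suggested fix": r"https?://([\w.-]+)",
}

# Hypothetical edge cases: a port, and a query string directly after the host.
urls = ["http://example.com:8080/path", "https://example.org?ref=hn"]

results = {
    name: [re.search(p, u).group(1) for u in urls]
    for name, p in patterns.items()
}

for name, captured in results.items():
    print(f"{name}: {captured}")
```

On these inputs the negated class keeps the port, the first pattern runs on into the query string because `?` is in its class, and `[\w.-]+` stops at the end of the host name in both cases.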

OK. Since I am also a newbie and facing the same issue as in this post, I did my best to come up with the following pattern:

pattern = r"(?<=/)\b([^\d]\w+[.-]?\w+[.]\w+)"

I cannot understand how the Dataquest solution would match the required format. Also, RegExr does not match the format with the Dataquest pattern.
Can someone answer the following:

  1. Why is “/” not escaped in the Dataquest pattern?
  2. Is my pattern correct? RegExr says it’s correct and I get the domains as expected.
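
On the first question: in Python's `re` module a forward slash has no special meaning, so it does not need escaping. Escaping is only required where the pattern itself is written between slashes, as in JavaScript regex literals. On the second question, the sketch below checks both points using made-up URLs. It is a quick experiment, not a full test of the pattern.

```python
import re

url = "https://www.example.com/some/page"  # hypothetical URL

# "/" is an ordinary character in Python's re module, so both forms match alike.
unescaped = re.search(r"https?://([\w.-]+)", url).group(1)
escaped = re.search(r"https?:\/\/([\w.-]+)", url).group(1)

# The pattern posted above, tried on a three-label and a four-label host name.
posted = r"(?<=/)\b([^\d]\w+[.-]?\w+[.]\w+)"
three_labels = re.search(posted, url).group(1)
four_labels = re.search(posted, "http://www.bbc.co.uk/news").group(1)

print(unescaped, escaped)
print(three_labels, four_labels)
```

The posted pattern is built around a fixed number of dot-separated parts, so a longer host name such as www.bbc.co.uk is cut short. A character class like `[\w.-]+` has no such limit.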

Thank You

Hello!

I also cannot pass step “8. Extracting Domains from URLs” because my domains Series doesn’t match.

Worse, I cannot identify the issue:

My pattern = r"https?://(.[^/]+.\w+)\b"
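
One possible cause, sketched on a made-up URL: an unescaped `.` matches any character, including `/`, so the capture can spill over from the host name into the path. The comparison pattern `[\w.-]+` is the one suggested earlier in this thread.

```python
import re

posted = r"https?://(.[^/]+.\w+)\b"
suggested = r"https?://([\w.-]+)"  # pattern suggested earlier in this thread

# Hypothetical URL with a path after the host name.
url = "https://www.example.com/some-page/index.html"

# [^/]+ consumes the host up to the first "/", the second unescaped dot then
# matches that "/", and \w+ goes on to take the first word of the path.
captured_posted = re.search(posted, url).group(1)
captured_suggested = re.search(suggested, url).group(1)

print(captured_posted)
print(captured_suggested)
```

If the mission's check compares against bare host names, captures that include part of the path would not match.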

Please advise.

Maxim.