Screen Link:
https://app.dataquest.io/m/369/advanced-regular-expressions/8/extracting-domains-from-urls
My Code:
import re
import pandas as pd

test_urls = pd.Series([
'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
'http://www.interactivedynamicvideo.com/',
'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
'http://evonomics.com/advertising-cannot-maintain-internet-heres-solution/',
'HTTPS://github.com/keppel/pinn',
'Http://phys.org/news/2015-09-scale-solar-youve.html',
'https://iot.seeed.cc',
'http://www.bfilipek.com/2016/04/custom-deleters-for-c-smart-pointers.html',
'http://beta.crowdfireapp.com/?beta=agnipath',
'https://www.valid.ly?param',
'http://css-cursor.techstream.org'
])
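# Capture what follows "//": word characters, one optional hyphen, and up to three optional dots
pattern = r"//(\w+\-?\w+?\.?\w+\.?\w+\.?\w+)"
test_urls_clean = test_urls.str.extract(pattern, flags=re.I)
test_urls_clean
# hn is the DataFrame provided by the exercise screen (not defined in this snippet)
domains = hn['url'].str.extract(pattern, flags=re.I)
top_domains = domains.value_counts()
top_domains.head(5)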
What I expected to happen:
github.com 1008
medium.com 825
www.nytimes.com 525
www.theguardian.com 248
techcrunch.com 245
What actually happened:
top_domains is longer than we expected.
I'm not sure what is going wrong. The regex pattern r"//(\w+\-?\w+?\.?\w+\.?\w+\.?\w+)"
that I created gives the same output as the regex pattern r"https?://([\w\-\.]+)"
given in the answer. Even so, the console says that top_domains is longer than expected, and it also shows an error for the domains variable.
Please help me understand the difference between the two regex patterns.
I have attached a screenshot for reference.
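
One way to see the difference is to run both patterns on URLs that do not fit the fixed shape of the first pattern. The sketch below uses only the standard `re` module. The `extract_domain` helper and the last three sample URLs are made up for illustration and are not taken from the exercise or its dataset.

```python
import re

# The two patterns from the question; the second is the one given in the answer.
MY_PATTERN = r"//(\w+\-?\w+?\.?\w+\.?\w+\.?\w+)"
ANSWER_PATTERN = r"https?://([\w\-\.]+)"


def extract_domain(pattern, url):
    """Return the first captured group, or None when the pattern does not match."""
    match = re.search(pattern, url, flags=re.I)
    return match.group(1) if match else None


# Hypothetical URLs (except the first) chosen to stress the fixed-shape pattern.
sample_urls = [
    "http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0",  # from test_urls
    "http://www.my-site.com/page",       # hyphen after the first dot
    "http://t.co/abc",                   # first label is a single character
    "http://a1.b2.c3.d4.example.com/x",  # more than three dots
]

for url in sample_urls:
    mine = extract_domain(MY_PATTERN, url)
    answer = extract_domain(ANSWER_PATTERN, url)
    print(f"{url}\n  mine:   {mine}\n  answer: {answer}")
```

The first pattern allows a hyphen only in one position and at most three dots, and it needs at least five word characters, so it truncates or misses domains that the character class `[\w\-\.]+` captures whole. All of the URLs in test_urls happen to fit that shape, which may be why the two outputs look identical there while the counts on the full data differ. Separately, `Series.str.extract` returns a DataFrame by default; passing `expand=False` with a single capture group returns a Series instead, and that could be related to the error on domains, though this is a guess without seeing the screenshot.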