Screen page: https://app.dataquest.io/m/369/advanced-regular-expressions/8/extracting-domains-from-urls
Here is my code:
import pandas as pd

test_urls = pd.Series([
'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
'http://www.interactivedynamicvideo.com/',
'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
'http://evonomics.com/advertising-cannot-maintain-internet-heres-solution/',
'HTTPS://github.com/keppel/pinn',
'Http://phys.org/news/2015-09-scale-solar-youve.html',
'https://iot.seeed.cc',
'http://www.bfilipek.com/2016/04/custom-deleters-for-c-smart-pointers.html',
'http://beta.crowdfireapp.com/?beta=agnipath',
'https://www.valid.ly?param',
'http://css-cursor.techstream.org'
])
pattern = r'(?<=//)(.*\.\w{2,4})'
test_urls_clean = test_urls.str.extract(pattern)
print(test_urls_clean)
My pattern picks up all of the URLs correctly except for the 3rd and 7th URLs. What I would like to do is add something like (?!/) or [^/]. Basically, is there a way to write this using my syntax above such that the last part, \w{2,4}, DOES NOT contain “/”?
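To make the [^/] idea concrete, here is a rough sketch of what I have in mind. It uses plain re on a handful of the URLs above so it runs on its own. The names sample_urls and candidate_pattern are made up for this example, and I am only assuming this is one workable variation, not necessarily what the lesson expects. The only change from my pattern is that .* becomes [^/]+, so the match cannot cross a slash.

```python
import re

# A few of the URLs from the Series above, including ones with long paths.
sample_urls = [
    'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
    'Http://phys.org/news/2015-09-scale-solar-youve.html',
    'http://www.bfilipek.com/2016/04/custom-deleters-for-c-smart-pointers.html',
    'https://www.valid.ly?param',
    'https://iot.seeed.cc',
]

# [^/]+ matches a run of characters other than "/", so the capture
# has to stop before the first slash that follows the host name.
candidate_pattern = r'(?<=//)([^/]+\.\w{2,4})'

domains = [re.search(candidate_pattern, url).group(1) for url in sample_urls]
print(domains)
```

The same pattern string could be passed to test_urls.str.extract in place of the original. One caveat I can see: [^/] still allows characters such as "?", so a query string containing a dot directly after the host could end up in the match.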
Thanks for the help!
David