369/8 Extracting Domains from URLs - what's wrong with my Regex here in using Lookbehind

Hi, I know my question is similar to other threads.

I tried this

(?<=//)[\w+.]+

in https://regexr.com, where it seems to work fine for extracting the domains from

https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429
http://www.interactivedynamicvideo.com/
http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0
http://evonomics.com/advertising-cannot-maintain-internet-heres-solution/
HTTPS://github.com/keppel/pinn
Http://phys.org/news/2015-09-scale-solar-youve.html
https://iot.seeed.cc
http://www.bfilipek.com/2016/04/custom-deleters-for-c-smart-pointers.html
http://beta.crowdfireapp.com/?beta=agnipath
https://www.valid.ly?param

However, when I run it in a Jupyter notebook, I get


ValueError Traceback (most recent call last)
in
----> 1 test_urls_clean = test_urls.str.extract(pattern_url, expand=False)

~/anaconda/lib/python3.6/site-packages/pandas/core/strings.py in extract(self, pat, flags, expand)
2765 @copy(str_extract)
2766 def extract(self, pat, flags=0, expand=True):
-> 2767 return str_extract(self, pat, flags=flags, expand=expand)
2768
2769 @copy(str_extractall)

~/anaconda/lib/python3.6/site-packages/pandas/core/strings.py in str_extract(arr, pat, flags, expand)
848 return _str_extract_frame(arr._orig, pat, flags=flags)
849 else:
--> 850 result, name = _str_extract_noexpand(arr._parent, pat, flags=flags)
851 return arr._wrap_result(result, name=name, expand=expand)
852

~/anaconda/lib/python3.6/site-packages/pandas/core/strings.py in _str_extract_noexpand(arr, pat, flags)
711
712 regex = re.compile(pat, flags=flags)
--> 713 groups_or_na = _groups_or_na_fun(regex)
714
715 if regex.groups == 1:

~/anaconda/lib/python3.6/site-packages/pandas/core/strings.py in _groups_or_na_fun(regex)
686 """Used in both extract_noexpand and extract_frame"""
687 if regex.groups == 0:
--> 688 raise ValueError("pattern contains no capture groups")
689 empty_row = [np.nan] * regex.groups
690

ValueError: pattern contains no capture groups

Can someone walk me through what is wrong with my regex pattern here? :grinning:

Hi @willx

Looking at the error, here is my reading of it:

The ValueError says that the pattern does not contain any capture group, and the extract method (called on the first line of the traceback) needs at least one capture group to know what to return.

import re

pattern = r'https?://([\w\.]+)'  # the capture group holds the domain part of the URL
test_urls.str.extract(pattern, flags=re.I, expand=False)  # re.I ignores case, so Https and HTTPS also match

The extract method returns the part of each string that matches the capture group (the expression inside the parentheses) of the pattern.

Here, the capture group is ([\w\.]+).

Hope this helps :slightly_smiling_face:.
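The traceback shows pandas checking regex.groups before extracting anything. As a rough sketch using only the standard re module (the variable names no_group and with_group are made up for illustration), the same check can be reproduced on both patterns from this thread:

```python
import re

# Pattern from the original question: a lookbehind, but no parentheses
# around the part that should be returned.
no_group = re.compile(r"(?<=//)[\w+.]+")

# Pattern from the answer above: the domain part sits in a capture group.
with_group = re.compile(r"https?://([\w\.]+)", flags=re.I)

# A lookbehind is an assertion, not a capture group, so the first
# pattern reports zero groups. This is the value pandas tests.
print(no_group.groups)    # 0
print(with_group.groups)  # 1

# With a capture group, group(1) holds just the domain.
url = "HTTPS://github.com/keppel/pinn"
domain = with_group.search(url).group(1)
print(domain)  # github.com
```

A pattern that reports 0 groups here is the kind that triggers "pattern contains no capture groups" in the traceback above.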

Hi @Prem, I know what the answer is. What I am trying to find out is why we can't just capture with (?<=//) instead of having to match https?://

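(?<=abc)xyz --> matches xyz only when it is preceded by 'abc'.

Use pattern = r'((?<=//)[\w\.]+)'

The lookbehind (?<=//) is only an assertion, not a capture group, so on its own it gives extract nothing to return. Once the whole expression is wrapped in parentheses, extract has a capture group and returns [\w\.]+ only where it is preceded by //.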
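To see what the wrapped lookbehind pattern does, here is a small sketch with the standard re module on three of the URLs from the question (the names urls and results are just for illustration). It compares the whole match, group(0), with the capture group, group(1):

```python
import re

# Lookbehind pattern wrapped in parentheses, as suggested above.
pattern = re.compile(r"((?<=//)[\w\.]+)")

urls = [
    "https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429",
    "Http://phys.org/news/2015-09-scale-solar-youve.html",
    "https://www.valid.ly?param",
]

results = []
for url in urls:
    m = pattern.search(url)
    # group(0) is the whole match, group(1) is the capture group.
    # The lookbehind consumes no characters, so both hold the same text.
    results.append((m.group(0), m.group(1)))

for whole, captured in results:
    print(whole, captured)
```

Because the whole match and the capture group are identical here, a tool that displays whole matches would presumably show the same text with or without the parentheses, while an API that returns only capture groups needs them.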

Yes, I know that, because I tried it on https://regexr.com and it worked there, but it doesn't work in the Jupyter notebook. What I am trying to find out is where the difference is.

Hey @willx

I tried it in Jupyter Notebook on my local computer and it worked perfectly. You can see it in the image below.

Hi, so sorry, I was on a hiatus for the past 2 weeks. Happy New Year, by the way :slight_smile:

Yeah, it was a mistake on my part: I left out the parentheses in my earlier pattern.

I did this

(?<=//)[\w+.]+

What you have shown me is what I was missing here:

pattern = r"((?<=//)[\w+.]+)"

Thanks, it works now. Boy, what a bummer. Thank you so much.
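One small side note on the final pattern: inside square brackets, + is a literal plus sign rather than a quantifier, so [\w+.] also accepts + characters. A quick sketch (the URL below is hypothetical and only there to show the difference):

```python
import re

with_plus = re.compile(r"((?<=//)[\w+.]+)")     # pattern from this thread
without_plus = re.compile(r"((?<=//)[\w.]+)")   # same pattern without the + in the class

# Hypothetical URL containing a plus sign, not one of the thread's URLs.
url = "http://a+b.example.com/page"

plus_result = with_plus.search(url).group(1)
no_plus_result = without_plus.search(url).group(1)

print(plus_result)     # a+b.example.com
print(no_plus_result)  # a
```

None of the URLs listed in the question contain a + before the first slash, so both versions give the same result on them.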

It happens sometimes. By Indian Standard Time we still have 4 hours until the new year, so for you, Happy New Year in advance :smile:.