What I expected to happen:
The correct answer given is: pattern = r"(https?)://([\w.\-]+)/?(.*)"
But why, when I add \b at the end (to mark the end of the string), does it say my output is wrong?
What actually happened:
Value of url_parts is not what we expected.
Although it’s tricky to spot the difference, there is a difference in output. Here is the code I used to figure out the difference between using
\b and not using it:
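import re
import pandas as pd

# test_urls is assumed to be the pandas Series of URL strings from the exercise.
pattern1 = r'(https?)://([\w.\-]+)/?(.*)'    # without the word boundary
pattern2 = r'(https?)://([\w.\-]+)/?(.*)\b'  # with a trailing word boundary
test_url_parts1 = test_urls.str.extract(pattern1, flags=re.I)
test_url_parts2 = test_urls.str.extract(pattern2, flags=re.I)
# An outer merge on all columns, with an indicator, flags rows found in only one result.
df_diff = pd.merge(test_url_parts1, test_url_parts2, how='outer', indicator='Exist')
df_diff = df_diff.loc[df_diff['Exist'] != 'both']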
Which produced the following result:
       0              1                                                     2       Exist
3   http  evonomics.com  advertising-cannot-maintain-internet-heres-solution/   left_only
10  http  evonomics.com   advertising-cannot-maintain-internet-heres-solution  right_only
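Now we can see the difference! \b is a word boundary, not an end-of-string anchor (that would be $).
It matches only at a position with a word character on one side and a non-word character, or the start or end of the string, on the other.
When a URL ends in /, the greedy (.*) first consumes everything, \b fails after the /, and the regex backtracks one character so that \b can match between the last letter and the /.
The trailing / is part of the page path, so dropping it makes the output differ from the expected answer.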
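
To check this on a single string without pandas, here is a minimal sketch using only the re module. The URL is an assumption: it is reassembled from the protocol, domain and path shown in the output row above, and the variable names are made up for illustration. The third pattern swaps \b for $, which is the anchor that actually means "end of string".

```python
import re

# Assumed URL, rebuilt from the output row above; it ends in a non-word character ("/").
url = "http://evonomics.com/advertising-cannot-maintain-internet-heres-solution/"

without_boundary = r"(https?)://([\w.\-]+)/?(.*)"
with_boundary = r"(https?)://([\w.\-]+)/?(.*)\b"
with_end_anchor = r"(https?)://([\w.\-]+)/?(.*)$"

# Group 3 is the page path captured by (.*).
path_plain = re.search(without_boundary, url, flags=re.I).group(3)
path_boundary = re.search(with_boundary, url, flags=re.I).group(3)
path_anchor = re.search(with_end_anchor, url, flags=re.I).group(3)

print(repr(path_plain))     # keeps the trailing "/"
print(repr(path_boundary))  # trailing "/" is dropped so that \b can match
print(repr(path_anchor))    # keeps the trailing "/"
```

If the goal is only to say "match through the end of the string", neither \b nor $ is needed here, because the greedy (.*) already runs to the end of the line.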