369-8 Extracting Domains from URLs

Hey everyone,
I tried to solve a problem from Advanced Regular Expressions, step 8. I’ve written down my code, but the checker says that the variable domains is wrong, despite the fact that the top_domains I wrote is correct. I do not know what I did wrong, so I need some help.

The following is my code.

import re
import pandas as pd

test_urls = pd.Series([
    'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
    'http://www.interactivedynamicvideo.com/',
    'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
    'http://evonomics.com/advertising-cannot-maintain-internet-heres-solution/',
    'HTTPS://github.com/keppel/pinn',
    'Http://phys.org/news/2015-09-scale-solar-youve.html',
    'https://iot.seeed.cc',
    'http://www.bfilipek.com/2016/04/custom-deleters-for-c-smart-pointers.html',
    'http://beta.crowdfireapp.com/?beta=agnipath',
    'https://www.valid.ly'
])
pattern = r"https?://(\w+.\w+[.\w]+)"
test_urls_clean = test_urls.str.extract(pattern, flags=re.I)

domains = hn[‘url’].str.extract(pattern, flags = re.I)
top_domains = domains.value_counts().head(20)

source: https://app.dataquest.io/m/369/advanced-regular-expressions/8/extracting-domains-from-urls

This code gives an error.

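The quote characters are an issue:

‘url’

These are curly (typographic) quotes, which Python does not accept as string delimiters. Use straight double " or single ' quotes: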

domains = hn["url"].str.extract(pattern, flags = re.I)
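
If you want to see the quote problem on its own, here is a small sketch that only uses the built-in compile() function. The helper name compiles and the two snippet strings are made up for illustration; hn is never evaluated, the snippets are only parsed.

```python
# Two one-line snippets: one with curly quotes (as often pasted from a web page),
# one with plain ASCII quotes.
curly = "domains = hn[\u2018url\u2019]"
straight = 'domains = hn["url"]'

def compiles(source):
    """Return True if `source` parses as Python, False on a SyntaxError."""
    try:
        compile(source, "<snippet>", "exec")
        return True
    except SyntaxError:
        return False

print(compiles(curly))     # False
print(compiles(straight))  # True
```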

Your pattern is incorrect.

\w+ matches 1 or more word characters

  • \w matches any word character (equal to [a-zA-Z0-9_])
  • + is a quantifier: matches between one and unlimited times, as many times as possible, giving back as needed (greedy)

. matches any character (except for line terminators)

[.\w]+ matches one or more characters, each taken from the list below

  • + Quantifier: matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
  • . matches the character . literally (case sensitive)
  • \w matches any word character (equal to [a-zA-Z0-9_])
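
A quick way to check these token descriptions is to try them in isolation with Python's re module. This is only a sketch; the variable names and the sample strings ("/", ".", "my-site") are mine, not from the exercise.

```python
import re

# Outside a character class, "." is a wildcard, so it also matches "/".
dot_matches_slash = bool(re.fullmatch(r".", "/"))

# Inside a character class, "." is a literal dot, so "/" is rejected.
class_matches_slash = bool(re.fullmatch(r"[.\w]", "/"))
class_matches_dot = bool(re.fullmatch(r"[.\w]", "."))

# \w+ is greedy but stops at the first non-word character (here "-").
word_run = re.match(r"\w+", "my-site").group()

print(dot_matches_slash)    # True
print(class_matches_slash)  # False
print(class_matches_dot)    # True
print(word_run)             # my
```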

Your URL pattern breaks down as follows:

  1. \w+ := 1 or more word characters
  2. . := any single character
  3. \w+ := 1 or more word characters
  4. [.\w]+ := 1 or more characters, each either a literal dot or a word character (inside square brackets, . is no longer a wildcard)

The 2nd part of the regex may cause a problem because an unescaped . accepts any character, so it can also match a - or a forward slash /, which we do not want. To match an actual dot rather than any single character, escape it as \. .

The 3rd part of the regex may cause a problem because it demands another run of word characters, and a domain name might not have one at that position.

The 4th part of the regex may cause a problem because it requires at least one more character on top of the first three parts, so the pattern forces a fixed shape on the domain instead of accepting whatever mix of words and dots is there.
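
To make the effect of the unescaped dot concrete, here is a rough sketch using plain re. The two URLs are hypothetical examples I picked to trigger the behaviour, and first_group is just a helper name for this demo.

```python
import re

original = r"https?://(\w+.\w+[.\w]+)"

def first_group(pattern, url):
    """Return capture group 1 of the first match, or None if there is no match."""
    match = re.search(pattern, url, flags=re.I)
    return match.group(1) if match else None

# The unescaped "." consumes the "/" so part of the path ends up in the "domain".
with_slash = first_group(original, "http://localhost/docs")

# Here the unescaped "." consumes the "-" between "my" and "site".
with_hyphen = first_group(original, "http://my-site.example.com/page")

print(with_slash)   # localhost/docs
print(with_hyphen)  # my-site.example.com
```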

To fix it, use the following:

pattern = r"https?://([\w\.]+)"

The domain name is then any run of word characters and/or literal dots \. .
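
As a sanity check, the suggested pattern can be run over a few of the test URLs from the question. This sketch uses plain re and a list instead of a pandas Series so it runs without any data; extract_domain is a hypothetical helper name.

```python
import re

pattern = r"https?://([\w\.]+)"

test_urls = [
    "https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429",
    "HTTPS://github.com/keppel/pinn",
    "Http://phys.org/news/2015-09-scale-solar-youve.html",
    "https://iot.seeed.cc",
    "http://beta.crowdfireapp.com/?beta=agnipath",
]

def extract_domain(url):
    """Return the captured domain, or None when the URL does not match."""
    match = re.search(pattern, url, flags=re.I)  # re.I handles HTTPS:// and Http://
    return match.group(1) if match else None

domains = [extract_domain(url) for url in test_urls]
for domain in domains:
    print(domain)
# www.amazon.com
# github.com
# phys.org
# iot.seeed.cc
# beta.crowdfireapp.com
```

With pandas you would pass the same pattern to Series.str.extract. Keep in mind that str.extract returns a DataFrame by default (expand=True), even for a single capture group, and passing expand=False gives a Series instead.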
