8. Extracting Domains From URLs: Regular Expressions

Screen Link:
Advanced Regular Expressions In Python — Extracting Domains From URLs | Dataquest

My Code:

import re

# test_urls and hn are assumed to be provided by the exercise environment
pattern = r"(\w+\-?\w+\.?\w+\.\w+)"
test_urls_clean = test_urls.str.extract(pattern, expand=False)
domains = hn['url'].str.extract(pattern, flags=re.I, expand=False)
top_domains = domains.value_counts().head()

What I expected to happen:
Everything to pass and work.

What actually happened:
There is no error message, but light blue text indicates what the answer for ‘domains’ should be. However, my top_domains and test_urls_clean are both right. So what is the error?

Here’s the error image:
error

The last question is: Use Series.value_counts() to build a frequency table of the domains in domains, limiting the frequency table to just the top 5. Assign the result to top_domains.

The answer should be for the top 5 domains.

I noticed that you didn’t include “https” in your pattern. I was inclined to do the same, but the DQ answer suggests that we include it. I am confused as to why we include http in our pattern and, if we do, why the answer still excludes it from the frequency table. Any thoughts?

The domain immediately follows the protocol. By including the protocol in the regular expression, we greatly reduce the work required in constructing our capture group. Otherwise our capture group has to be able to explain when we don’t want to match details in the page path while still matching details in the domain.

By contrast, the protocol follows a highly predictable format whose variations are matters of case and a single, pre-determined letter. If a URL follows the format [Protocol][Domain][Page Path] then it’s very easy to explain what the domain is because it will always follow the protocol. Our expression reduces to explaining [this is what a protocol looks like][capture all this stuff following the protocol].
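As a rough sketch of that idea, the snippet below describes the protocol and then captures what follows it. The pattern and the helper name extract_domain are assumptions made for illustration, not necessarily the official DQ answer. Because only the parenthesised group is returned, the protocol helps locate the domain without ending up in the result.

```python
import re

# Hypothetical pattern: match the protocol, then capture every character
# that can belong to a domain (word characters, hyphens, dots).
# The capture stops at the first "/" because "/" is not in the character class.
PROTOCOL_PATTERN = r"https?://([\w\-\.]+)"


def extract_domain(url):
    """Return the text captured after the protocol, or None if no protocol is found."""
    match = re.search(PROTOCOL_PATTERN, url, flags=re.I)
    return match.group(1) if match else None


sample = "http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0"
print(extract_domain(sample))  # www.nytimes.com
```

The same pattern could be passed to Series.str.extract with flags=re.I, as in the original post.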

With the OP’s approach, he tries to define the domain as a sequence of characters that strictly obeys some format of hyphens and periods. His intuition that a single required period (to represent a .com, for example) is enough fails to distinguish between domains and page paths, as we see from this URL:

http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0

“07stein.html” is not the domain but his pattern matched it. However, had we simply said the match needs to follow the highly predictable “http://” protocol, then it would have been immediately eliminated as a match for not following a protocol.
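To see this for yourself, here is a small check using re.findall on the URL above. OP_PATTERN is the pattern from the original post. PROTOCOL_PATTERN is a hypothetical protocol-anchored alternative, assumed here for comparison.

```python
import re

# The pattern from the original post.
OP_PATTERN = r"(\w+\-?\w+\.?\w+\.\w+)"
# Hypothetical protocol-anchored alternative (an assumption for comparison).
PROTOCOL_PATTERN = r"https?://([\w\-\.]+)"

url = "http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0"

# findall returns every non-overlapping match, not just the first one.
op_matches = re.findall(OP_PATTERN, url, flags=re.I)
anchored_matches = re.findall(PROTOCOL_PATTERN, url, flags=re.I)

print(op_matches)        # ['www.nytimes.com', '07stein.html']
print(anchored_matches)  # ['www.nytimes.com']
```

Note that Series.str.extract only keeps the first match in each string, so this particular URL still comes out as www.nytimes.com. The stray second match shows that the pattern is not really describing what a domain is.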

One final note: not all domains are going to match his pattern, either. His pattern essentially says “sometimes there’s one hyphen” when in reality there could be two hyphens. OlutokiJohn’s response is incorrect: the OP’s code correctly extracted the top 5 domains. The error is in extracting the domain for the entire dataset.
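The hyphen point can be checked with a made-up URL. The address below is hypothetical and not taken from the dataset, and PROTOCOL_PATTERN is again an assumed alternative rather than the official answer.

```python
import re

# The pattern from the original post.
OP_PATTERN = r"(\w+\-?\w+\.?\w+\.\w+)"
# Hypothetical protocol-anchored alternative (an assumption for comparison).
PROTOCOL_PATTERN = r"https?://([\w\-\.]+)"

# Hypothetical URL with two hyphens in the domain, invented for illustration.
url = "http://my-cool-site.com/page"

# re.search returns the first match, which mirrors what Series.str.extract keeps.
op_domain = re.search(OP_PATTERN, url, flags=re.I).group(1)
anchored_domain = re.search(PROTOCOL_PATTERN, url, flags=re.I).group(1)

print(op_domain)        # cool-site.com
print(anchored_domain)  # my-cool-site.com
```

The original pattern allows only one optional hyphen, so it silently drops the leading “my-” instead of failing outright. That kind of truncation would not show up in the top 5 but would make the full domains series differ from the expected one.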
