Errors in Advanced Regular Expressions Exercise 8 - Extracting Domains from URLs

There seems to be a mistake in the answer for the domains variable in this exercise.
The answer for the variable domains has 20099 entries:
Series (<class ‘pandas.core.series.Series’>)

But a check on the dataframe with hn.info() shows:
RangeIndex: 20099 entries, 0 to 20098
Data columns (total 7 columns):
id 20099 non-null int64
title 20099 non-null object
url 17659 non-null object
num_points 20099 non-null int64
num_comments 20099 non-null int64
author 20099 non-null object
created_at 20099 non-null object
dtypes: int64(3), object(4)
memory usage: 1.1+ MB
None
There are only 17659 non-null URLs, so there seems to be a mistake in the answer provided.

This is because there are 2440 null values in the url column. When you use hn.info(), you’re only seeing the count of non-null objects in each column.

The output of the following code will confirm this:

domains = hn['url'].str.extract(pattern)
null_count = domains.isnull().sum()
print(null_count)

So should I rewrite the regex pattern to cater for those null values in the url column?

There isn’t any need to re-write the pattern because there’s simply nothing to retrieve from the null values! :joy:
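To make the point concrete, here is a minimal sketch using a made-up three-row Series in place of hn['url'] (the sample URLs and the pattern are assumptions for illustration, not the exercise data). It suggests that str.extract keeps every row and simply returns NaN where the input is null, so the length stays the same while the non-null count drops:

```python
import pandas as pd

# Hypothetical stand-in for hn['url']: two URLs and one missing value.
urls = pd.Series(["https://www.example.com/page", None, "http://example.org"])
pattern = r"https?://([\w.]+)"  # illustrative pattern only

# expand=False returns a Series when the pattern has a single capture group.
domains = urls.str.extract(pattern, expand=False)

total_rows = len(domains)            # every row is kept, including the null one
non_null = domains.notnull().sum()   # the figure info() reports as "non-null"
null_count = domains.isnull().sum()  # rows with nothing to extract

print(total_rows, non_null, null_count)
```

On the real dataset the same three numbers should correspond to 20099, 17659 and 2440.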

Hope this helps!

So here is the issue: the answer expects to see 20099, but only 17659 of the URLs are not null, so it is not giving me a completion tick.
The rest are all fine.

Is that the case for you even when you copy-paste the answer provided in the solutions? It seems to work fine for me.

If you’re using your own code, can you copy-paste it over here?

I am using my own code. Here it is:

pattern = r'http[s]?\:\/\/.*?([^\/]+)'
test_urls_clean = test_urls.str.extract(pattern, flags=re.I, expand=False)
domains = hn['url'].str.extract(pattern, flags=re.I)
top_domains = domains.value_counts().head(20)

It looks like the pattern you wrote might not be entirely correct. Compare the test_urls_clean variable returned by your code to the one that the answer expects:

The 10th url has a “?param” at the end that it shouldn’t have.
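A rough sketch of why that happens, using two made-up URLs rather than the exercise's test_urls: the capture group [^\/]+ only stops at a slash, so a query string that follows the domain directly gets captured too. The second pattern is just one possible adjustment (an assumption, not the official answer) that also stops at "?" and "#":

```python
import re

# Hypothetical sample URLs (not the exercise's test_urls).
sample_urls = [
    "http://www.example.com/path/page.html",
    "https://www.example.com?param",
]

pattern_posted = r'http[s]?\:\/\/.*?([^\/]+)'  # pattern from the post above
pattern_tighter = r'https?://([^/?#]+)'        # assumption: also stop at '?' and '#'

def first_group(pattern, url):
    """Return the first capture group, or None if the pattern does not match."""
    match = re.search(pattern, url, flags=re.I)
    return match.group(1) if match else None

posted = [first_group(pattern_posted, u) for u in sample_urls]
tighter = [first_group(pattern_tighter, u) for u in sample_urls]

print(posted)   # ['www.example.com', 'www.example.com?param']
print(tighter)  # ['www.example.com', 'www.example.com']
```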

Seems like there was a copy-and-paste error in my last post.

The screenshot shows the pattern that I used, but as you can see, the domains variable is deemed wrong.

@vincentmkh Sorry for the late response! It’s possible that there might be a domain name inside domains that isn’t caught by the test_urls_clean variable.

As for copy-pasting, remember that you can make code fully readable and copy-pastable by enclosing it in 3 backticks (```). So ```abc```, enclosed in 3 backticks, will appear as a code block: abc

In this case I already edited your earlier post and did that for you so I could copy-paste your exact code. You can check the formatting I added to see how I did it.

As to your code, try the following:

pattern = r"https?://([\w\.]+)"
test_urls_clean = test_urls.str.extract(pattern, flags=re.I)
domains = hn['url'].str.extract(pattern, flags=re.I)
top_domains = domains.value_counts().head(20)

pattern = r'http[s]?\:\/\/.*?([^\/]+)'
test_urls_clean = test_urls.str.extract(pattern, flags=re.I, expand=False)
domains2 = hn['url'].str.extract(pattern, flags=re.I)
top_domains2 = domains2.value_counts().head(20)

print(pd.concat([domains,domains2]).drop_duplicates(keep=False))
print('\n')
print("No. of different rows:", len(pd.concat([domains,domains2]).drop_duplicates(keep=False)))

This will highlight to you which rows you’re getting differently from the model answer, so you’re better able to individually trouble-shoot them and figure out why your written pattern didn’t work!
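One caveat with the concat approach: drop_duplicates compares values, not row positions, so a mismatch can be hidden if the same value also appears elsewhere in either Series. A position-by-position comparison is sketched below as an alternative. The two Series are made-up stand-ins for domains and domains2 (not the exercise data), and the sketch assumes both share the same index:

```python
import pandas as pd

# Hypothetical results of two different patterns on the same four rows.
domains = pd.Series(["www.example.com", "css-cursor.example.org", None, "example.net"])
domains2 = pd.Series(["www.example.com", "css", None, "example.net"])

# Treat two missing values as equal, then keep the rows where the results differ.
both_null = domains.isnull() & domains2.isnull()
differs = (domains != domains2) & ~both_null

diff = pd.DataFrame({"domains": domains[differs], "domains2": domains2[differs]})
print(diff)
print("No. of different rows:", len(diff))
```

The index of diff can then be used to look up the original URL for each mismatching row, for example with hn.loc[diff.index, 'url'].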

Found actual thread and also posted there: 369-8 Advanced regular expressions - Mission 8 Extracting Domains

Hello!

I also cannot pass step “8. Extracting Domains from URLs” because the domains Series doesn’t match.

And I cannot use the answer provided, as its pattern is wrong: pattern = r"https?://([\w.]+)" extracts “css” from “http://css-cursor.techstream.org”.

My pattern = r"https?://(.[^/]+.\w+)\b"

Please advise.

Maxim.
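The hyphen issue described in the post above can be reproduced with plain re, since \w does not match "-". The second pattern below is only one possible tweak (an assumption, not the official answer, and not verified against the exercise checker): it adds "-" to the character class.

```python
import re

url = "http://css-cursor.techstream.org"  # example quoted in the post above

answer_pattern = r"https?://([\w.]+)"   # pattern quoted in the post above
hyphen_pattern = r"https?://([\w.-]+)"  # assumption: add '-' to the character class

# \w matches letters, digits and underscore only, so the match stops at the hyphen.
short = re.search(answer_pattern, url, flags=re.I).group(1)
full = re.search(hyphen_pattern, url, flags=re.I).group(1)

print(short)  # css
print(full)   # css-cursor.techstream.org
```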