Strange regex behaviour: Advanced regular expressions. Extracting URL Domains

Hi Community,

I’m wondering if it is really possible that I’ve come across some regex bug.

Screen Link: https://app.dataquest.io/m/369/advanced-regular-expressions/8/extracting-domains-from-urls

My Code:

pattern = r"([^https?://][a-z.-]*)"

test_urls_clean = test_urls.str.extract(pattern, flags=re.I)

What I wanted:

I just followed the recommendations provided for Screen 8. Everything worked fine for the test_urls_clean series except for one little domain, phys.org.

What actually happened:

I cannot quite understand why, given my pattern, the `phys.org` domain has been extracted as 'ys.org'. Other URL domains came up perfectly in full.

[Screenshot of the test_urls_clean output; see item 5 in the table.]

I’m curious to know whether other students have come across this ‘bug’.

@homothety12345

What I noticed with the above pattern is that [^https?://] is a negated character class: it matches a single character that is not any of h, t, p, s, ?, :, or /. The match therefore cannot begin on any character from that set; it starts at the first character outside it, and only then does [a-z.-]* take over and match the rest.

For example, if you use www.phys.org... instead, it matches correctly, because w is not in the excluded set.
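To make this concrete, here is a quick check with plain re.search (the URLs are just made up for illustration) showing where each match begins:

import re

pattern = r"[^https?://][a-z.-]*"  # the negated class excludes h, t, p, s, ?, :, /

m = re.search(pattern, "http://phys.org/news")
print(m.group())   # ys.org -- 'p' and 'h' are excluded, so the match begins at 'y'

m = re.search(pattern, "http://www.phys.org/news")
print(m.group())   # www.phys.org -- 'w' is allowed, so the match begins there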

To test this more broadly, I tried the following on a full series and got this result.

import re
import pandas as pd
test_urls = pd.Series([
 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
 'http://www.interactivedynamicvideo.com/',
 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
 'http://evonomics.com/advertising-cannot-maintain-internet-heres-solution/',
 'HTTPS://github.com/keppel/pinn',
 'Http://phys.org/news/2015-09-scale-solar-youve.html',
 'https://iot.seeed.cc',
 'http://www.bfilipek.com/2016/04/custom-deleters-for-c-smart-pointers.html',
 'http://beta.crowdfireapp.com/?beta=agnipath',
 'https://www.valid.ly?param',
 'http://css-cursor.techstream.org',
 'https://sss-sss.org',
 'http://ssssss.org'
])

pattern = r"([^https?://][a-z.-]*)"  # first matched character must not be h, t, p, s, ?, :, or /

test_urls_clean = test_urls.str.extract(pattern, flags=re.I)
print(test_urls_clean)

Output:
                                  0
0                    www.amazon.com
1   www.interactivedynamicvideo.com
2                   www.nytimes.com
3                     evonomics.com
4                        github.com
5                            ys.org
6                      iot.seeed.cc
7                  www.bfilipek.com
8             beta.crowdfireapp.com
9                      www.valid.ly
10        css-cursor.techstream.org
11                         -sss.org
12                             .org

Hello @homothety12345,

There is a similar topic to yours here.
Hopefully, this will pique your interest!


Hi,

Thank you for your answer! It has been enlightening, especially the test URLs (sss-sss.org and ssssss.org). When testing my pattern, I completely forgot that I could cook up some test URLs of my own.

However, I have also noticed something: even though the negated set at the start of my pattern ([^https?://]) ‘spoils’ phys.org, the letters p and h also appear in other URLs (e.g. beta.crowdfireapp.com or www.bfilipek.com), so I would have expected those to be clipped by the negated set as well.

I also admit that, as a data science beginner, I may not be aware of some of the deeper intricacies of regex, which can look quite convoluted to an undiscerning eye at first sight.

I also have a side question about Screen 9 of Advanced Regular Expressions: https://app.dataquest.io/m/369/advanced-regular-expressions/9/extracting-url-parts-using-multiple-capture-groups

I have finally come up with this pattern: r"(https?)://([\w\.\-]+)/?([\w\.\-\/\=\?]+)?"
The pattern has the third capture group set as optional. As a result, it returns NaN whenever that group has nothing to match. With this pattern, my result differed from the exercise answer only in that the resulting dataframe contained NaN values instead of empty cells.

Therefore, the question is whether there is some reason to prefer empty cells over NaN values.
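Here is a minimal reproduction of what I mean, using just two of the URLs from earlier:

import re
import pandas as pd

urls = pd.Series(['https://iot.seeed.cc',
 'http://beta.crowdfireapp.com/?beta=agnipath'])

pattern = r"(https?)://([\w\.\-]+)/?([\w\.\-\/\=\?]+)?"
print(urls.str.extract(pattern, flags=re.I))
# column 2 is NaN for iot.seeed.cc (no path to match),
# but '?beta=agnipath' for the crowdfireapp URL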

Hello @homothety12345,

Great work, as your pattern captures the domain correctly.


To access just the domain name, you can select test_urls_clean[[1]]; or, better yet, capture only the domain part with pattern = 'https?://([.\-\w]+)/?.*'
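For instance, a rough sketch on a couple of the URLs above, just to show the shape of the result:

import re
import pandas as pd

urls = pd.Series(['HTTPS://github.com/keppel/pinn',
 'Http://phys.org/news/2015-09-scale-solar-youve.html'])

pattern = r'https?://([.\-\w]+)/?.*'
domains = urls.str.extract(pattern, flags=re.I)
print(domains)  # a single column holding github.com and phys.org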


Now for the Screen 9 link, you can tweak your regex to keep the third capture group simple:
pattern = '(https?)://([.\-\w]+)/?(.*)'
This should take care of the NaNs you were getting earlier!
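A quick sanity check of that tweak (same two sample URLs, just for illustration):

import re
import pandas as pd

urls = pd.Series(['https://iot.seeed.cc',
 'http://phys.org/news/2015-09-scale-solar-youve.html'])

pattern = r'(https?)://([.\-\w]+)/?(.*)'
print(urls.str.extract(pattern, flags=re.I))
# column 2 now holds '' (an empty string, not NaN) for iot.seeed.cc,
# because (.*) can match zero characters and still count as a match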

Hi Sanjeeve,
Thank you for the detailed explanation. Having that third capture group free of the ‘?’ has fixed the NaN problem. I just wonder why one might want a dataframe with empty cells rather than NaNs, which clearly say “no data was available for this portion of the URL.”


As soon as the scan reaches a character that is not in the first set, the negated class matches that one character, and matching continues with the second set, [a-z.-]*.

For the www.phys.org instance above, the first such character is the w in www., so the match starts right at the beginning of the domain.

You can write it as r"(https?)://([\w\.\-]+)/?([\w\.\-\/\=\?]*)?". This takes care of the NaN, since * matches zero or more characters; with +, the group has to match at least one. This pattern works for test_url_parts but is still not general enough to match url_parts.
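To see the two quantifiers side by side, here is a small sketch on one path-less URL:

import re
import pandas as pd

url = pd.Series(['https://sss-sss.org'])

plus_pattern = r"(https?)://([\w\.\-]+)/?([\w\.\-\/\=\?]+)?"  # + needs at least one character
star_pattern = r"(https?)://([\w\.\-]+)/?([\w\.\-\/\=\?]*)?"  # * accepts zero characters

print(url.str.extract(plus_pattern, flags=re.I))  # group 3 comes out as NaN
print(url.str.extract(star_pattern, flags=re.I))  # group 3 comes out as '' (empty string)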

Experimentation is important for learning; you are doing a nice job!


Nice! Those NaNs kept puzzling you.

Off the top of my head, I would say that if the pattern can match the absence of characters, then you get an empty string (as with the * in capture group 3, (.*)), because the group still participates in the match.

If you use a + instead (as in your original pattern), the group needs at least one matching character; when there is none, the capture group is left unsatisfied, and hence a NaN!

Experiment more and you’ll see it for yourself.
Keep it up, this thread was a good one.
