Data Cleaning in Python: Advanced// Advanced Regex Q8

Screen Link:

I’m not sure what’s wrong with the regex below. I’m trying to extract the domain name.
I’m trying to say that the domain name has to be preceded by (http[s]?) and followed by //?
What am I getting wrong, and which lesson should I review?

My Code:

pattern = r'(?<= http[s]? //)([\w+])(? //)'
test_urls_clean = test_urls.str.extract(pattern, flags=re.I)


What I expected to happen:
Extract domain name from the below:

test_urls = pd.Series([
    'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
    'http://www.interactivedynamicvideo.com/',
    'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
    'http://evonomics.com/advertising-cannot-maintain-internet-heres-solution/',
    'HTTPS://github.com/keppel/pinn',
    'Http://phys.org/news/2015-09-scale-solar-youve.html',
    'https://iot.seeed.cc',
    'http://www.bfilipek.com/2016/04/custom-deleters-for-c-smart-pointers.html',
    'http://beta.crowdfireapp.com/?beta=agnipath',
    'https://www.valid.ly?param',
    'http://css-cursor.techstream.org'
])

What actually happened:
Error: error: unexpected end of pattern


Hello @kovarski.n,

It would be really helpful if you could post the screen link associated with your question.
That lets anyone trying to answer work on your question right in the browser terminal.

Your pattern rightly tries to split a URL into 3 groups,
(https://)(1domain-name.com)(/don't-care-about-the-rest).
But it isn’t working as you expected.

I would suggest taking one simple URL at a time and working out its capture groups:
https://iot.seeed.cc >> (https://) (iot.seeed.cc)

Then try the pattern on another one that has a non-word character, such as the -,
http://css-cursor.techstream.org >> (http://) (css-cursor.techstream.org)

Then try it on yet another one that stands out at a glance (uppercase protocol, trailing path),
HTTPS://github.com/keppel/pinn >> (HTTPS://) (github.com) (/keppel/pinn)

As you build on the capture group for each case, you will see what is necessary.
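
The one-URL-at-a-time approach can be sketched with Python's re module. The helper name split_url and the first-draft pattern below are hypothetical, chosen only to illustrate the workflow; they are not the exercise's answer. The draft handles two of the three sample URLs and cuts the hyphenated domain short, which is the kind of gap this process is meant to expose.

```python
import re

# Hypothetical first draft: (protocol)(domain)(optional rest of the URL).
# The domain group only allows word characters and dots.
draft = r"(https?://)([\w.]+)(/.*)?"

samples = [
    "https://iot.seeed.cc",
    "http://css-cursor.techstream.org",
    "HTTPS://github.com/keppel/pinn",
]

def split_url(url, pattern=draft):
    """Return the capture groups for one URL, or None if nothing matches."""
    match = re.search(pattern, url, flags=re.I)
    return match.groups() if match else None

for url in samples:
    print(url, ">>", split_url(url))

# https://iot.seeed.cc >> ('https://', 'iot.seeed.cc', None)
# http://css-cursor.techstream.org >> ('http://', 'css', None)
# HTTPS://github.com/keppel/pinn >> ('HTTPS://', 'github.com', '/keppel/pinn')
```

The second line of output shows the domain stopping at the hyphen, so the domain group needs to allow one more character.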

If you haven’t tried regexr yet, you may want to give it a try; it is a handy place to test a pattern.

pattern = r'(?<=//)?(\w+.?\w+.\w+)(?=/)?'

Here’s where I got to.
I still can’t get this one right:
http://css-cursor.techstream.org

I have checked the right answer, but I don’t get why it is right. I’m not sure what the ‘-’ part means.

I would appreciate additional guidance.

Thank you.
Nikolai

Hello @kovarski.n,

That was a good workout for me, and I learnt something in return :grinning:
Pattern (?<=//)([.\-\w]*)(?=/?) should do.

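Group 1 (?<=//)

…is a lookbehind: it requires // immediately before the match, but leaves // (and everything preceding it) out of the result.
You’ve taken care of this, except for the extra ? following it. It’s not required, because it makes the lookbehind optional.
The reason for the lookbehind is that a URL will begin with http(s)://; your regex should recognize where that prefix ends, yet not capture it.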
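
A related point about the lookbehind, sketched below: Python's built-in re module only accepts lookbehinds of a fixed width, so putting http[s]? inside the lookbehind (which could be 7 or 8 characters wide with the ://) is rejected, while the two-character // is fine. The sample URL is taken from the question; the variable names are illustrative.

```python
import re

url = "HTTPS://github.com/keppel/pinn"

# A lookbehind that may be 7 or 8 characters wide is rejected by the re module.
try:
    re.compile(r"(?<=https?://)[.\-\w]*")
    variable_width_ok = True
except re.error as exc:
    variable_width_ok = False
    print("re.error:", exc)

# A fixed two-character lookbehind compiles and starts the match right after //.
domain = re.search(r"(?<=//)[.\-\w]*", url).group()
print(domain)  # github.com
```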

Group 2 ([.\-\w]*) captures the domain, which is what you intended to capture.

See the escaping of -?
- has a special meaning within [ ]: it denotes a range such as 0-9 or a-z, hence it should be escaped with \ when you mean a literal hyphen.
Also, a . can be part of the domain name, so it is included within the square brackets [ ].
When a . is within [ ], it’s just a literal . (not the regex wildcard .).
So the allowable characters in a domain name are \w, . and -, and the * after the set [.\-\w] allows any combination of these three, of any length.
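
To make the role of the escaped - concrete, here is a small sketch using only the re module and the hyphenated host from the question. It compares \w+ with the set [.\-\w]*, and shows what happens when the hyphen is left unescaped between two set members. The variable names are illustrative.

```python
import re

host = "css-cursor.techstream.org"

# \w covers letters, digits and underscore only, so the match stops at the hyphen.
word_only = re.match(r"\w+", host).group()
print(word_only)  # css

# With . and an escaped - in the set, the match runs through the whole host.
full = re.match(r"[.\-\w]*", host).group()
print(full)  # css-cursor.techstream.org

# An unescaped - between two set members is read as a range,
# and a range ending in \w is not valid.
try:
    re.compile(r"[.-\w]")
    range_rejected = False
except re.error as exc:
    range_rejected = True
    print("re.error:", exc)
```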

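Group 3 (?=/?)

…is a lookahead: it looks for a possible / following the domain name without capturing it, because it’s only the domain name you want.
As you might already know by now, /? matches 0 or 1 occurrence of /. Since the / is optional, this lookahead succeeds whether or not a / follows, so it is the character set in Group 2 that actually stops the match at the first /.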
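
Putting the three parts together, here is a minimal sketch of the extraction on the test_urls Series from the question. It assumes pandas is installed. The expand=False argument is an addition for convenience, so that a Series comes back instead of a one-column DataFrame. The last lines compare the result with the same pattern minus the optional lookahead.

```python
import re
import pandas as pd

test_urls = pd.Series([
    'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
    'http://www.interactivedynamicvideo.com/',
    'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
    'http://evonomics.com/advertising-cannot-maintain-internet-heres-solution/',
    'HTTPS://github.com/keppel/pinn',
    'Http://phys.org/news/2015-09-scale-solar-youve.html',
    'https://iot.seeed.cc',
    'http://www.bfilipek.com/2016/04/custom-deleters-for-c-smart-pointers.html',
    'http://beta.crowdfireapp.com/?beta=agnipath',
    'https://www.valid.ly?param',
    'http://css-cursor.techstream.org'
])

pattern = r"(?<=//)([.\-\w]*)(?=/?)"

# expand=False returns a Series rather than a one-column DataFrame.
domains = test_urls.str.extract(pattern, flags=re.I, expand=False)
print(domains.tolist())

# Same pattern without the optional lookahead, for comparison.
no_lookahead = test_urls.str.extract(r"(?<=//)([.\-\w]*)", flags=re.I, expand=False)
print(domains.equals(no_lookahead))  # True
```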

Let me know if my explanation helps!

Hello again @kovarski.n,

Apparently there’s a simpler regex that you could use, and it’s in another similar post.