What I noticed with the above pattern was that all the characters in that set were excluded from the match at the beginning. The exclusion stopped as soon as the pattern found a character not in the set; that is when it started matching with [a-z.-]*.
For example, if you pass in www.phys.org..., it matches correctly.
To test this, I tried the following and got this result.
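Roughly along these lines (a sketch only, using Python's re module; the exact original pattern isn't quoted here, so this assumes a negated set followed by [a-z.-]*):

```python
import re

# Assumed shape of the original pattern: a negated set, then [a-z.-]*.
# [^https?://] excludes the individual characters h, t, p, s, ?, :, and /
# (not the literal string "https?://").
pattern = r"[^https?://]+[a-z.-]*"

print(re.search(pattern, "www.phys.org").group())           # www.phys.org  (the first set matches 'www.')
print(re.search(pattern, "phys.org").group())                # ys.org        (the leading 'p' and 'h' are skipped)
print(re.search(pattern, "beta.crowdfireapp.com").group())   # beta.crowdfireapp.com
```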
Thank you for your answer! It has been enlightening, especially the test URLs (sss-sss.org and sss.org). When testing my pattern, I completely forgot that I could cook up some test URLs of my own.
However, I have also noticed that even though the negative set at the start of my pattern ([^https?://]) ‘spoils’ phys.org, the letters p and h are also present in the other URLs (e.g. beta.crowdfireapp.com or www.bfilipek.com), and I would have expected those to be eliminated by that same negative set as well.
I also admit that, as a data science beginner, I may not be aware of some of the deeper intricacies of regex, which can look quite convoluted to an undiscerning eye at first sight.
I have finally come up with this pattern: r"(https?)://([\w\.\-]+)/?([\w\.\-\/\=\?]+)?"
The pattern has the third capture group set as optional. As a result, it returns NaN whenever that group has nothing to match. With this pattern, my result differed only in that the resulting dataframe contained NaN values instead of the empty cells suggested by the answer to the exercise.
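To make that concrete, here is a small sketch of what I mean (the test URLs are ones I made up, and I am assuming the extraction is done with pandas' Series.str.extract):

```python
import pandas as pd

# Made-up test URLs; the real exercise data may differ.
test_urls = pd.Series([
    "https://www.phys.org/news/",   # has something after the domain
    "http://sss.org",               # nothing after the domain
])

pattern = r"(https?)://([\w\.\-]+)/?([\w\.\-\/\=\?]+)?"
test_urls_clean = test_urls.str.extract(pattern)
print(test_urls_clean)
# The third column is 'news/' for the first URL but NaN for the second,
# because the optional +-group cannot participate when nothing follows the domain.
```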
Therefore, the question is whether it is better, for some reason, to have empty cells instead of NaN values.
Great, your pattern captures the domain correctly.
To access just the domain name, you can say test_urls_clean[[1]], or better yet, capture only the domain part with pattern = 'https?://([.\-\w]+)/?.*' (see the sketch below).
Now, for the screen 9 link, you can tweak your regex to keep the 3rd capture group simple: pattern = '(https?)://([.\-\w]+)/?(.*)'
This should take care of the NaNs you were getting earlier!
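A quick sketch of both suggestions (using the same kind of made-up test URLs as above, so treat the exact values as assumptions):

```python
import pandas as pd

test_urls = pd.Series(["https://www.phys.org/news/", "http://sss.org"])

# Tweaked pattern: (.*) can match zero characters, so a URL with no path
# gives an empty string in the third column instead of NaN.
pattern = r"(https?)://([.\-\w]+)/?(.*)"
test_urls_clean = test_urls.str.extract(pattern)
print(test_urls_clean)

# Just the domain column from the extracted frame, or the domain in one step:
print(test_urls_clean[[1]])
print(test_urls.str.extract(r"https?://([.\-\w]+)/?.*"))
```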
Hi Sanjeeve,
Thank you for the detailed explanation. Having that third capture group free from ‘?’ has fixed the ‘NaN’ problem. I just wonder why one might need a dataframe with empty cells, rather than NaNs clearly saying “no data was available for this portion of the URL.”
As soon as the pattern meets a character that the first set does not match (one of h, t, p, s, ?, :, or /), it stops matching with that set and starts matching with the second set.
For the instance above, it stops matching with the first set right after www., because the p that follows is one of the excluded characters.
You can write it as: r"(https?)://([\w\.\-]+)/?([\w\.\-\/\=\?]*)?" . This takes care of the NaN, since you now match zero or more characters; with the +, you have to match at least one. This pattern works for test_url_parts but is still not general enough to match url_parts.
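To see the difference between the two quantifiers, compare them on a URL with nothing after the domain (the URL here is just an illustration):

```python
import re

url = "http://sss.org"   # nothing after the domain

with_plus = re.search(r"(https?)://([\w\.\-]+)/?([\w\.\-\/\=\?]+)?", url)
with_star = re.search(r"(https?)://([\w\.\-]+)/?([\w\.\-\/\=\?]*)?", url)

print(with_plus.group(3))  # None -> shows up as NaN in the dataframe; + needs at least one character
print(with_star.group(3))  # ''   -> an empty match still counts, so you get an empty string instead
```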
Experimentation is important for learning; you are doing a nice job!
Off the top of my head, I would say that if the pattern can match the non-existence of characters, then you get an empty string (like with the * in capture group 3, (.*)), because it still counts as a match.
If you were to use a + (as in your search pattern), then the group looks for at least one of the search characters, which means the capture group is not satisfied when there is nothing left to match, and hence a NaN!
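And if you do decide you prefer empty cells over NaNs (or the other way round), you can always convert after extracting; a tiny sketch, assuming test_urls_clean is the dataframe returned by str.extract:

```python
# test_urls_clean is an assumed name for the dataframe returned by str.extract.
test_urls_clean = test_urls_clean.fillna("")   # NaN -> empty cells
```

One practical reason to keep the NaNs is that pandas treats them as missing data, so things like isna() and dropna() work on them directly, whereas empty strings are just ordinary values.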
Experiment more, you’ll see it yourself.
Keep it up, this thread was a good one.