VERY unclear about why this doesn't work as a solution

Hi! I am VERY confused. I used https://regexr.com/ to check both my code and the answer for the page below. When I input the ‘solution’ that is listed on the page, I get an error on https://regexr.com/. However, when I input the code below, it works on https://regexr.com/ with the sample data, but then causes an error in this exercise. How can this be? Is there something wrong with the pattern I created? It’s hard to know what is right, or whether this lesson contains accurate information.

Screen Link: Learn data science with Python and R projects

My Code:

pattern = r'[[^\/]([\w\-.]*(?:org|com|cc|ly))'

The error comes from how regexr.com handles special characters in its environment versus the Dataquest Python environment. To get the solution to work on regexr.com, you simply need to escape the two forward slashes with backslashes (\/\/) and it should be fine.
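As a quick sanity check, here is a minimal sketch showing that Python's re module treats an escaped and an unescaped forward slash the same way, so the escaping only matters for tools that (as far as I can tell) use / as a pattern delimiter. The URL and the two short patterns are made up for illustration and are not the lesson's solution.

```python
import re

# Made-up sample URL, not taken from the exercise data.
url = "https://www.example.com/page"

# Two hypothetical prefixes: one with plain slashes, one with escaped slashes.
unescaped = r"https?://"
escaped = r"https?:\/\/"

m1 = re.match(unescaped, url)
m2 = re.match(escaped, url)

# In Python, "\/" is just another way of writing a literal "/".
same = m1.group(0) == m2.group(0)
print(m1.group(0), m2.group(0), same)
```

So a pattern with escaped slashes should behave the same in Python while also being accepted by tools that complain about bare slashes.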

Remember that the sample data is not as complex as the actual data in the exercise. For example, in the sample data, domain names end in either .org, .com, .cc, or .ly but in the actual dataset, websites can also end in .sm and .net as well as others. Therefore, it’s possible for a pattern to work on the sample but not on the actual data.
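To make this concrete, here is a small sketch that runs your pattern against two made-up URLs (they are assumptions, not rows from the dataset). I silence FutureWarning because recent Python versions may warn about the [[ at the start of the pattern.

```python
import re
import warnings

# The pattern from the original post, unchanged.
pattern = r'[[^\/]([\w\-.]*(?:org|com|cc|ly))'

with warnings.catch_warnings():
    # Python may emit a FutureWarning for "[[" inside a character set.
    warnings.simplefilter("ignore", FutureWarning)
    compiled = re.compile(pattern)

# Hypothetical URLs: one ending seen in the sample data, one that is not.
urls = [
    "http://www.example.com/page",
    "http://www.example.net/page",
]

# True where the pattern finds a match, False where it does not.
results = [bool(compiled.search(u)) for u in urls]
print(results)
```

The .com URL matches, while the .net URL does not, because .net is not in the hard-coded list of endings.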

I’m also curious about the first token in your pattern: [[^\/]. What is it supposed to accomplish? According to regexr.com, this character set matches a single literal [ or ^ or /. It’s unclear to me why this is useful for finding websites in our dataset. I think a better approach is to begin your pattern with something like http, since all websites in our dataset begin with these characters.
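If you want to confirm this in Python rather than on regexr.com, a minimal sketch like the following tests the first token against a few single characters. The list of test characters is my own choice for illustration.

```python
import re
import warnings

# Only the first token of the pattern from the original post.
first_token = r"[[^\/]"

with warnings.catch_warnings():
    # Python may emit a FutureWarning for "[[" inside a character set.
    warnings.simplefilter("ignore", FutureWarning)
    token_re = re.compile(first_token)

# Check which single characters the token accepts.
matches = {ch: bool(token_re.fullmatch(ch)) for ch in ["[", "^", "/", "h", "w"]}
print(matches)
```

Because the ^ is not the first character inside the brackets, it is treated as a literal caret rather than a negation, so the set accepts only [, ^, and /.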

I think your pattern is a little too rigid and doesn’t allow for enough variety to match cases that aren’t included in the sample data. I would suggest reworking the beginning and end of your pattern to get better results. The token in the middle looks good though! (i.e. [\w\-.])
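As one possible direction (a rough sketch, not the lesson's official answer), you could anchor the start on http or https and drop the fixed list of endings, keeping your middle token. The pattern name loose and the URLs below are hypothetical.

```python
import re

# Hypothetical looser pattern: anchored on the protocol, no fixed endings.
loose = r"https?://([\w\-.]+)"

# Made-up URLs with endings outside the sample data.
urls = [
    "http://www.example.com/page",
    "https://example.net",
    "http://sub-domain.example.sm/x?y=1",
]

# Capture group 1 holds everything up to the first character
# that is not a word character, hyphen, or dot.
domains = [re.search(loose, u).group(1) for u in urls]
print(domains)
```

This version does not care which ending a domain has, so it should cope with endings that never appear in the sample.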

Okay. Well then, why would the test set not include an inclusive set of all possible domain endings? Also not sure why the proposed tool (regexr) would work with wrong solutions, but not work for actual solutions. I think this lesson needs to be reviewed TBH.

I think this is the case because a “test set” is always smaller than the actual data. It’s just meant to get us started on finding a solution and is not meant as a guarantee that “if it works on the test data then it’s for sure going to work on the actual data.” If a test set included every possible situation that the actual data has, wouldn’t we be better off just using the actual data in the first place? :man_shrugging: