Advanced Regex Expressions - Q. 5

I’m curious about a specific line of code in the solution from Advanced Regular Expressions (screen 5).

Screen Link: https://app.dataquest.io/m/369/advanced-regular-expressions/5/using-lookarounds-to-control-matches-based-on-surrounding-text

Here is the pattern from the solution that I’m curious about:
pattern = r"(?<!Series\s)\b[Cc]\b((?![+.])|\.$)"

Here is the code I used: pattern = r"(?<!Series\s)\b[Cc]\b(?![+.])"

I got the correct answer with my code, so I’m curious what the final few characters add: |\.$

My best guess is that it’s meant to match (or not match) the previous expression only at the end of the string, i.e. where a title ends in C. or c. It seems a bit superfluous in that case, since the preceding expression should already eliminate those cases?


Hi @ryan.wetherbie,

Welcome to the community.
I’ve checked my code: I used a regex similar to the one you wrote, and the answer checker gave me a green light, so I didn’t check the regex given in the solution. It looks like there was already a discussion going on about this.

Please have a look.


Hi @ryan.wetherbie,

I was about to post the same topic when I saw your post.

The pattern you used is similar to what I had:

pattern1 = r"(?<!Series\s)\b[Cc]\b(?![+.])"

While this pattern worked for this particular exercise, I realized that the proposed answer provided by Dataquest:

pattern2 = r"(?<!Series\s)\b[Cc]\b((?![+.])|\.$)"

is more correct in general.

This is because pattern1 will not match cases where the character [Cc] comes at the end of a sentence and is immediately followed by a period "." (because of the negative lookahead (?![+.])).

Hence, a string with the following value:

string1 = "I find it difficult to learn C."

will not be matched by pattern1, but it will be matched by pattern2.

This is because pattern2 tells the program to match instances where [Cc] is not followed by the characters "+" or ".", as expressed by the negative lookahead (?![+.]), OR where [Cc] is followed by a "." that is itself immediately followed by the end of the line or the string (\.$).

pattern1 works for this exercise because it just so happens that the data set we’re working with does not have cases such as the string1 example I gave where "C" is at the end of the sentence.
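
To make the difference concrete, here is a small sketch that runs both patterns over a few strings. Apart from string1, the sample strings are made up for illustration and are not from the course data set; the helper name matches is my own.

```python
import re

pattern1 = r"(?<!Series\s)\b[Cc]\b(?![+.])"
pattern2 = r"(?<!Series\s)\b[Cc]\b((?![+.])|\.$)"

# Only the first string comes from the post above; the rest are made-up examples.
samples = [
    "I find it difficult to learn C.",   # "C." at the very end of the string
    "Why C is still relevant",           # plain standalone C
    "Learning C++ in 2020",              # C followed by "+"
    "Startup raises Series C funding",   # C preceded by "Series "
    "Using C.H.I.P. boards",             # "C." in the middle of the string
]

def matches(pattern, text):
    """Return True if the pattern is found anywhere in text."""
    return re.search(pattern, text) is not None

results = {s: (matches(pattern1, s), matches(pattern2, s)) for s in samples}
for s, (m1, m2) in results.items():
    print(f"{s!r}: pattern1={m1}, pattern2={m2}")
```

Only the first string is treated differently: pattern1 rejects it and pattern2 accepts it. The two patterns agree on all the others, including the "C." in the middle of a string, which both reject.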

I hope the instructions and the prompt for this screen are updated, because they are really confusing and took more time to figure out than I would have preferred.


I also had problems with Q5. I consider the instructions imprecise and ambiguous, and they need to be improved.

  • Exclude instances where it is followed by a . or + character, without removing instances where the match occurs at the end of the sentence.
  • “the match”: it is not well defined what counts as “the match” here. The wording implies that the end of a sentence is marked by a dot. What is actually meant is the end of the string (the title), not the end of a sentence inside a particular string.
  • Title strings do not end with a dot. The instruction therefore assumes a situation that is not present in the data, so it is very unlikely that anyone would arrive at the one intended answer from the titles alone. That’s why (?<!Series\s)\b[Cc]\b((?![.+])|\.$) and (?<!Series\s)\b[Cc]\b(?![.+]) do not differ in their results.
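
A rough sketch of that last point, using hypothetical titles (not taken from the course data set): as long as no title ends in "C.", both patterns select exactly the same titles, and they only diverge once such a title is added.

```python
import re

pattern1 = r"(?<!Series\s)\b[Cc]\b(?![.+])"
pattern2 = r"(?<!Series\s)\b[Cc]\b((?![.+])|\.$)"

# Hypothetical titles; none of them ends in "C."
titles = [
    "Show HN: A tiny C compiler",
    "Modern C++ best practices",
    "Acme closes Series C round",
    "Why I still write C",
    "Objective-C vs. Swift",
]

def select(pattern, items):
    """Keep the items in which the pattern is found."""
    return [t for t in items if re.search(pattern, t)]

hits1 = select(pattern1, titles)
hits2 = select(pattern2, titles)
print(hits1 == hits2)  # same selection for both patterns

# Add one made-up title that does end in "C." and compare again.
extended = titles + ["Things I learned from K&R C."]
only_in_pattern2 = [t for t in select(pattern2, extended)
                    if t not in select(pattern1, extended)]
print(only_in_pattern2)
```

So on data like this the answer checker cannot tell the two patterns apart, which is consistent with both being accepted.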


I tried to use the solution code to select a title with "C." at the end of the sentence, but I couldn’t get it to work.
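
It isn't clear from the post why the attempt failed, so this is only a sketch of two possible causes, using made-up strings. The \.$ branch only fires when the "." is the last character of the string (or sits right before a final newline). A trailing space after the period, or a "C." in the middle of the title, will not match.

```python
import re

pattern = r"(?<!Series\s)\b[Cc]\b((?![+.])|\.$)"

# Made-up strings to probe where "$" is allowed to match.
candidates = {
    "ends in C.": "A short introduction to C.",
    "trailing space": "A short introduction to C. ",
    "trailing newline": "A short introduction to C.\n",
    "C. mid-title": "C. elegans genome sequenced",
}

found = {label: bool(re.search(pattern, text))
         for label, text in candidates.items()}
for label, hit in found.items():
    print(f"{label}: {hit}")
```

If none of the titles in the data set end in "C." in the first place, the filter will simply come back empty, which would match what the earlier replies describe.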