Confusion on Word Boundary

Screen Link:

My Code:

pattern = r"\b[Jj]ava\b"
java_titles = titles[titles.str.contains(pattern)]

test = titles[titles.str.contains(r"using\?")]
test1 = titles[titles.str.contains(r"using\?\b")]
test2 = titles[titles.str.contains(r"\busing\?\b")]

print(titles[4576])
print(titles[11402])

Output:

Ask HN: Which linux/unix C++/C IDE are you using?
Ask HN: Moving Out of Silicon Valley because of housing? Where to?
  1. I thought that test and test1 would have the same result. Why aren’t titles[4576] and titles[11402] included in test1?
  2. Why doesn’t test2 include titles[4576]? In this lesson, we were able to get the titles with the word Java at the end of the string. I tried to apply that understanding to test2, but I’m not sure where I went wrong.


Thank you in advance!

I would first recommend going through the Accessing the Matching Text with Capture Groups Screen again and understanding what r (raw string) really does. Then compare that to your pattern - what does \? do?


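A word boundary (\b) is a position, not a character. It sits between a word character (\w, which for ASCII text is [0-9A-Za-z_]) and a non-word character (\W), or at the start or end of the string when the adjacent character is a word character.
\? matches a literal question mark, and ? is a non-word character.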
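
To make the definition concrete, here is a small sketch that lists every position where \b matches in the test string mentioned in this thread. It uses only the standard re module; the variable names are just for illustration.

```python
import re

text = "test using? test"

# Collect every index where \b matches, i.e. where exactly one of the
# two neighbouring characters is a word character.
boundaries = [m.start() for m in re.finditer(r"\b", text)]
print(boundaries)

# Index 10 (between "g" and "?") is a boundary.
# Index 11 (between "?" and the space) is not, because both neighbours
# are non-word characters.
print(10 in boundaries, 11 in boundaries)
```

So a \b placed right after \? can only match if a word character follows the question mark.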

So my understanding right now is that the raw string will prevent Python’s escape sequences (\b for word boundary instead of backspace), but now I’m not sure what \? does if r prevents it from using Python’s escape sequence of a question mark. My initial thought was that maybe \? becomes an optional backslash, but running code rejects that idea.

I also tried the pattern of \\busing\?\\b on a little test data that contains the string "test using? test" to see if I would get the result I expect, but I didn’t, and I’m not sure how to go about it next. I’m a little stumped.
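
A quick sketch of what the raw string does and does not change may help here. It assumes the doubled-backslash pattern was written as a raw string, which the post does not say explicitly.

```python
import re

# In a normal string literal, "\b" is Python's backspace (one character).
# In a raw string, r"\b" stays as backslash + b, which the regex engine
# then reads as a word boundary.
backspace_len = len("\b")
raw_len = len(r"\b")
print(backspace_len, raw_len)

# \? is interpreted by the regex engine, not by Python: a literal "?".
literal_q = re.search(r"using\?", "test using? test").group()
print(literal_q)

# Without the backslash, ? is a quantifier that makes the "g" optional.
optional_g = re.search(r"using?", "test usin test").group()
print(optional_g)

# Doubling the backslashes inside a raw string asks for literal
# backslash characters in the text, so nothing matches here.
doubled = re.search(r"\\busing\?\\b", "test using? test")
print(doubled)
```

In other words, the r prefix only stops Python from touching the backslashes; the regex engine still gives \b and \? their own meanings.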

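Adding to my previous comment:
\busing\?\b won’t match here because ? is a non-word character, so the final \b needs a word character right after the ?. In titles[4576] the ? is the last character, and in titles[11402] it is followed by a space, so there is no boundary at that position.
\busing\b\? would work for titles[4576], since the boundary then falls between g (a word character) and ? (a non-word character). It still skips titles[11402], because there is no boundary before using inside housing.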
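
Here is a minimal check of those patterns against the two titles from the question. It uses re.search directly instead of pandas so it runs on its own; Series.str.contains applies the same search semantics to each element by default. The dictionary keys are only labels for this sketch.

```python
import re

title_4576 = "Ask HN: Which linux/unix C++/C IDE are you using?"
title_11402 = ("Ask HN: Moving Out of Silicon Valley because of housing? "
               "Where to?")

patterns = {
    "test": r"using\?",
    "test1": r"using\?\b",
    "test2": r"\busing\?\b",
    "fixed": r"\busing\b\?",
}

# For each pattern, record whether it matches each of the two titles.
results = {
    name: (bool(re.search(p, title_4576)), bool(re.search(p, title_11402)))
    for name, p in patterns.items()
}

for name, (hit_4576, hit_11402) in results.items():
    print(name, hit_4576, hit_11402)
```

Only the last pattern keeps titles[4576] while dropping the "housing?" title, because its boundaries sit on either side of the word using rather than after the question mark.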


Ohhh. I understand the definition now. Thank you!