import re  # needed for the re.I flag used below

# My Solution
pattern = r"[Pp]ython ?([\d.]+)"
# DQ Solution
# pattern = r"[Pp]ython ([\d\.]+)"
py_versions = titles.str.extract(pattern, flags=re.I, expand=False)
py_versions_titles = titles[titles.str.contains(pattern, flags=re.I)]
py_versions_freq = dict(py_versions.value_counts())
What I expected to happen:
dict (<class 'dict'>)
What actually happened:
dict (<class 'dict'>)
Notice the counting differences for 4. There was some discussion on this in a closed post for a different question (Regex to extract Python versions).
From what I can tell, this line is the culprit:
19458 Transition to Python4 won't be like Python3(we've learned our lesson)
It appears in my solution, but not in the DQ solution. It does mention two different versions, though.
Perhaps the question isn’t explicit enough? Should there only be one instance of a Python version per title? That doesn’t make sense. Help?
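On the "one instance per title" question, here is a minimal sketch using only the standard `re` module and the title quoted above. It assumes the optional-space pattern from "My Solution". `re.search` stops at the first match, which is also how `Series.str.extract` behaves for each title, while `re.findall` returns every captured version.

```python
import re

# Title quoted in the post; pattern is the optional-space one from "My Solution".
title = "Transition to Python4 won't be like Python3(we've learned our lesson)"
pattern = r"[Pp]ython ?([\d.]+)"

# First match only, like Series.str.extract does per title.
first_version = re.search(pattern, title).group(1)

# Every match in the title (capture group only).
all_versions = re.findall(pattern, title)

print(first_version)  # 4
print(all_versions)   # ['4', '3']
```

So only the first version in a title is counted with `str.extract`; if every mention should count, pandas also offers `Series.str.extractall`.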
The instruction reads:
Write a regular expression pattern which will match python, followed by a space, followed by one or more digit characters or periods.
Emphasis added by me. You’re not dealing with this part correctly.
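To make the difference concrete, here is a small sketch comparing the two patterns from the original post. The second title is a made-up example, and the helper name `first_versions` is hypothetical; only the standard `re` module is used, so it runs without the exercise's `titles` Series.

```python
import re

loose = r"[Pp]ython ?([\d.]+)"   # space is optional ("My Solution")
strict = r"[Pp]ython ([\d.]+)"   # space is required, as the instruction words it

titles = [
    # Title quoted in the post.
    "Transition to Python4 won't be like Python3(we've learned our lesson)",
    # Hypothetical title, added for illustration.
    "Porting a codebase to Python 3.6",
]

def first_versions(pattern, titles):
    """Return the first captured version from each title that matches."""
    found = []
    for title in titles:
        match = re.search(pattern, title)
        if match:
            found.append(match.group(1))
    return found

loose_versions = first_versions(loose, titles)
strict_versions = first_versions(strict, titles)

print(loose_versions)   # ['4', '3.6']
print(strict_versions)  # ['3.6']
```

With the required space, "Python4" and "Python3" no longer match, which would explain the extra count for 4.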
Oh man. Well I appreciate you pointing that out. Your emphasis helps. Maybe DQ should update the exercise?
I’m still confused though. Isn’t the point to find Python mentions and their subsequent version numbers? This DQ solution leaves one out altogether. I guess we’re supposed to solve it as explicitly as possible?
That’s right. However, where do you draw the line? Would “Python three” count? What about something like “… project in Python 3 years ago. . .”?
As a human, you can tell which should be captured as versions, and which shouldn’t. But can you really?
- They worked on this project in Python 3 years ago.
- They worked on this project in Python three years ago.
So there are two layers here:
- Regex not being powerful enough to resolve all the ambiguities
- The ambiguity not being actually resolvable
You do the best you can keeping in mind the cost/benefit.
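As a quick illustration of the first layer, here is a sketch that runs the space-required pattern over the two example sentences above. It uses only the standard `re` module; the variable names are just for illustration.

```python
import re

pattern = r"[Pp]ython ([\d.]+)"

sentences = [
    "They worked on this project in Python 3 years ago.",
    "They worked on this project in Python three years ago.",
]

# Capture the "version" from each sentence, or None when nothing matches.
matches = [re.search(pattern, sentence) for sentence in sentences]
captured = [m.group(1) if m else None for m in matches]

print(captured)  # ['3', None]
```

The pattern reports "3" for the first sentence whether the writer meant Python 3 or three years, and it has nothing to say about the second one.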
Dang that’s a great point. So it hits a point where I should cut my losses and just use the bulk of the correct answers since we can’t sort them all without going one by one. Or using a more powerful computer (AI)?
Using machine learning for this could potentially be a solution, but it will rarely justify the cost. Let’s also not forget that sometimes the title just isn’t enough: it’s objectively ambiguous in a way that can’t be resolved, and no machine learning is going to be able to handle that perfectly.
Speaking of perfection, machine learning is anti-perfection by design. It just tries to be really, really good, never perfect. So you’d still miss some. What was done here is already very good, by my estimation.