RegEx Freq_Table Q3

https://app.dataquest.io/c/64/m/369/advanced-regular-expressions/3/using-capture-groups-to-extract-data

My Code:

# My Solution
pattern = r"[Pp]ython ?([\d.]+)"

# DQ Solution
# pattern = r"[Pp]ython ([\d\.]+)" 

py_versions = titles.str.extract(pattern, flags=re.I, expand=False)
py_versions_titles = titles[titles.str.contains(pattern, flags=re.I)]
py_versions_freq = dict(py_versions.value_counts())

What I expected to happen:

py_versions_freq
dict (<class 'dict'>)
{'3': 10,
 **'3.5': 4,**
 '2': 3,
 '3.6': 2,
 **'4': 2,**
 '3.5.0': 1,
 '1.5': 1,
 '2.7': 1,
 '8': 1}

What actually happened:

py_versions_freq
dict (<class 'dict'>)
{'3': 10,
 **'3.5': 3,**
 '2': 3,
 '3.6': 2,
 '3.5.0': 1,
 '1.5': 1,
 '2.7': 1,
 **'4': 1,**
 '8': 1}

Notice the counting differences for 3.5 and 4. There was some discussion of this in a closed post for a different question (Regex to extract Python versions).

From what I can tell, this line is the culprit:
19458 Transition to Python4 won't be like Python3(we've learned our lesson)
It is matched by my solution, but not by the DQ solution. It does mention two different versions, though.

Perhaps the question isn’t explicit enough? Should there only be one instance of a Python version per title? That doesn’t make sense. Help?
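To make the "two versions in one title" point concrete, here is a minimal sketch using plain `re` instead of pandas so that it runs on its own. It assumes that `Series.str.extract` keeps only the first match per string, which `re.search` mimics; the variable names are only for illustration.

```python
import re

# The pattern from "My Solution" above (space is optional).
pattern = re.compile(r"[Pp]ython ?([\d.]+)", flags=re.I)

# The title quoted above, without its row index.
title = "Transition to Python4 won't be like Python3(we've learned our lesson)"

# First match only: roughly what str.extract(..., expand=False) would keep.
first_version = pattern.search(title).group(1)

# Every match in the title: both versions show up here.
all_versions = pattern.findall(title)

print(first_version)
print(all_versions)
```

So even with the looser pattern, only the first version in the title would end up in the frequency table. Counting every mention would need something like `Series.str.extractall` instead.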

The instruction reads:

Write a regular expression pattern which will match Python or python, **followed by a space**, followed by one or more digit characters or periods.

Emphasis added by me. You’re not dealing with this part correctly.
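A quick way to see the effect of the space is to run both patterns from the question against the title in question. This is only a sketch with plain `re` (no pandas), and the helper `first_version` and the names `loose` and `strict` are made up for the example.

```python
import re

# Pattern from the question: the space is optional.
optional_space = re.compile(r"[Pp]ython ?([\d.]+)", flags=re.I)
# Pattern from the DQ solution: the space is required.
required_space = re.compile(r"[Pp]ython ([\d\.]+)", flags=re.I)

title = "Transition to Python4 won't be like Python3(we've learned our lesson)"

def first_version(pattern, text):
    """Return the first captured version string, or None if nothing matches."""
    match = pattern.search(text)
    return match.group(1) if match else None

loose = first_version(optional_space, title)   # matches "Python4"
strict = first_version(required_space, title)  # no "Python <digits>" in the title

print(loose, strict)
```

With the optional space the title contributes one extra count for '4'; with the required space it contributes nothing.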


Oh man. Well I appreciate you pointing that out. Your emphasis helps. Maybe DQ should update the exercise?

I’m still confused though. Isn’t the point to find Python mentions and their subsequent version numbers? This DQ solution leaves one out altogether. I guess we’re supposed to solve it as explicitly as possible?

That’s right. However, where do you draw the line? Would “Python three” count? What about something like “... project in Python 3 years ago ...”?

As a human, you can tell which should be captured as versions, and which shouldn’t. But can you really?

  • They worked on this project in Python 3 years ago.
  • They worked on this project in Python three years ago.

So there are two layers here:

  • Regex not being powerful enough to resolve all the ambiguities
  • The ambiguity not being actually resolvable

You do the best you can, keeping the cost/benefit trade-off in mind.
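The two example sentences can be run through the required-space pattern to see what a regex actually does with them. This is a small sketch with plain `re`; the list name `sentences` and the result name `captured` are just for illustration.

```python
import re

# Pattern from the DQ solution (space required).
pattern = re.compile(r"[Pp]ython ([\d.]+)", flags=re.I)

sentences = [
    "They worked on this project in Python 3 years ago.",
    "They worked on this project in Python three years ago.",
]

captured = []
for sentence in sentences:
    match = pattern.search(sentence)
    # Keep the captured group, or None when the pattern does not match.
    captured.append(match.group(1) if match else None)

print(captured)
```

The first sentence yields '3' whether or not a version was meant, and the second yields nothing. The pattern cannot tell "Python 3, years ago" from "Python, 3 years ago".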


Dang, that’s a great point. So at some point I should cut my losses and just use the bulk of the correct answers, since we can’t sort them all without going one by one. Or should I use something more powerful, like AI?

Using machine learning for this could potentially be a solution, but the cost will rarely be justified. Let’s also not forget that sometimes the title just isn’t enough: it’s objectively ambiguous in a way that can’t be resolved, and no machine learning is going to be able to handle that perfectly.

Speaking of perfection, machine learning is anti-perfection by design. It just tries to be really, really good, never perfect. So you’d still miss some. What was done here is already very good, by my estimation.
