Removing words less than 6 characters

My Code:
question_overlap = []
terms_used = set()

#Sort jeopardy dataframe by ascending air date 
sorted_jeopardy = jeopardy.sort_values(by='Air Date')
import re
#Loop through each row of dataframe
for i, row in jeopardy.iterrows():
    split_question = row['clean_question'].split()
    split_question = re.sub(r'[\w{6,}]', '', split_question)
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    terms_used = set.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
    jeopardy['question_overlap'] = question_overlap
    print(jeopardy['question_overlap'].mean())

What I expected to happen:

The mean of question_overlap to print.

What actually happened:

TypeError                                 Traceback (most recent call last)
<ipython-input-20-2b9cd06f8b25> in <module>()
      8 for i, row in jeopardy.iterrows():
      9     split_question = row['clean_question'].split()
---> 10     split_question = re.sub(r'[\w{6,}]', '', split_question)
     11     match_count = 0
     12     for word in split_question:

/dataquest/system/env/python3/lib/python3.4/re.py in sub(pattern, repl, string, count, flags)
    177     a callable, it's passed the match object and must return
    178     a replacement string to be used."""
--> 179     return _compile(pattern, flags).sub(repl, string, count)
    180 
    181 def subn(pattern, repl, string, count=0, flags=0):

TypeError: expected string or buffer

The instructions say to:

* Remove any words in `split_question` that are less than `6` characters long.

Can't this be done with a regex pattern? I used:

split_question = re.sub(r'[\w{6,}]', '', split_question)


You are passing the list returned by split() to re.sub(), which expects a str; that is what raises the TypeError. Apply re.sub() to each word instead:

import re

a = ["This is it arrrrrrrr.", "Youuuuuu Thank You! Yeeeeeeeeeeees"]
for stri in a:
    split_question_list = stri.split()
    for word in split_question_list:
        # re.sub() gets one string at a time; \w{6,} deletes any run of 6+ word characters
        split_question = re.sub(r'\w{6,}', '', word)
        print(split_question)

Output (a blank line means the whole word was removed):
This
is
it
.

Thank
You!

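Two more things to watch for. The instructions ask you to remove words shorter than 6 characters, while the pattern above deletes the words that are 6 characters or longer, so it does the reverse. Also, the square brackets in your original r'[\w{6,}]' turn it into a character class that matches a single character at a time. A plain length check is probably simpler than a regex here. Below is a minimal sketch, not the official solution: the sample questions and the helper name keep_long_words are made up for illustration, and the loop mirrors yours with the set.add() call fixed.

```python
import re


def keep_long_words(question, min_len=6):
    """Split a question on whitespace and keep words of at least min_len characters."""
    return [word for word in question.split() if len(word) >= min_len]


# Hypothetical stand-ins for the values in jeopardy['clean_question']
sample_questions = [
    "this famous painter created guernica",
    "guernica was created by this spanish painter",
]

question_overlap = []
terms_used = set()

for question in sample_questions:
    split_question = keep_long_words(question)
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
        # set.add() changes the set in place and returns None,
        # so call it on terms_used and do not reassign the result
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)

# Regex alternative: collect the long words instead of deleting the short ones
regex_words = re.findall(r'\b\w{6,}\b', sample_questions[0])

print(question_overlap)
```

In the notebook, the assignment to jeopardy['question_overlap'] and the print of the mean would most likely belong after the loop, once question_overlap has one entry per row.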

Ah, I see. Thank you.
