Screen Link:
https://app.dataquest.io/m/210/guided-project%3A-winning-jeopardy/4/answers-in-questions
On this screen, we are trying to answer the question "How often is the answer deducible from the question?" For this, we create a function that counts the number of words in the answer that are repeated from the corresponding question, and finally we take the mean over all rows. When I compared my value with the DQ solution, I saw a small difference. The mean I got is 0.05900196524977763, whereas the one in the solution is 0.060493257069335872. When I tried to analyse this, I found that the reason is the way I split the question and answer columns.
The code I used to achieve this is given below:
split_answer = row['Clean_Answer'].split()
split_question = row['Clean_Question'].split()
The DQ solution, on the other hand, passes a space inside the parentheses, as shown below. Why is this done? Won’t this create additional list items?
split_answer = row['Clean_Answer'].split(' ')
split_question = row['Clean_Question'].split(' ')
Before this step, we clean the question and answer columns using a regex:
import re
text = 'Dr. Benjamin Spock (\" Baby and Child Care \")'
text = text.lower()
text = re.sub(r'[^a-z0-9\s]', '', text)
print(text)
Output : 'dr benjamin spock  baby and child care '
As you can see, there is an additional space before ‘baby’ and after ‘care’. Now we use the split method to split the text on spaces. Please note the difference in output between the two split() approaches below:
Method 1 : without a space in the parentheses
text.split()
Output : ['dr', 'benjamin', 'spock', 'baby', 'and', 'child', 'care']
Method 2 : with a space in the parentheses
text.split(' ')
Output : ['dr', 'benjamin', 'spock', '', 'baby', 'and', 'child', 'care', '']
So the empty strings would be counted when taking the length of the lists, which affects the final mean. Please share your insight on this.
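To make the effect concrete, here is a minimal sketch of how the two split styles change the per-row value. It assumes a simplified counting function; the names `clean` and `match_ratio` and the sample question are my own and hypothetical, and this is not the DQ solution's code.

```python
import re

def clean(text):
    # Lowercase, then drop everything except letters, digits and whitespace
    return re.sub(r"[^a-z0-9\s]", "", text.lower())

def match_ratio(answer, question, sep=None):
    # sep=None behaves like .split(); sep=" " behaves like .split(" ")
    split_answer = answer.split(sep)
    split_question = question.split(sep)
    if len(split_answer) == 0:
        return 0
    # Count answer items that also appear in the question
    matches = sum(1 for word in split_answer if word in split_question)
    return matches / len(split_answer)

answer = clean('Dr. Benjamin Spock (" Baby and Child Care ")')
question = clean("Who wrote Baby and Child Care?")  # hypothetical question

# .split(): 4 of 7 answer words match
print(match_ratio(answer, question))
# .split(" "): the same 4 matches, but divided by 9 items (two are '')
print(match_ratio(answer, question, " "))

# If both strings contain a double space, '' itself counts as a match
print(match_ratio("a  b", "c  d"))       # no real word in common
print(match_ratio("a  b", "c  d", " "))  # '' is found in both lists
```

In this sketch the empty strings can push the value in either direction: they enlarge the denominator when only the answer has them, and they add a spurious match when both lists contain one.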