In this session, we are trying to answer the question "How often is the answer deducible from the question?" For this, we create a function that counts the number of words in the answer that are repeated from the corresponding question, and finally we calculate the mean of the repeated words. When I compared my value with the DQ solution, I saw a small difference. The mean I got is 0.05900196524977763, whereas the solution gives 0.060493257069335872. When I analysed this, I found that the reason is the way I split the question and answer columns.
The code I used to achieve this is given below:

 split_answer = row['Clean_Answer'].split()
 split_question = row['Clean_Question'].split()

The DQ solution, however, passes a space as the argument inside the parentheses, as shown below. Why is it done this way? Won’t this create additional list items?

 split_answer = row['Clean_Answer'].split(' ')
 split_question = row['Clean_Question'].split(' ')

Before this step, we clean the question and answer columns using regex:

import re

text = 'Dr. Benjamin Spock (\" Baby and Child Care \")'
text = text.lower()
# Remove every character that is not a lowercase letter, digit, or whitespace
text = re.sub(r'[^a-z0-9\s]', '', text)

Output : 'dr benjamin spock  baby and child care '

As you can see, there is an additional space before ‘baby’ and after ‘care’. Now we use the split method to split the text on spaces. Please note the differences in the output of the two split() approaches below:

Method 1 : without a space in the parentheses
words = text.split()
Output : ['dr', 'benjamin', 'spock', 'baby', 'and', 'child', 'care']

Method 2 : with a space in the parentheses
words = text.split(' ')
Output : ['dr', 'benjamin', 'spock', '', 'baby', 'and', 'child', 'care', '']

So the empty strings are counted when taking the length of the lists, which affects the final mean. Please provide your insight on this.
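
For reference, here is a minimal sketch of the effect I am describing. The helper match_ratio and the sample question/answer pair are made up for illustration; it only approximates the counting logic and is not the actual DQ solution code (for example, it skips the step that removes ‘the’).

```python
def match_ratio(question, answer, sep=None):
    """Fraction of answer tokens that also appear in the question.

    sep=None behaves like split(); sep=' ' behaves like split(' ').
    This is a hypothetical helper, not the DQ solution's function.
    """
    split_answer = answer.split(sep)
    split_question = question.split(sep)
    if len(split_answer) == 0:
        return 0
    match_count = sum(1 for word in split_answer if word in split_question)
    return match_count / len(split_answer)


# Made-up cleaned strings with a double space and a trailing space
question = 'this author wrote  baby and child care '
answer = 'dr benjamin spock  baby and child care '

ratio_default = match_ratio(question, answer)         # uses split()
ratio_space = match_ratio(question, answer, sep=' ')  # uses split(' ')
print(ratio_default, ratio_space)
```

With this made-up pair, split() gives 4/7 (about 0.57) and split(' ') gives 6/9 (about 0.67), because the empty strings are counted both as tokens and as matches. Whether the mean goes up or down on the real data depends on where the extra spaces occur.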


I also observed this. There are many errors in this project, and the code standard is not up to the mark of DQ compared with the rest of their Guided Projects. Maybe an individual coder is responsible here and not the whole DQ community.

In the same solution, where the ‘the’ keyword is to be removed from ‘clean_answer’, they used the remove() function, but it removes only the first occurrence and not all of them, which again affects the end result.
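
A small sketch of the difference, using a made-up token list rather than the project data. The list comprehension is just one possible way to drop every occurrence, not what the solution does.

```python
split_answer = ['the', 'man', 'in', 'the', 'moon']  # made-up example

# list.remove() deletes only the first matching element
first_only = list(split_answer)
if 'the' in first_only:
    first_only.remove('the')

# A list comprehension drops every occurrence
all_removed = [word for word in split_answer if word != 'the']

print(first_only)   # ['man', 'in', 'the', 'moon']
print(all_removed)  # ['man', 'in', 'moon']
```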


Yes, you are right. I am also pretty disappointed with this guided project.

I also did not initially understand the logic of removing only the first occurrence of the word ‘the’. Many answers start with ‘the’, such as ‘The Verdict’, so I assume their idea was to remove the leading ‘the’ from such answers, since it adds little to the match.

