Tags:210-4 Finding how often an answer occurs in the question

Screen Link: https://app.dataquest.io/m/210/guided-project%3A-winning-jeopardy/4/answers-in-questions
https://github.com/dataquestio/solutions/blob/master/Mission210Solution.ipynb

The solution notebook uses the following code to find how often an answer occurs in the question:

jeopardy["answer_in_question"].mean()

Which gives 0.05900196524977763

and the following explanation is given:

The answer only appears in the question about 6% of the time. This isn’t a huge number, and means that we probably can’t just hope that hearing a question will enable us to figure out the answer. We’ll probably have to study.

However, when I calculate the following:

answer_in_question_no = len(jeopardy[jeopardy["answer_in_question"] > 0])
total_questions = len(jeopardy)
answer_in_question_pct = answer_in_question_no / total_questions

I get 0.12620631031551577

which is about double the mean. Shouldn’t the percent of the time be 12%?

Thank you for your time!


This can be tricky to understand. So, let’s use numbers for a sample problem.

Let’s say you are given a set of lists -

A        B
[1, 2]   [1, 2, 3, 4]
[6]      [1, 2, 4]
[1, 4]   [1, 2, 3]

You are trying to find the proportion of each list in A that appears in the corresponding list in B. This is the same as the count_matches function in the task.

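For the first row, both numbers in A appear in B. So, our proportion is 2/2 = 1

For the second row, the only number in A does not appear in B. So, the proportion is 0/1 = 0

For the third row, one of the two numbers in A appears in B. So, the proportion is 1/2 = 0.5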

Take some time to make sure you understand how the above values came to be. It’s essentially the two steps below -

  • Loop through each item in split_answer , and see if it occurs in split_question . If it does, add 1 to match_count .
  • Divide match_count by the length of split_answer , and return the result.
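
To make the two steps concrete, here is a minimal sketch of them applied to the toy rows above. The two-argument signature and the empty-answer guard are my own assumptions for illustration; the project's actual count_matches may be written differently.

```python
def count_matches(split_answer, split_question):
    """Return the share of items in split_answer that also occur in split_question."""
    # Assumption: treat an empty answer as "no match" to avoid dividing by zero.
    if len(split_answer) == 0:
        return 0.0
    match_count = 0
    # Step 1: count the answer items that occur in the question.
    for item in split_answer:
        if item in split_question:
            match_count += 1
    # Step 2: divide by the length of the answer.
    return match_count / len(split_answer)


# The three toy rows (A, B) from this post.
rows = [
    ([1, 2], [1, 2, 3, 4]),
    ([6], [1, 2, 4]),
    ([1, 4], [1, 2, 3]),
]
proportions = [count_matches(a, b) for a, b in rows]
print(proportions)  # [1.0, 0.0, 0.5]
```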

So, we have our proportions. For the third row, based on our 0.5 proportion we can say that 50% of A occurs in B (for just that row).

On average, how much of the values in A occur in B (considering all the rows)?

That would be taking the average of the proportions. So,

(1 + 0 + 0.5)/3 = 0.5

On average, we can say that 50% of a list in A occurs in B. This is essentially what they refer to with the 6% value. On average, 6% of the answer is present in the question.

Now, coming to your approach. Our proportions are -

1, 0, 0.5

Total number of values from above that are not 0 = 2

Total number of values = 3

Proportion of values that are not 0 = 2/3

Do you notice the difference?

You are calculating the proportion of rows in which at least part of A is present in B. That is, how often at least one word of an answer is present in its question.

What is required is calculating the average of how much of A is in B. That is, on average, how much of an answer is in the question. And this is what’s important in the context of the project. It helps answer -

How often the answer is deducible from the question.

We can say from our numerical data that 2 lists in A out of 3 are present in B. But that doesn’t help us answer how much of those lists in A (that is what percentage of values in the lists in A) are present in B.
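
As a quick check, the two calculations can be put side by side on the three toy proportions. This is only a sketch using the standard library; the variable names are made up for illustration.

```python
from statistics import mean

# Per-row proportions from the toy example: how much of A is in B.
proportions = [1.0, 0.0, 0.5]

# The solution notebook's approach: average of the proportions.
mean_proportion = mean(proportions)

# The approach from the original question: share of rows with any match at all.
share_nonzero = sum(p > 0 for p in proportions) / len(proportions)

print(mean_proportion)  # 0.5
print(round(share_nonzero, 4))  # 0.6667
```

The same rows give two different numbers because the second calculation counts a partial match (0.5) as a full hit.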


Hi the_doctor,

Thank you very much for your reply and I agree with you. However, the question is: How often an answer occurs in the question? which to me means the approach I used is correct. If the question was, as you said How much of the answer is in the question? then the approach used in the solution notebook would be correct.

Am I still missing something?


Hi @jcamilleri91,

The question, How often an answer occurs in the question?, is essentially asking you to find out how much of the answer you can expect to be mentioned in the question on average. In other words, how much of an answer (in terms of words) is in its question, that is, what share of the words in an answer are used in its corresponding question. This is why in @the_doctor’s example:

For the first row, the proportion was 1 because all the elements (in our case words) of A (in our case answer) can be found in B (our question).

For the second row, we have 0 because no element of A was found in B and so on. And the average of these proportions would give us the answer we seek, how often the answer occurs in the question.

Your approach answers the question: How many questions have their answers mentioned within them? which tells us how often to expect the answer to be given away in the question.

I hope this helps.


Hmm… I think you might be right about this. I will have to think about this a bit more to be sure.

The question in the content is -

How often the answer is deducible from the question.

The how often part does indeed relate more to the frequency of the occurrence and not what percentage of an answer is deducible from the question on average.

I will think about this more.

In the meantime, @Sahil, could you (or maybe Alex/Bruno as per the solution commits) please provide an official clarification on this? I think that would be helpful here.


It is clarified in the learning section:

You can answer the first question by seeing how many times words in the answer also occur in the question. We’ll work on the first question now, and come back to the second.

But perhaps, it can be rephrased so that it becomes more apparent. I will get it logged for review.

Best,
Sahil


@jcamilleri91 Great question and great answers from @the_doctor as always.

I had the same question initially, and after reading the comments, my new question is: if we put the guides from this project aside, what would realistically help more in our goal of winning Jeopardy? Knowing the proportion of questions in which words from the answer also appear in the question (essentially a True or False question per row), or knowing the proportion of words in an answer that appear in the question?

Personally, I tend to pick the first one, based on the fact that the answers in Jeopardy are usually short, with one or two words. I calculated the mean of answer_in_question only for rows where the answer contains words from the question:

In: jeopardy.loc[jeopardy.answer_in_question>0, 'answer_in_question'].mean()
Out: 0.4675040820246842

We can see that the average is close to 50%. So when the words in answers are in the questions, you are almost given half of the answer… Of course, there’s also if you are aware that the answer lies in the question in the first place, but that’s not what we are talking about here.
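
For what it's worth, the three numbers quoted in this thread seem to fit together: since every proportion is zero or positive, the overall mean should equal the share of non-zero rows times the mean of the non-zero rows. A small sketch to check that, using only the figures posted above (the variable names are mine):

```python
import math

# Figures quoted earlier in this thread.
share_nonzero = 0.12620631031551577      # rows with answer_in_question > 0
mean_when_nonzero = 0.4675040820246842   # mean over those rows only
overall_mean = 0.05900196524977763       # mean over all rows

# The zeros add nothing to the sum, so the overall mean should be the product.
product = share_nonzero * mean_when_nonzero

print(math.isclose(product, overall_mean, rel_tol=1e-9))  # True
```

So the 6% is roughly "12.6% of questions contain any of the answer" times "about 47% of the answer, when they do".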

So realistically speaking, I feel like the average word proportion across all questions is kinda misleading in a real situation because it’s diluted by all the zeros. If I were doing this project on my own, I’d probably do what @jcamilleri91 did.

I’m still learning and have come to realize that decisions like this are the real challenges for me in projects. Statistics is so tricky… I’d love to hear more thoughts and opinions.

Cheers!


✅ We have fixed this issue by changing the following text in the solution page:

- "The answer only appears in the question about 6% of the time.
+ "On average, the answer only makes up for about 6% of the question.
