My answer to: Guided Project: Winning Jeopardy

Hi all!

I’ve just finished another project on my DQ learning path, and I’d like to share it with the community. Any feedback would be appreciated!

Here’s the URL of the last mission screen of the Guided Project and my notebook (.ipynb file):
Winning Jeopardy.ipynb (44.6 KB)

Have a nice day!



Hey @alvaro.viudez
Great work on this. It looks really nice to me; I liked the way you've documented your functions, and you've communicated your findings well.

  • I feel that you could’ve added some EDA or data visualisation to explore the data a bit more.

Hi @info.victoromondi

Thanks for your nice words! I also think that some graphs could have made it more complete.

However, what does “EDA” mean?

Have a nice day!

EDA stands for Exploratory Data Analysis: an approach to analyzing a dataset in order to summarize its main characteristics and surface initial insights, often with summary statistics and visualizations.
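
For this project, a quick EDA pass might look something like the sketch below (just an illustration, not part of the original notebook; I'm assuming the raw jeopardy.csv file and its usual column names, which have leading spaces until you strip them):

import pandas as pd
import matplotlib.pyplot as plt

# Assumed file and column names from the guided project's raw CSV
jeopardy = pd.read_csv("jeopardy.csv")
jeopardy.columns = jeopardy.columns.str.strip()  # the raw headers have leading spaces

print(jeopardy.shape)                    # rows and columns
print(jeopardy["Round"].value_counts())  # how many questions per round

# Distribution of question values (the 'Value' column holds strings like '$200')
values = pd.to_numeric(
    jeopardy["Value"].str.replace(r"[$,]", "", regex=True), errors="coerce"
)
values.dropna().plot(kind="hist", bins=30, title="Distribution of question values")
plt.xlabel("Value ($)")
plt.show()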


Cool, thanks for the tip! :slight_smile:

Hey @alvaro.viudez,

thanks for sharing your project! The code is nice, but I do not recommend using list.remove('the'), because it removes only the first occurrence. It is probably better to use a list comprehension, for example:

split_answer = [word for word in row['split_answer'] if word not in ('a', 'an', 'the')]
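
For context, this is roughly how it would look inside the whole matching step (a sketch only; I'm assuming the clean_question / clean_answer column names from the earlier normalization steps, and count_matches is just an illustrative name):

def count_matches(row):
    # Fraction of answer words that also appear in the question
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()

    # Filter out articles with a comprehension instead of list.remove('the'),
    # which would drop only the first occurrence
    split_answer = [word for word in split_answer if word not in ('a', 'an', 'the')]

    if len(split_answer) == 0:
        return 0

    match_count = sum(1 for word in split_answer if word in split_question)
    return match_count / len(split_answer)

jeopardy['answer_in_question'] = jeopardy.apply(count_matches, axis=1)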

However, I do not agree with your second conclusion:
“The mean proportion of questions that had been already used is 87.3 %”

This is not the mean proportion of questions that have already been used; it is the mean proportion of words (in questions) that have been used previously. Note that you split each question into a list of words and then iterate over this list, checking each word in turn. Although there is a basic filter (only words of 6 letters or more are kept), most of the remaining words are still very common ones (like ‘planet’, ‘country’, ‘person’, ‘animal’, etc.), so the resulting numbers give very little insight into how often the questions themselves are actually repeated.
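
To make that concrete, the procedure is roughly the following (a sketch only; I'm assuming the clean_question and Air Date column names from the guided project, and word_overlap is just an illustrative helper name):

import pandas as pd

def word_overlap(df, column):
    # For each row, the proportion of 6+ letter words in `column`
    # that already appeared in an earlier row
    terms_used = set()
    overlaps = []
    for _, row in df.iterrows():
        words = [w for w in row[column].split() if len(w) >= 6]
        match_count = sum(1 for w in words if w in terms_used)
        terms_used.update(words)
        overlaps.append(match_count / len(words) if words else 0)
    return pd.Series(overlaps, index=df.index)

jeopardy = jeopardy.sort_values("Air Date")  # oldest questions first
jeopardy["question_overlap"] = word_overlap(jeopardy, "clean_question")

# Note: this mean is a proportion of recycled *words*, not of recycled questions
print(jeopardy["question_overlap"].mean())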

To get more information on this issue, I performed the same procedure on the answers column. The mean was 0.37, meaning that, on average, 37% of the words in an answer had been used previously. However, this does not indicate that 37% of questions are repeated. Many of these recurring answers are geographical places, historical data or famous people, so the same answer could have been given to very different questions.
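
With the same helper pointed at the answers (again, the column name is assumed), that looks like:

jeopardy["answer_overlap"] = word_overlap(jeopardy, "clean_answer")
print(jeopardy["answer_overlap"].mean())  # roughly 0.37 on my copy of the dataset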

For instance, the answer ‘Canada’ appears 19 times in the dataset. However, if we run the following code:

jeopardy.loc[jeopardy['clean_answer'] == 'canada', 'Question']

we can check the corresponding questions, and all 19 are formulated differently. Some examples:

  1. This dominion was created by the British North America Act on July 1, 1867
  2. For collectors & investors, this country mints a maple leaf coin in gold & silver
  3. To see the tides in the Bay of Fundy at their highest, visit Minas Basin in this country

If anything, these data indicate that there are very few identical questions in the dataset, maybe even none. Therefore, I suggest being careful when interpreting results like these. I also think this part of the project is designed somewhat ambiguously and can be misleading.

P.S. For some reason, my dataset has only 19,999 rows (probably an older version), so your numbers may be different.


Hi @m.rezvukhin ,

Thank you very much for your thorough explanation of why my interpretation was wrong.

I remember being quite confused in several parts of this guided project because, as you said, I didn't clearly understand the purpose of studying how many words were repeated, since the same words can appear in very different questions.

So, yes, you're absolutely right: that percentage refers to words, not to questions (counting repeated questions would be a more reasonable analysis if we were actually preparing for the show).

Have a great day!
