Here is my new guided project, on Winning Jeopardy.
The results were quite curious: practically everything I tried (both guided and optional steps) clearly proved useless for preparing for this quiz in advance.
Anyway, I really enjoyed all my attempts! It was an interesting experiment to combine many different stop word lists into one (from the beginning I rejected the simplified idea of considering only words longer than 6 letters, and indeed obtained a higher percentage of question overlap). Also, considering the 50 most common words instead of just 10 random ones revealed no “high value words”. Studying the old questions can be a good idea for further investigation. Don’t forget, though, that there are too many of them. And, of course, the approach of reading some materials in advance according to the most popular categories shed no light on the situation: there are too many categories, even considering the possibility of merging some of them.
All in all, no result is still a result: there is no proven strategy to successfully prepare for Jeopardy.
Any feedback from you would be much appreciated. In particular, I’d like to know what can be improved or optimized in the code, and whether there are any flaws in my approaches or way of thinking. Also, if you notice any typos or other issues, just let me know.
Many thanks in advance!
Winning_Jeopardy_Successful_Strategies.ipynb (496.4 KB)
I just finished the guided section of this project myself and am looking at completed ones before continuing with my own investigation. Thank you very much for sharing your hard work and enabling me to learn from a top-notch contributor!
I really like how clearly your project reads. Your Normalizing Columns section struck me in particular as exceptionally clean and orderly. I appreciate the formatting/structuring/writing tips you’ve given me already and I continue to aspire to your level!
You mentioned you are interested in alternate coding suggestions, so I wanted to share a single line of code I figured out for removing the white space from the column headers. It takes advantage of the fact that you can pass a function/method to the mapper parameter of df.rename().
jeopardy.rename(axis = 1, mapper=str.strip, inplace=True)
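In case it helps anyone reading along, here is a minimal sketch of that one-liner on a toy DataFrame. The padded column names and the single row are assumptions made up for illustration, not the real data, and I use the non-inplace form here, which should give the same result.

```python
import pandas as pd

# Toy frame with padded headers (hypothetical stand-in for the Jeopardy data).
jeopardy = pd.DataFrame(
    [[4680, "2004-12-31", "HISTORY"]],
    columns=["Show Number", " Air Date", " Category"],
)

# rename() calls str.strip on every column label because axis=1.
jeopardy = jeopardy.rename(mapper=str.strip, axis=1)

print(list(jeopardy.columns))
```

Since the mapper is just a callable applied to each label, any other string method (str.lower, for example) could be passed the same way.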
Observation on a Statistic
I also noticed when looking at your Answers in Questions section that I came to the same conclusion but with a different value for the statistic.
I believe the difference is because my count_matches function returns match_count (# matching words) while yours returns match_count/len(split_answer) (% matching words).
Calculating the mean of the % of matching words I believe results in you finding the mean percentage of words in the answer that appear in the question. In other words, that on average 6% of the words in the answer appear in the question, which is not the same as 6% of the data having answers that appear in the question.
By returning the match_count instead I was able to create a frequency table to show that 87% of the rows match 0 words from the answer in the question, meaning that actually 13% of answers contain at least one word repeated from the question.
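To make the difference between the two statistics concrete, here is a rough sketch in plain Python. The four answer/question pairs are invented for illustration, and this count_matches is a simplified stand-in that returns both the match count and the answer length, not the exact function from either notebook.

```python
from collections import Counter

# Hypothetical (answer, question) pairs, already lowercased and stripped of punctuation.
rows = [
    ("copernicus", "this astronomer put the sun at the center"),
    ("new york city", "the big apple is a nickname for this city"),
    ("jim thorpe", "he won olympic gold in 1912"),
    ("mount everest", "this mount is the highest on earth"),
]

def count_matches(answer, question):
    """Return (number of answer words found in the question, number of answer words)."""
    answer_words = answer.split()
    question_words = set(question.split())
    matches = sum(word in question_words for word in answer_words)
    return matches, len(answer_words)

counts = []     # statistic B input: raw match count per row
fractions = []  # statistic A input: fraction of answer words matched per row
for answer, question in rows:
    matches, total = count_matches(answer, question)
    counts.append(matches)
    fractions.append(matches / total if total else 0.0)

# Statistic A: mean fraction of answer words that appear in the question.
mean_fraction = sum(fractions) / len(fractions)

# Statistic B: share of rows for each match count (a frequency table).
freq_table = {k: v / len(counts) for k, v in sorted(Counter(counts).items())}
share_with_match = 1 - freq_table.get(0, 0.0)

print(mean_fraction, freq_table, share_with_match)
```

On this toy data the mean fraction is 5/24 (about 21%), while half of the rows have at least one matching word, so the two numbers answer different questions even though they come from the same matches. With a pandas Series of counts, value_counts(normalize=True) would give the same kind of frequency table.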
I hope you don’t mind me probing whether this calculation and phrasing completely match. It’s not that important, in the sense that the conclusion is the same! But I am now pondering which statistic best suits the specific goal. In order to disprove the hypothesis, is it more effective to show that on average only 6% of words in the answer match a word in the question, or that 11% of rows match just 1 word between the answer and question and only 2% match two or more? I’m not really sure.
I will do my best to finish reviewing your project and sharing mine soon!
Thanks a lot for your detailed and thorough feedback and encouraging words, much appreciated! I’m happy that my work was helpful and gave you some good ideas for your own project. And of course, many thanks for suggesting the alternative code for removing white space; it definitely makes the code more concise and elegant. I had 5 lines there plus 1 empty line, and now I’ll have just one.
About answers in questions: yes, in my case, for each row I first calculated the fraction of words from the answer that occurred in the corresponding question (and returned this value from the function), then found the mean of all these values and expressed it in %. So yes, the final result (i.e. 6) is an average percentage. And yes, it’s not the same as the share of the data having answers that appear in the question, which was your approach.
Anyway, our approaches are different (and, not surprisingly, we obtained different numbers), but both seem to have good reasoning behind them, and both show that in very few cases the words from a question are repeated in the answer.
Very good observation @kwu, and great attention to detail and to searching for the truth behind the data. I’m also this kind of person: digging deeper into details and looking for any kind of hidden insights.
Thanks a lot again for your time and valuable feedback!
I’m glad you found that alternate line of code kind of cool too.
Excuse me for focusing on a relatively trivial discrepancy in approaches. It’s interesting (and almost impossible to resist!) going into every detail but obviously not always worth spending lots of time on, depending on the scenario.
Managing my time and priorities better will definitely help with one of my goals: submitting more projects for review!!