I want to compare one sentence against several other sentences using the bag-of-words model. Suppose the sentence I want to compare is:
I am playing football
and there are three other sentences I want to compare it with:
1. and I am playing Cricket
2. Why do you play Cricket
3. I love playing Cricket when I am at school
Now, if I compare my sentence to the three sentences above by counting shared words, sentences 1 and 3 match the same number of words: 3 (I, am, playing). Sentence 2 shares none, because "play" and "playing" are different tokens.
So the question is: which sentence is more closely related to my sentence in this case? No semantic meaning is taken into account at all.
I have seen it suggested in some places that the simplest option in this case is to return the shortest sentence. What are your thoughts?
Hi @hefaz2010, welcome to the community!
I'm not very familiar with bag of words myself, but this article should help.
Once a vocabulary has been chosen, the occurrence of words in example documents needs to be scored.
In the worked example, we have already seen one very simple approach to scoring: a binary scoring of the presence or absence of words.
Some additional simple scoring methods include:
Counts. Count the number of times each word appears in a document.
Frequencies. Calculate the frequency that each word appears in a document out of all the words in the document.
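As a rough illustration of the three scoring methods quoted above (binary, counts, frequencies), here is a minimal Python sketch. The whitespace tokenizer, the function names, and the small vocabulary are assumptions made for this example; they are not taken from the article.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace, no stemming or punctuation handling.
    return text.lower().split()


def binary_scores(text, vocabulary):
    # 1 if the word is present in the document, 0 otherwise.
    words = set(tokenize(text))
    return {w: int(w in words) for w in vocabulary}


def count_scores(text, vocabulary):
    # Number of times each vocabulary word appears in the document.
    counts = Counter(tokenize(text))
    return {w: counts[w] for w in vocabulary}


def frequency_scores(text, vocabulary):
    # Count of each word divided by the total number of words in the document.
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens)
    return {w: (counts[w] / total if total else 0.0) for w in vocabulary}


# Hypothetical vocabulary built from the sentences in the question.
vocabulary = ["i", "am", "playing", "football", "cricket"]
doc = "I love playing Cricket when I am at school"

binary = binary_scores(doc, vocabulary)
counts = count_scores(doc, vocabulary)
freqs = frequency_scores(doc, vocabulary)
print(binary)
print(counts)
print(freqs)
```

With frequency scoring, a long sentence is penalized for words that do not match, which is one way sentence length can enter the comparison.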
I have already implemented that algorithm. In my algorithm, one sentence is compared to multiple sentences, and it returns the one with the most matched words. The problem is this: suppose two of those sentences match the same number of words in my comparing sentence. Which one should I return? Please refer to the question for the example.
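To make the tie concrete, here is a minimal sketch of one possible tie-break rule: rank by the number of shared words first, then prefer the shorter sentence. The function names and the whitespace tokenizer are assumptions for illustration, not my actual implementation. A Jaccard score (shared words divided by all distinct words in both sentences) is included as an alternative that folds length into a single number.

```python
def tokenize(text):
    # Assumption: lowercase and split on whitespace, no stemming.
    return text.lower().split()


def overlap(query, candidate):
    # Number of distinct query words that also occur in the candidate.
    return len(set(tokenize(query)) & set(tokenize(candidate)))


def jaccard(query, candidate):
    # Shared distinct words divided by all distinct words in both sentences.
    a, b = set(tokenize(query)), set(tokenize(candidate))
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_match(query, candidates):
    # Higher overlap wins; on a tie, the sentence with fewer words wins.
    # max() returns the first maximal item, so any remaining tie keeps input order.
    return max(candidates, key=lambda c: (overlap(query, c), -len(tokenize(c))))


query = "I am playing football"
candidates = [
    "and I am playing Cricket",
    "Why do you play Cricket",
    "I love playing Cricket when I am at school",
]

for c in candidates:
    print(overlap(query, c), round(jaccard(query, c), 3), c)
print(best_match(query, candidates))
```

Under these assumptions, both rules pick "and I am playing Cricket": it has the same overlap as the third sentence but fewer unmatched words.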