Naive Bayes Additive Smoothing - Why Not Use All Words?

Hi all, in lesson 432-10, why aren’t all of the words used? “code”, “to”, and “unlock” aren’t used in the calculation. We just went through learning how to handle words that aren’t in the vocabulary, only to end up using just the words that already are. I don’t see anything in the text of the lesson that discusses omitting words.

OK, I see on the previous screen where it says to skip them, but I still don’t fully understand why. Is that just the technique presented here, and are there other approaches where those words would be considered?
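
For context, the additive smoothing formula as I understand it (written in my own notation, so it may not match the lesson exactly) is:

$$
P(w_i \mid \mathrm{Spam}) = \frac{N_{w_i \mid \mathrm{Spam}} + \alpha}{N_{\mathrm{Spam}} + \alpha \cdot N_{\mathrm{Vocabulary}}}
$$

so even a vocabulary word with zero counts in the spam messages still gets a small non-zero probability, which is why I expected something similar to apply to these new words.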

And I’m curious (maybe this will be covered in the project): will those words be added to the vocabulary after this message is processed?

EDIT:
Rereading the lesson a little, I see that the new message is unlabeled, and the words “secret”, “code”, and “to” are not found in the other four messages that have human labels. The goal is to use the human labels to predict the label of the unclassified message, and since these words have not appeared in any previously classified messages, we don’t know how they affect the spam or ham prediction.
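
In case it helps anyone else reading this later, here’s a rough sketch of how I now picture it, with made-up toy messages (and with the class priors left out to keep it short); out-of-vocabulary words simply contribute nothing to the score:

```python
# Toy sketch: classify with Laplace (additive) smoothing, scoring only words
# that already appear in the labeled training messages.

labeled = {
    "spam": ["win money now", "claim your prize now"],
    "ham": ["are we still meeting", "see you at lunch"],
}
alpha = 1  # additive smoothing parameter

# Build per-label word counts and the shared vocabulary.
counts = {label: {} for label in labeled}
vocabulary = set()
for label, messages in labeled.items():
    for message in messages:
        for word in message.split():
            counts[label][word] = counts[label].get(word, 0) + 1
            vocabulary.add(word)

def score(message, label):
    """Product of smoothed P(word | label) over the words we can actually score."""
    total_words = sum(counts[label].values())
    prob = 1.0
    for word in message.split():
        if word not in vocabulary:
            # No labeled evidence for this word at all, so it is skipped,
            # just like the lesson says.
            continue
        word_count = counts[label].get(word, 0)
        prob *= (word_count + alpha) / (total_words + alpha * len(vocabulary))
    return prob

new_message = "claim your secret prize"  # "secret" is out of vocabulary
print(score(new_message, "spam"), score(new_message, "ham"))
```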

Hi Patrick,

It’s been a while since I’ve worked on Naive Bayes, but from what I remember these words, commonly known as ‘stopwords’, add no contextual information. Stopwords are typically words used in English for syntactic purposes, and since Naive Bayes is a ‘bag-of-words’ approach where syntax, word ordering, and sentence structure do not matter, simply removing them can reduce the noise in the prediction.

That’s not always the case, though: for projects like sentiment analysis, certain stopwords are necessary. So it needs to be considered and tested on a case-by-case basis, as in the sketch below.
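
As a rough illustration of what I mean (with a tiny made-up stopword list, not from any particular library):

```python
# Toy example: drop stopwords before building the bag-of-words counts.
stopwords = {"to", "the", "a", "an", "and", "of", "in"}  # tiny made-up list

message = "secret code to unlock the prize"
tokens = [word for word in message.lower().split() if word not in stopwords]
print(tokens)  # ['secret', 'code', 'unlock', 'prize']
```

In practice, libraries like NLTK or scikit-learn ship their own English stopword lists, but the idea is the same: the filtering happens before the word counts that Naive Bayes uses are built.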

In the word choices you highlighted, the only stopword would be “to”. I’m surprised they’re telling you to remove “code” and “unlock”, so I would go back and read carefully whether that’s really what they’re asking or whether there is some other learning purpose to doing so.