Unclear explanation of counting words in Naive-bayes-algorithm/6/multiple-words

Screen Link: Learn data science with Python and R projects

The explanation of how to calculate P(Wx|Spam) isn’t clear.

P(Spam) makes sense: 2 spam messages / 4 total messages

But the P(Wx | Spam) is not clearly explained.
In the two spam messages, there are a total of 7 words.

  • P(W1) - "the first word is ‘secret’ and ‘secret’ occurs 4 times", so P(W1) = 4/7

According to the explanation, similar reasoning says:

  • P(W2) = 1/7
  • P(W3) = 4/7

How are these numbers determined?

  • “secret” appears as W1 only once
  • “secret” appears as W2 only once
  • “secret” appears as W3 and as W4 once each, too.

So either all the values should be 1/7 (if position is the critical factor)
or they should all be 4/7 (following the logic of the “reasoning”).

Why is P(W2) different than the others?


Hi @adamlporter,

Could you please share your feedback with the Content & Product teams of Dataquest? Just click the ? button in the upper-right corner of any screen of the Dataquest learning platform, select Share Feedback, fill in the form, and send it. Thanks!

Hey, Adam.

Note the following:

There are four words in the message “secret place secret secret”, and we’re going to abbreviate them “w1”, “w2”, “w3” and “w4” (the “w” comes from “word”).

This is just to make it easier to read. The screen could read, instead, like this:

$$P(\text{Spam}\, |\, \text{secret}, \text{place}, \text{secret} , \text{secret}) \propto P(\text{Spam}) \cdot P(\text{secret}\,|\,\text{Spam}) \cdot P(\text{place}\,|\,\text{Spam}) \cdot P(\text{secret}\,|\,\text{Spam}) \cdot P(\text{secret}\,|\,\text{Spam})$$

. . . and so on.

This way, there is no reference to positioning. Does this help?
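
To make the counting concrete, here is a minimal Python sketch. The two spam messages below are hypothetical: they are not taken from the screen, only chosen so that they match the numbers quoted in this thread (7 words in total across the spam messages, “secret” occurring 4 times, “place” once). The helper name `p_word_given_spam` is also just for illustration, and no smoothing is applied, since the 4/7 and 1/7 values above are plain counts.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical spam messages (an assumption), picked to match the thread's counts:
# 7 words in total, "secret" appears 4 times, "place" appears once.
spam_messages = ["secret money secret secret", "money secret place"]

# Pool every word from every spam message; position within a message is ignored.
spam_words = [word for msg in spam_messages for word in msg.split()]
counts = Counter(spam_words)
n_spam_words = len(spam_words)


def p_word_given_spam(word):
    """P(word | Spam) = occurrences of the word in spam / total words in spam."""
    return Fraction(counts[word], n_spam_words)


new_message = "secret place secret secret"
p_spam = Fraction(2, 4)  # 2 spam messages out of 4 total, as stated in the question

# Multiply P(Spam) by P(word | Spam) for each word of the new message.
score = p_spam
for word in new_message.split():
    score *= p_word_given_spam(word)

print(p_word_given_spam("secret"))  # 4/7
print(p_word_given_spam("place"))   # 1/7
print(score)                        # 32/2401
```

The value depends only on which word it is, not on where it sits in the new message: w1, w3 and w4 are all “secret”, so each gets 4/7, while w2 is “place”, which gets 1/7.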