It’s not incorrect arithmetically.
Let’s say you received 7 Non-Spam messages. You write down the frequency of the words appearing in those messages -
| word | count |
|---|---|
| hello | 5 |
| friend | 3 |
| test | 4 |
The total number of words is $12$. So, if you wanted to calculate the probability $P(\text{"friend"} \mid \text{Non-Spam})$, you could do it simply as $\frac{3}{12}$.
If we know the message is Non-Spam, and we know how many times the word "friend" appears in Non-Spam messages, then the probability above is simply the number of times "friend" appears in those Non-Spam messages divided by the total number of words in Non-Spam messages.
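The same calculation can be sketched in a few lines of Python. This is only an illustration: the names `non_spam_counts` and `word_probability` are made up for this example, and `Fraction` is used so the result stays exact.

```python
from fractions import Fraction

# Word counts from the Non-Spam table above.
non_spam_counts = {"hello": 5, "friend": 3, "test": 4}

def word_probability(word, counts):
    """Unsmoothed estimate: count of `word` divided by the total word count."""
    total = sum(counts.values())  # 5 + 3 + 4 = 12
    return Fraction(counts[word], total)

p_friend = word_probability("friend", non_spam_counts)
print(p_friend)  # 1/4, which is 3/12 in lowest terms
```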
Now, let’s say you got 2 Spam messages.
You write down the frequency of the words appearing in those messages -
| word | count |
|---|---|
| hello | 3 |
| friend | 2 |
| test | 0 |
What would be the probability $P(\text{"test"} \mid \text{Spam})$? Similar to how we calculated $P(\text{"friend"} \mid \text{Non-Spam})$:

$$\frac{\text{number of times "test" appears in Spam messages}}{\text{total number of words in Spam messages}} = \frac{0}{5}$$

And that would be $0$. Which is a problem, as we know.
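To see why a zero is a problem, here is a small sketch. It assumes the usual Naive Bayes setup, where the per-word probabilities of a message are multiplied together; the message `["hello", "test"]` is hypothetical and only there for illustration.

```python
from fractions import Fraction

# Word counts from the Spam table above.
spam_counts = {"hello": 3, "friend": 2, "test": 0}
total = sum(spam_counts.values())  # 3 + 2 + 0 = 5

# Hypothetical message, used only for illustration.
message = ["hello", "test"]

# Multiply the unsmoothed per-word probabilities.
likelihood = Fraction(1)
for word in message:
    likelihood *= Fraction(spam_counts[word], total)

print(likelihood)  # 0
```

One zero factor wipes out the whole product, no matter how strongly the other words point to Spam.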
If we don’t want it to be a problem, we consider modifying the probability just a little bit so that it is no longer $0$.
We can consider increasing the frequency of "test" in our Spam messages data above. However, it doesn’t really seem “fair” that we are only increasing the frequency for one word. So, to keep it “fair”, we increase it for every word.
So, our new data becomes -
| word | count |
|---|---|
| hello | 3+1 |
| friend | 2+1 |
| test | 0+1 |
What’s the probability now?

$$\frac{\text{number of times "test" appears in Spam messages}}{\text{total number of words in Spam messages}} = \frac{0+1}{5+1+1+1} = \frac{1}{8}$$

That $1$ is our $\alpha$.
And you can see what happens in the denominator. It’s not just increasing by $\alpha$, but by $N\alpha$, where $N$ is the number of unique words in our data (here, $N = 3$).
That is what keeps the equation “balanced”: we don’t just randomly add one word; we uniformly adjust the whole distribution.
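As a quick check of that “balance”, using the smoothed Spam counts above with $\alpha = 1$ and $N = 3$, the three adjusted probabilities still sum to 1:

$$\frac{3+1}{5+3} + \frac{2+1}{5+3} + \frac{0+1}{5+3} = \frac{4}{8} + \frac{3}{8} + \frac{1}{8} = 1$$

If the denominator had grown by only $\alpha$ instead of $N\alpha$, the sum would be $\frac{4}{6} + \frac{3}{6} + \frac{1}{6} = \frac{8}{6}$, which is more than 1 and so not a valid probability distribution.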
It’s simple to make sense of it when we consider it as modifying the actual word count. However, that $\alpha$ doesn’t have to be just $1$. It can be a smaller or a larger value as well. That is where $N\alpha$ is even more helpful: whatever value you pick, the update is applied uniformly to every word instead of to just one.
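Here is a minimal sketch of the same idea with a configurable $\alpha$. The function name `smoothed_probability` and the particular values of `alpha` tried below are illustrative choices, not something fixed by the method, and the vocabulary is assumed to be exactly the words in the table.

```python
from fractions import Fraction

# Word counts from the Spam table above.
spam_counts = {"hello": 3, "friend": 2, "test": 0}

def smoothed_probability(word, counts, alpha=1):
    """Additive smoothing: (count + alpha) / (total + N * alpha)."""
    alpha = Fraction(alpha)
    n_unique = len(counts)        # N: number of unique words
    total = sum(counts.values())  # total words before smoothing
    return (counts[word] + alpha) / (total + n_unique * alpha)

# Try a smaller, the default, and a larger alpha (arbitrary example values).
for alpha in (Fraction(1, 2), 1, 2):
    probs = {w: smoothed_probability(w, spam_counts, alpha) for w in spam_counts}
    # The probabilities sum to 1 for every alpha because the denominator grows by N * alpha.
    print(alpha, probs["test"], sum(probs.values()))
```

With `alpha=1` this reproduces the $\frac{1}{8}$ from above; a smaller $\alpha$ stays closer to the raw counts, and a larger one pulls the estimates closer to uniform.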