Would like to hear more explanation of additive smoothing

Screen Link: https://app.dataquest.io/m/432/the-naive-bayes-algorithm/10/additive-smoothing

Question
I would like to know why we need to add α*N(vocabulary) to the denominator of P(‘the’|Spam) when we apply additive smoothing to it. What does that value mean?

What I have known so far
I understand that the purpose of additive smoothing is to avoid getting a probability of zero. I also checked some articles, such as this one: https://medium.com/analytics-vidhya/intuition-behind-naive-bayes-algorithm-laplace-additive-smoothing-e2cb43a82901. But I still have not understood it well.

Let’s take the example in the Classroom -

image

The word the appears in an SMS, and the corresponding Label for that SMS is non-spam.

What happens if the occurs in an SMS which is clearly a spam message?

If so far, we have only seen the occur in non-spam messages, then the probability of the word occurring in a spam message would be 0.

But does that make sense? Do you think the word the never occurs in any spam message?

No, right? That’s not possible. And that’s what the 4th sample in that table indicates too. If you have the message - secret code to unlock the money, you can see there is a the in that message.

By reading it, we know that it is spam. But to calculate the probability that the message is spam, we multiply together the probabilities of each of its words given Spam, including P(‘the’|Spam).

And because we saw the only in non-spam messages in our existing data, P(‘the’|Spam) is 0, so the whole product, and with it the probability that this message is spam, will be 0.

Which would be wrong. So, how do we correct for that?

We make sure that it’s not 0 by adding a small value to the word counts behind that probability. Doing so is called additive smoothing. It reflects the fact that even though the word the was not present in any spam messages in our existing data, it may still be present in spam messages in unseen data.

And if that value is not 0, then that means the probability that the message secret code to unlock the money could be a spam message is also not 0.
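To make the effect of a single zero concrete, here is a minimal sketch. All the per-word probabilities and the prior are made-up numbers for illustration (they are not taken from the lesson's table); only the zero for "the" mirrors the situation described above.

```python
from math import prod

# Hypothetical values of P(word | Spam) for the message
# "secret code to unlock the money". Only the 0.0 for "the"
# reflects the scenario above; the rest are invented.
p_word_given_spam = {
    "secret": 0.2,
    "code": 0.1,
    "to": 0.1,
    "unlock": 0.1,
    "the": 0.0,   # "the" was never seen in a spam message
    "money": 0.2,
}
p_spam = 0.5  # hypothetical prior P(Spam)

# Naive Bayes multiplies the prior by P(word | Spam) for every word,
# so a single zero factor wipes out the whole product.
score_spam = p_spam * prod(p_word_given_spam.values())
print(score_spam)  # 0.0

# Swap the zero for any small positive value and the score is positive again.
p_word_given_spam["the"] = 0.01  # hypothetical smoothed value
score_spam_smoothed = p_spam * prod(p_word_given_spam.values())
print(score_spam_smoothed > 0)  # True
```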

Go through the above, and see how it relates to the equation in that Mission. It should start becoming clear after some time, I hope.

Thank you so much for your very detailed and helpful explanation! I really appreciate your time for answering my question.

I had the very same doubt. After doing some reading, I’ve come to understand that N is the number of unique values in the distribution. Why it’s added to the denominator still remains unclear to me.

My lack of clarity stems from the fact that you are adding α to the numerator but then adding αN to the denominator which seems incorrect arithmetically, if the purpose was to balance the addition.

For example, if I had a fraction 3/5 and multiplied both the numerator and the denominator by 2, it would make sense. Here, however, you are adding to the denominator a multiple of what was added to the numerator.

I’ve resigned myself to accepting that this is how the formula has been set up and left it at that. Kind of like a^2 + b^2 = c^2 (Pythagoras’ theorem), except that one I can prove.

It’s not incorrect arithmetically.

Let’s say you received 7 Non-Spam messages. You write down the frequency of the words appearing in those messages -

| word | count |
|---|---|
| hello | 5 |
| friend | 3 |
| test | 4 |

The total number of words is 12. So, if you wanted to calculate the probability P("friend" | Non-Spam), you could do it simply as \frac{3}{12}.

If we know the message is Non-Spam, and the number of times the word friend appears in Non-Spam messages, then the probability above is simply the number of times the word friend appears in those Non-Spam messages divided by the total number of words in Non-Spam messages.
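As a quick check of that arithmetic, here is a small sketch using the Non-Spam counts from the table above. The variable names are just for illustration.

```python
# Word counts in the Non-Spam messages, copied from the table above.
non_spam_counts = {"hello": 5, "friend": 3, "test": 4}

# Total number of words in Non-Spam messages.
total_non_spam = sum(non_spam_counts.values())

# P("friend" | Non-Spam) = count of "friend" / total words in Non-Spam.
p_friend_non_spam = non_spam_counts["friend"] / total_non_spam

print(total_non_spam)      # 12
print(p_friend_non_spam)   # 0.25
```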

Now, let’s say you got 2 Spam messages.

You write down the frequency of the words appearing in those messages -

| word | count |
|---|---|
| hello | 3 |
| friend | 2 |
| test | 0 |

What would be the probability P("test" | Spam)? We calculate it the same way as P("friend" | Non-Spam):

\frac{\text{number of times "test" appears in Spam messages}}{\text{total number of words in Spam messages}} = \frac{0}{5}

And that would be 0. Which is a problem, as we know.
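Here is a short sketch of why that zero is a problem, using the Spam counts from the table above. The two-word message "hello test" is a hypothetical example, and the prior is left out to keep the focus on the word probabilities.

```python
# Word counts in the Spam messages, copied from the table above.
spam_counts = {"hello": 3, "friend": 2, "test": 0}
total_spam = sum(spam_counts.values())  # 5

# Unsmoothed estimate: count / total.
p_test_spam = spam_counts["test"] / total_spam
p_hello_spam = spam_counts["hello"] / total_spam

print(p_test_spam)  # 0.0

# For a hypothetical message "hello test", the product of the word
# probabilities is zero no matter how likely "hello" is in Spam.
product = p_hello_spam * p_test_spam
print(product)  # 0.0
```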

If we don’t want it to be a problem, we consider modifying the probability just a little bit so that it is no longer 0.

We can consider increasing the frequency of test in our Spam messages data above. However, it doesn’t really seem “fair” that we are only increasing the frequency for one word. So, to keep it “fair” we increase it for every word.

So, our new data becomes -

| word | count |
|---|---|
| hello | 3+1 |
| friend | 2+1 |
| test | 0+1 |

What’s the probability now?

\frac{\text{number of times "test" appears in Spam messages}}{\text{total number of words in Spam messages}} = \frac{0+1}{5+1+1+1}

That 1 is our \alpha.

And you can see what happens in the denominator. It’s not just increasing by \alpha, but by N\alpha where N is the number of unique words in our data.
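The same calculation can be sketched as a small function. The name `smoothed_probability` is hypothetical, and the vocabulary is assumed to be just the three words in this toy table; in a real dataset N would normally count the unique words across all messages.

```python
def smoothed_probability(word, counts, alpha=1):
    """Additive smoothing: (count + alpha) / (total + alpha * N)."""
    n_vocab = len(counts)         # N: number of unique words (toy vocabulary)
    total = sum(counts.values())  # total number of words in this class
    return (counts[word] + alpha) / (total + alpha * n_vocab)

# Word counts in the Spam messages, copied from the table above.
spam_counts = {"hello": 3, "friend": 2, "test": 0}

p_test_spam = smoothed_probability("test", spam_counts)
print(p_test_spam)  # 0.125, i.e. (0 + 1) / (5 + 1 + 1 + 1)
```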

Because that’s what helps “balance” the equation - we don’t just randomly add one word; we uniformly adjust the distribution.
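One way to see the "balance" is to check that the smoothed probabilities still add up to 1. The symbols below are my own notation, not the lesson's: c_w is the count of word w in Spam, T is the total number of words in Spam, and N is the number of unique words.

```latex
\sum_{w} \frac{c_w + \alpha}{T + N\alpha}
  = \frac{\left(\sum_{w} c_w\right) + N\alpha}{T + N\alpha}
  = \frac{T + N\alpha}{T + N\alpha}
  = 1
```

If only \alpha were added to the denominator, the same sum would be \frac{T + N\alpha}{T + \alpha}, which is greater than 1 whenever N > 1, so the values would no longer form a valid probability distribution.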

It’s simple to make sense of it when we consider it as modifying the actual word count. However, that \alpha doesn’t have to be just 1. It can be a smaller or a larger value as well. Whatever value you pick, adding N\alpha to the denominator applies the update uniformly to all N words instead of just one, so the smoothed probabilities still add up to 1.
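To see how the choice of \alpha plays out, here is a rough sketch that applies a few values to the Spam counts above. The particular values 0.1, 1 and 10 are arbitrary picks for illustration, and `smooth` is a hypothetical helper name.

```python
# Word counts in the Spam messages, copied from the table above.
spam_counts = {"hello": 3, "friend": 2, "test": 0}

def smooth(counts, alpha):
    """Return the smoothed P(word | class) for every word in counts."""
    n_vocab = len(counts)
    total = sum(counts.values())
    return {w: (c + alpha) / (total + alpha * n_vocab) for w, c in counts.items()}

# Try a small, a medium and a large alpha (arbitrary illustrative values).
results = {alpha: smooth(spam_counts, alpha) for alpha in (0.1, 1, 10)}

for alpha, probs in results.items():
    rounded = {w: round(p, 3) for w, p in probs.items()}
    # The last column is the sum of the probabilities for this alpha.
    print(alpha, rounded, round(sum(probs.values()), 6))
```

A larger \alpha pulls every word closer to the uniform value 1/N (here 1/3), while a smaller \alpha stays closer to the raw counts. In each case the probabilities sum to 1.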

@the_doctor

Thank you for the explanation! It is clear now. The references in the lesson do not have the clarity you offer.

Cheers!!

The example is really good as I was also not getting the reason why there was addition in the denominator. Thanks.

Finally I get it, thanks!