I'm having trouble putting some parameters in the naive Bayes algorithm into understandable terms

https://app.dataquest.io/m/432/the-naive-bayes-algorithm/3/using-bayes-theorem

I was wondering if anyone could help me put this into layman’s terms. I understand the trick on how to apply the calculations to reach the desired result but I do not exactly get what the 3 “new_message” parameters mean (see below).

I understand that the probability of a new message being spam or non spam both equal 0.5. What I do not understand is what the p_new_message means.

```
p_spam = 0.5
p_non_spam = 0.5
p_new_message = 0.5417
p_new_message_given_spam = 0.75
p_new_message_given_non_spam = 0.3334
```

My unconfirmed assumptions are:

p_spam = P of any new message being spam
p_non_spam = P of any new message not being spam
p_new_message = P of a new specific message being spam based on its contents
p_new_message_given_spam = P of this message having been correctly assigned to spam
p_new_message_given_non_spam = P of this message having been correctly assigned to non spam

Sorry I’ve been posting many questions lately…

thank you in advance!


No, that would be incorrect. P(x) is not a conditional probability. It is just the probability of x.

In this case, x is New Message. So, P(New Message) is just the probability that a new message has arrived.


p_spam: the probability that a message is spam. In this case, half of the messages received are spam messages.

p_non_spam: the probability that a message is not spam. p_non_spam = 1 - p_spam, because the two outcomes are mutually exclusive and exhaustive.

p_new_message: the probability of getting a new message. It equals p_non_spam * p_new_message_given_non_spam + p_spam * p_new_message_given_spam.

p_new_message_given_spam: the conditional probability of the new message, given that the message is spam.

p_new_message_given_non_spam: the conditional probability of the new message, given that the message is non-spam.

If I remember correctly, p_spam and p_non_spam are called prior probabilities. These are the probabilities that you have before your experiment.
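As a quick check on the formula for p_new_message, here is a minimal sketch that plugs in the values from the question. The variable names are the ones from the exercise; rounding to four decimals is my own choice.

```python
# Values copied from the question above.
p_spam = 0.5
p_non_spam = 0.5
p_new_message_given_spam = 0.75
p_new_message_given_non_spam = 0.3334

# Law of total probability: add up the two ways the message can occur,
# each weighted by the prior of its class.
p_new_message = (p_spam * p_new_message_given_spam
                 + p_non_spam * p_new_message_given_non_spam)

print(round(p_new_message, 4))  # 0.5417
```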


Thank you for your quick answer. Though I’m still unclear about why it is important for a spam filter to know the probability of a new message arriving. Would it not just arrive or not arrive?

Is a new message only a new message when it’s uniquely different from another? In other words, is a duplicate message not counted as a new message?


Thank you for elaborating! However, it doesn’t click for me why two mutually exclusive and exhaustive probabilities have a combined P higher than 1 in their conditional state.

p_new_message_given_spam = 0.75
p_new_message_given_non_spam = 0.3334

Your conditional probabilities are conditioned on different events (spam and non-spam), so they do not have to sum up to 1.


But if two probabilities add up to more than P = 1 it must mean that an outcome can be in both states. Therefore both non-spam and spam at the same time. Since P can never be larger than 1 when mutually exclusive, correct?


You do not sum conditional probabilities that are conditioned on different events.

Probabilities of mutually exclusive and exhaustive outcomes, such as p_spam and p_non_spam, sum up to 1.


Hi, I still don’t see the sense in having the probability of getting a new message.

A spam filter should assess whether a message that has arrived is spam or not, so the probability of a new message having arrived should always be 1.

Yes, you can derive this quantity using the formula you stated, but this doesn’t confirm its physical meaning.


@pgfox96

Sorry for the late reply.

The probability of new message is the sum of the probabilities of getting it as a non_spam message and as a spam message, each weighted by its prior.

I would recommend Chapter 13: Naive Bayes in Data Science from Scratch by Joel Grus.


Sorry but that still doesn’t answer my question.
Look at this screenshot, from the ‘Instructions’ for Screen 3.
[Screenshot: DQ_question]

Now if a new message has been received, then why does:

P(New message) = 0.5417?

The message has been received! It makes no sense to talk of the ‘probability of getting a new message’, as you said in your first response.

Is there something I am not understanding? Any help would be much appreciated.


@pgfox96

If you apply the formula for p_new_message above, you will get 0.5417.

This is how Naive Bayes works. The probability of new_message is computed the same way as the probability of delay in this mission:

p_delay = p_boeing*p_delay_given_boeing + p_airbus*p_delay_given_airbus

If you agree with the above, please apply similar understanding.

Cheers!


Hey folks, sorry to see this is causing confusion!

$P(\text{New Message})$ doesn’t mean the probability that a new message has arrived. When a message arrives, the probability that it arrives is always 1.

$\text{New Message}$ is a variable name that takes concrete values in concrete cases. For instance, if the new message is “Urgent!!!”, then $P(\text{New Message}) = P(\text{"Urgent!!!"})$.

If the new message is “Hey Tom, are u home?”, then $P(\text{New Message}) = P(\text{"Hey Tom, are u home?"})$.
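To make the "variable that takes concrete values" idea tangible, here is a toy sketch with a made-up inbox. Everything in it is an assumption for illustration: the inbox contents and the counts are hypothetical (chosen so the results land near the numbers in the exercise), and counting whole messages is not how the course's algorithm estimates these values, since naive Bayes works from word frequencies. It only shows what each symbol refers to.

```python
# Hypothetical inbox: 12 spam and 12 non-spam messages (made-up data).
inbox = (
    [("Urgent!!!", "spam")] * 9
    + [("You won a prize", "spam")] * 3
    + [("Urgent!!!", "non_spam")] * 4
    + [("Hey Tom, are u home?", "non_spam")] * 8
)

# The concrete value that the variable "New Message" takes in this case.
new_message = "Urgent!!!"

spam = [text for text, label in inbox if label == "spam"]
non_spam = [text for text, label in inbox if label == "non_spam"]

# P(New Message): share of ALL messages whose text is "Urgent!!!".
p_new_message = sum(text == new_message for text, _ in inbox) / len(inbox)

# P(New Message | Spam): share of SPAM messages whose text is "Urgent!!!".
p_new_message_given_spam = spam.count(new_message) / len(spam)

# P(New Message | Non-Spam): share of NON-SPAM messages with that text.
p_new_message_given_non_spam = non_spam.count(new_message) / len(non_spam)

print(round(p_new_message, 4))                 # 0.5417
print(round(p_new_message_given_spam, 4))      # 0.75
print(round(p_new_message_given_non_spam, 4))  # 0.3333
```

The last value prints as 0.3333 rather than the 0.3334 used in the exercise, which appears to be rounded up.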


I think that going back to a chart like this might be helpful to visualize what is going on.

The outermost box represents the ratio of all messages being either spam or not spam. This is our prior probability. In this case, we get a new message and without looking at contents of the message at all, we are told the message has a .5 probability of being spam and a .5 probability of being not-spam so the box is simply divided in half.

Next, we are told that P(new message) = .5417. What does this mean??? This is the overall probability of seeing this particular message, whether it comes from the spam side or the not-spam side. This is the area of the chart that is pink.

And finally, we are told

P(new message | spam) = 0.75
P(new message | not-spam) = 0.3334

We can think of this as saying, 75% of the spam side is pink, and 33% of the not-spam side is also pink. These are the conditional probabilities of the message given each class (the likelihoods), not the posterior probabilities. Note that these DO NOT add up to 1 or 100%. However,

P(new message | spam) + P(not-new message | spam) = 1 
P(new message | not-spam) + P(not-new message | not-spam) = 1

The questions for this lesson ask us to calculate what ratio of the pink is on the spam side, and what ratio is on the not-spam side.
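Here is a minimal sketch of that calculation using Bayes' theorem and the values from the question. I am assuming the screen asks for P(Spam | New Message) and P(Non-Spam | New Message); the two result variable names are my own.

```python
# Values copied from the question at the top of the thread.
p_spam = 0.5
p_non_spam = 0.5
p_new_message = 0.5417
p_new_message_given_spam = 0.75
p_new_message_given_non_spam = 0.3334

# Bayes' theorem: share of the pink area that lies on the spam side.
p_spam_given_new_message = (
    p_spam * p_new_message_given_spam / p_new_message
)

# Share of the pink area that lies on the not-spam side.
p_non_spam_given_new_message = (
    p_non_spam * p_new_message_given_non_spam / p_new_message
)

print(round(p_spam_given_new_message, 4))      # 0.6923
print(round(p_non_spam_given_new_message, 4))  # 0.3077
```

Unlike the two "given spam" and "given non-spam" values, these two results do add up to 1, because both are conditioned on the same event (this particular message).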