432 Naive Bayes: Why do we need to multiply these numbers rather than add them together?

Hello everyone

I’m having trouble figuring this out:

The lecture mentions that we need to treat each word separately.
Does this mean that the four events are independent, so to calculate the overall probability we need to multiply their probabilities?

lecture link: Learn data science with Python and R projects

That’s how probability theory is designed.
All probabilities lie between 0 and 1 inclusive. If you add the probabilities of 2 events, say 0.6 and 0.7, the sum overshoots 1.
However, if you multiply probabilities, the product never goes above 1.

Event A = draw King
Event B = draw Diamond
P(A) = 4/52
P(B) = 1/4
Event C = draw King of Diamonds = draw King and draw Diamond
P(C) = P(A and B) = P(A) * P(B) = 4/52 * 1/4 = 1/52 (Note multiplication, not addition)

We multiply so that the numerator of P(A) cancels the denominator of P(B).
In the end the overall denominator remains 52, which makes sense because we select 1 card out of 52. The overall numerator is 1, which is also correct because there is exactly 1 King of Diamonds.

P.S. Note that this example is contrived: it was purposely selected so that the numerators and denominators are raw counts that cancel and relate easily to their physical meaning. Understanding the various interpretations of probability is helpful: https://www.stat.berkeley.edu/~stark/SticiGui/Text/probabilityPhilosophy.htm
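The card example can be checked by brute-force counting. This is only a minimal sketch: the names `deck` and `prob` are made up for illustration, and it assumes one card is drawn uniformly at random from a standard 52-card deck.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["Spades", "Hearts", "Diamonds", "Clubs"]
deck = list(product(ranks, suits))  # 52 (rank, suit) pairs


def prob(event):
    """Probability that one uniformly drawn card satisfies `event`."""
    favourable = sum(1 for card in deck if event(card))
    return Fraction(favourable, len(deck))


p_king = prob(lambda card: card[0] == "K")            # 4/52
p_diamond = prob(lambda card: card[1] == "Diamonds")  # 13/52 = 1/4
p_king_of_diamonds = prob(lambda card: card == ("K", "Diamonds"))

# Counting directly gives the same answer as multiplying P(A) * P(B).
print(p_king * p_diamond == p_king_of_diamonds)  # True
```

The product matches the direct count here because, in a full deck, knowing the suit tells you nothing about the rank.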

Look up any probability tree and you’ll see something like this.
The tree maps out all possible outcomes, and the paths to get there.
Outcomes are ordered, so H,T is not the same as T,H. In school exams, things can get tricky when they define “derived” outcomes that require you to merge multiple outcomes (e.g. find the probability that the outcome contains 1 head and 1 tail, in any order). This is when you add: 1/4 + 1/4 = 2/4.
We move from left to right, multiplying along the way, until we reach the desired outcome.
We do this repeatedly until we have found all the outcomes. (Of course, this only demonstrates the thinking process of a student in an exam. In a real problem, all outcomes can be precalculated and ready to be summed or returned by the program; maintaining the program then means updating the probabilities along the paths and precalculating the outcomes again, assuming the tree is not too big.)
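Here is a small sketch of the two-flip tree, assuming a fair coin. The names `p_flip` and `path_prob` are hypothetical. It multiplies along each path and then adds the two paths that make up the derived outcome "1 head and 1 tail in any order".

```python
from fractions import Fraction
from itertools import product

# Probability on each edge of the tree (fair coin assumed).
p_flip = {"H": Fraction(1, 2), "T": Fraction(1, 2)}

# Walk every ordered path root -> first flip -> second flip,
# multiplying the edge probabilities along the way.
path_prob = {}
for first, second in product(p_flip, repeat=2):
    path_prob[first + second] = p_flip[first] * p_flip[second]

# A derived outcome merges several paths, so their probabilities are added.
p_one_head_one_tail = path_prob["HT"] + path_prob["TH"]

print(path_prob["HT"])       # 1/4
print(p_one_head_one_tail)   # 1/2
```

Multiplication is used within one path, and addition is used across different paths that end in outcomes you want to group together.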

If we want to interpret any path that doesn’t connect to the root of the tree, then it’s a conditional probability. The further right you go, the more events that edge is conditional on.
For example, the 4 edges in the “Second Flip” column can all be described by “Given that the first flip = Y, what is the second flip?”. The notation is P(2nd flip = X | 1st flip = Y).

So the overall tree for your problem has 2 branches from the root {Spam, Not Spam}. Once we move 1 step rightwards along the Spam branch, we look at the subtree from that point to get all the w1|spam, w2|spam…
(From https://www.quora.com/Why-do-we-multiply-the-probability-of-independent-events)
Imagine this example is also a tree. Now imagine we take all 5 branches from A and squeeze them together into B. This squeezing together expresses the idea of “factorization”, like XZ + YZ = (X + Y)Z.
If how you travel from B-C is independent of how you travel from A-B then no matter which of 5 paths you take from A-B, the same 3 B-C probabilities apply. So instead of drawing 5 versions of B, each with 3 paths to C to result in a tree with 15 outcomes, we factorize the 5 versions of B into 1 B.

You can substitute tossing a die (6 edges) for the 5 A-B edges, and flipping a coin (2 edges) for the 3 B-C edges, and think through it again. Tossing a die and flipping a coin have nothing to do with each other, so they should be independent. P(2 on die and Head on coin) = 1/6 * 1/2 = 1/12. (We multiply because we are walking along this probability tree.) If the coin toss were dependent on the die outcome, the 2nd term might fluctuate around 1/2, e.g. 1/2, 1.1/2, 0.8/2, … (6 fractions, one for each of the 6 die outcomes), and we could not just blindly multiply the 6 die outcomes by the same 1/2. You can also swap this whole reasoning to 1/2 * 1/6 (independence is a symmetric relation).

Even if personal experience tells you that two things should be dependent, if the coin toss distribution remains 1/2 for each outcome no matter what the die shows, then probability theory considers them independent, and if you want to use that theory, that is the correct terminology for communication.
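To make the die and coin reasoning concrete, here is a rough sketch. The dependent table `p_head_given_die_dep` is entirely made up for illustration, and the helper names are hypothetical. In the independent case every die face leads to the same 1/2, so multiplying by a single 1/2 works. In the made-up dependent case you must multiply by the conditional probability for that specific face.

```python
from fractions import Fraction

p_die = {face: Fraction(1, 6) for face in range(1, 7)}

# Independent case: the coin's distribution is the same after every die face.
p_head_given_die_indep = {face: Fraction(1, 2) for face in p_die}

# Hypothetical dependent case: P(Head | die face) changes with the face.
p_head_given_die_dep = {
    1: Fraction(1, 2), 2: Fraction(11, 20), 3: Fraction(2, 5),
    4: Fraction(1, 2), 5: Fraction(3, 5), 6: Fraction(9, 20),
}


def joint(face, p_head_given_die):
    """P(die = face and coin = Head), walking the tree die -> coin."""
    return p_die[face] * p_head_given_die[face]


def marginal_head(p_head_given_die):
    """P(coin = Head), adding up all six paths that end in Head."""
    return sum(p_die[f] * p_head_given_die[f] for f in p_die)


print(joint(2, p_head_given_die_indep))  # 1/12
# In the dependent table the overall P(Head) is still 1/2, yet the shortcut
# P(die = 2) * P(Head) no longer equals the true joint probability.
print(joint(2, p_head_given_die_dep) == p_die[2] * marginal_head(p_head_given_die_dep))  # False
```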

In your original problem, if there are 100 words, the tree will be 100 columns of edges wide (how they are ordered from left to right doesn’t matter). Each column of edges represents the probability distribution of 1 word. Finding the probability of 1 set of 100 words in a prediction sample involves choosing 1 word from each of the 100 columns, looking up that word’s probability in the probability distribution (PMF) of that column, repeating this 100 times from left to right, and multiplying the results together. This whole process is done once for each class {Spam, Not Spam}.
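The process described above might look roughly like this in code. Every number here is invented for illustration and is not taken from the lecture, and `prior`, `word_prob`, and `score` are hypothetical names. The sketch uses a 4-word message and multiplies the class probability by one per-word probability for each word, once per class.

```python
from math import prod

# Hypothetical probabilities, NOT from the lecture.
prior = {"spam": 0.5, "not_spam": 0.5}
word_prob = {  # assumed values of P(word | class)
    "spam": {"secret": 0.4, "party": 0.1, "at": 0.2, "my": 0.3},
    "not_spam": {"secret": 0.05, "party": 0.2, "at": 0.3, "my": 0.3},
}
message = ["secret", "party", "at", "my"]


def score(label):
    """Walk one branch of the tree: class edge first, then one edge per word."""
    return prior[label] * prod(word_prob[label][word] for word in message)


# The whole walk is repeated once for each class, then the scores are compared.
scores = {label: score(label) for label in prior}
prediction = max(scores, key=scores.get)
print(prediction)  # spam
```

These scores are not normalized to sum to 1, which is fine for picking the larger one. With many words the product gets very small, so a common variation is to add logarithms instead of multiplying.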


That’s a really clear and detailed explanation.
I think I’ve got some of the points.
Still, I think I need some time to dive into it.