My Code:

``````
present_probs <- word_counts %>%
  filter(word %in% words) %>%
  mutate(
    # Calculate the probabilities from the counts
    spam_prob = (spam_count + alpha) / (n_spam + alpha * n_vocabulary),
    ham_prob = (ham_count + alpha) / (n_ham + alpha * n_vocabulary)
  )
``````

What I expected to happen:
For n_spam and n_vocabulary, why does the solution apply unique() to the words? The formula needs probabilities over all the words, so why are the counts taken from the unique words only?

What actually happened:
Please answer, thanks.

I think the program uses this code to get the set of unique words in spam, since you do not want to calculate the probability of a word more than once.

``````
spam_vocab <- spam_vocab %>% unique
``````
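To make the effect of unique concrete, here is a small sketch in base R. The word vector is made up for illustration and is not from the course dataset. It only contrasts the total number of words, repeats included, with the number of distinct words.

``````r
# Toy word vector (hypothetical data, not from the SMS dataset)
toy_words <- c("free", "offer", "free", "win", "offer", "free")

# Counts every occurrence, repeats included
n_total_words <- length(toy_words)

# Counts each distinct word once
n_unique_words <- length(unique(toy_words))
``````

Here n_total_words is 6 and n_unique_words is 3, so the two counts can differ a lot once words repeat.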

The program uses the code below to count how many times each of these unique words occurs across all spam messages. You pick a word in spam, say "offer", and count how many times it occurs in the spam messages.

``````
# Uses tibble, dplyr, purrr and stringr (for example via library(tidyverse))
spam_counts <- tibble(
  word = spam_vocab
) %>%
  mutate(
    # Calculate the number of times a word appears in spam
    spam_count = map_int(word, function(w) {

      # Count how many times each word appears in all spam messages, then sum
      map_int(spam_messages, function(sm) {
        (str_split(sm, " ")[[1]] == w) %>% sum # for a single message
      }) %>%
        sum # then summing over all messages

    })
  )
``````
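The same counting idea can be tried on a few toy messages. This is a minimal sketch in base R, assuming messages are split on single spaces as in the code above. The messages and the helper name count_word are hypothetical, chosen only for illustration.

``````r
# Hypothetical toy messages standing in for spam_messages
toy_messages <- c(
  "free offer today",
  "win a free prize",
  "offer ends today offer"
)

# Count how many times one word occurs across all messages
# (count_word is an illustrative helper, not part of the course solution)
count_word <- function(w, messages) {
  per_message <- vapply(
    messages,
    # Split one message on spaces and count exact matches of w
    function(m) sum(strsplit(m, " ", fixed = TRUE)[[1]] == w),
    integer(1)
  )
  sum(per_message) # then sum over all messages
}

offer_count <- count_word("offer", toy_messages)
free_count <- count_word("free", toy_messages)
``````

With these toy messages, "offer" is counted 3 times (once in the first message, twice in the third) and "free" 2 times, so repeated occurrences inside one message are all included.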

    present_probs <- word_counts %>%
      filter(word %in% words) %>%
      mutate(
        # Calculate the probabilities from the counts
        spam_prob = (spam_count + alpha) / (n_spam + alpha * n_vocabulary),
        ham_prob = (ham_count + alpha) / (n_ham + alpha * n_vocabulary)
      )

But when we calculate the probability, shouldn't the denominator be the count of "all words", repeats included, rather than only the unique words? Am I getting that right?
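One way to check which count belongs in the denominator is to test whether the smoothed probabilities over the whole vocabulary sum to 1. In the usual multinomial naive Bayes formula with additive smoothing, the denominator is the total number of word occurrences in spam plus alpha times the number of distinct vocabulary words. The sketch below uses made-up toy data, and every toy_ name is hypothetical. It says nothing about how n_spam is actually computed in the course solution.

``````r
# Hypothetical toy data: every word occurrence in the spam messages
toy_spam_words <- c("free", "offer", "free", "win", "offer", "free")
# Hypothetical full vocabulary (distinct words across spam and ham)
toy_vocabulary <- c("free", "offer", "win", "hello", "meeting")
alpha <- 1

# Occurrences of each vocabulary word in spam (0 if it never appears)
toy_spam_count <- vapply(
  toy_vocabulary,
  function(w) sum(toy_spam_words == w),
  integer(1)
)

n_vocab <- length(toy_vocabulary)               # distinct vocabulary words
n_spam_total <- length(toy_spam_words)          # all occurrences in spam
n_spam_unique <- length(unique(toy_spam_words)) # distinct spam words only

# Smoothed probabilities with each choice of denominator
probs_total <- (toy_spam_count + alpha) / (n_spam_total + alpha * n_vocab)
probs_unique <- (toy_spam_count + alpha) / (n_spam_unique + alpha * n_vocab)

sum_total <- sum(probs_total)
sum_unique <- sum(probs_unique)
``````

On this toy data sum_total is 1, while sum_unique is 11/8. So if n_spam counted only the distinct spam words, the values would no longer form a proper probability distribution over the vocabulary. The unique count does belong in n_vocabulary, which is the number of distinct words by definition.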