Question about the solution notebook: Build a Spam Filter


My Code:

```
present_probs <- word_counts %>%
  filter(word %in% words) %>%
  mutate(
    # Calculate the probabilities from the counts
    spam_prob = (spam_count + alpha) / (n_spam + alpha * n_vocabulary),
    ham_prob = (ham_count + alpha) / (n_ham + alpha * n_vocabulary)
  )
```

This code calculates the probabilities with an n_spam that is based on unique words. Why does it have to be unique? Shouldn't it be "all possible words"? Am I understanding this correctly?

I do understand that the code below uses the unique words (spam_vocab) to count the numerator, spam_count, in spam_prob = (spam_count + alpha) / (n_spam + alpha * n_vocabulary):

```
spam_counts <- tibble(
  word = spam_vocab
) %>%
  mutate(
    # Calculate the number of times a word appears in spam
    spam_count = map_int(word, function(w) {

      # Count how many times each word appears in all spam messages, then sum
      map_int(spam_messages, function(sm) {
        (str_split(sm, " ")[[1]] == w) %>% sum # for a single message
      }) %>%
        sum # then summing over all messages

    })
  )
```

What I am really curious about is the denominator in spam_prob = (spam_count + alpha) / (n_spam + alpha * n_vocabulary). Why is n_spam in the denominator still based on unique words? Shouldn't it be the count of all words?

The lesson teaches this with all words, so why does the project use unique()?
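To make the difference concrete, here is a small sketch with made-up messages. The data and the names n_spam_total, n_spam_unique, prob_total and prob_unique are my own for illustration, not from the project, and I use base R only so it runs without extra packages. It assumes, as I understood the lesson, that n_spam should count every word in the spam messages, repeats included, while n_vocabulary counts distinct words.

```r
# Hypothetical toy data, not taken from the project
spam_messages <- c("win money now", "win a prize now")

# Split every message on spaces and flatten into one vector of words
spam_words <- unlist(strsplit(spam_messages, " "))

# All words in spam, repeats included (what I expect n_spam to be)
n_spam_total <- length(spam_words)

# Distinct words only (what using unique() would give)
n_spam_unique <- length(unique(spam_words))

alpha <- 1
n_vocabulary <- 5 # assumed vocabulary size for this toy example

# Number of times "win" appears across all spam messages
spam_count_win <- sum(spam_words == "win")

# Same numerator, two different denominators
prob_total <- (spam_count_win + alpha) / (n_spam_total + alpha * n_vocabulary)
prob_unique <- (spam_count_win + alpha) / (n_spam_unique + alpha * n_vocabulary)

print(c(total = prob_total, unique = prob_unique))
```

With these toy messages the two denominators differ (7 words in total versus 5 distinct words), so the resulting probabilities differ too, which is why I want to know which one the project intends.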
