Vectorizing the for loop to check for spam words in vocabulary


Screen Link: https://app.dataquest.io/m/475/guided-project%3A-building-a-spam-filter-with-naive-bayes/6/calculating-parameters

I'm trying to vectorize this code, because running it as-is causes my PC to run out of memory.

This is because the vocabulary and the spam and ham vectors are all large.

This code is similar to the solution code, but I know the early parts of the R path say to use vectorized functions instead of for loops whenever possible. Looking at the Python version of this project, this wasn't an issue there (I was trying to find an R equivalent of list comprehensions that would speed things up). A rough sketch of the vectorized version I'm aiming for is below the code.

My code (commented out for now, since running it was slowing down my system):

# for (v in vocabulary) {
# 
#   # Initialize count variables
#   spam.counts[[v]] <- 0
#   ham.counts[[v]] <- 0
# 
#   # Cycle through spam messages and count how many times that word appears
#   for (s_m in spam.messages) {
#     s_w <- str_split(s_m, " ")[[1]]
#     spam.counts[[v]] <- spam.counts[[v]] + sum(s_w == v)
#   }
# 
#   # Cycle through non-spam (ham) messages and count how many times that word appears
#   for (h_m in ham.messages) {
#     h_w <- str_split(h_m, " ")[[1]]
#     ham.counts[[v]] <- ham.counts[[v]] + sum(h_w == v)
#   }
# 
#   # Calculate the probabilities using the counts
#   spam.probs[[v]] <- (spam.counts[[v]] + alpha) / (n.spam + alpha * n.vocabulary)
#   ham.probs[[v]] <- (ham.counts[[v]] + alpha) / (n.ham + alpha * n.vocabulary)
# }
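One fully vectorized idea I've been sketching (untested at scale, and assuming the same objects as in the loop above: `vocabulary`, `spam.messages`, `ham.messages`, `alpha`, `n.spam`, `n.ham`, `n.vocabulary`) is to split every message once, let `table()` count all the words in one pass, and then just look the vocabulary words up:

```r
library(stringr)

# Split every message into words once and flatten into one long vector
spam.words <- unlist(str_split(spam.messages, " "))
ham.words  <- unlist(str_split(ham.messages, " "))

# Count every distinct word in a single pass
spam.tab <- table(spam.words)
ham.tab  <- table(ham.words)

# Look up each vocabulary word's count; words that never occur come back NA
spam.counts <- as.numeric(spam.tab[vocabulary])
ham.counts  <- as.numeric(ham.tab[vocabulary])
spam.counts[is.na(spam.counts)] <- 0
ham.counts[is.na(ham.counts)]   <- 0

# Smoothed probabilities for the whole vocabulary at once
spam.probs <- (spam.counts + alpha) / (n.spam + alpha * n.vocabulary)
ham.probs  <- (ham.counts + alpha) / (n.ham + alpha * n.vocabulary)
names(spam.probs) <- vocabulary
names(ham.probs)  <- vocabulary
```

This avoids the nested loops entirely: each message is split exactly once, and `table()` replaces the per-word counting.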

Update: I've run the loop, but it still takes forever (and also caused 99% CPU usage :smiley:). If this can't be fully vectorized, is there a way to combine vector operations with a single for loop? For now I've given up on the project, since this is a blocker for the rest of it (I completed the project mission but have stopped coding on it).
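The half-way version I have in mind (a sketch only, same assumed objects as above) hoists the expensive `str_split` calls out of the loop, so a single loop over the vocabulary remains and the counting inside it is one vectorized comparison:

```r
library(stringr)

# Split all messages once, up front, instead of once per vocabulary word
spam.words <- unlist(str_split(spam.messages, " "))
ham.words  <- unlist(str_split(ham.messages, " "))

spam.probs <- numeric(length(vocabulary))
ham.probs  <- numeric(length(vocabulary))
names(spam.probs) <- vocabulary
names(ham.probs)  <- vocabulary

# Single loop; sum(spam.words == v) is one vectorized comparison per word
for (v in vocabulary) {
  spam.probs[[v]] <- (sum(spam.words == v) + alpha) / (n.spam + alpha * n.vocabulary)
  ham.probs[[v]]  <- (sum(ham.words == v) + alpha) / (n.ham + alpha * n.vocabulary)
}
```

This still scales with vocabulary size times total word count, so the `table()` sketch above should be much faster, but at least it no longer re-splits every message on every iteration.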
