Building a Spam Filter with Naive Bayes in R


Has anyone else tried running the ‘calculating the parameter’ section as found in the solution and had the code run far longer than normal (i.e., literally hours)?


I’ve had to give up on that project because of it.
I tried to find a way to vectorize those for loops (my MATLAB knowledge led me down that path).

Here is the thread about that.


Hi @michael.hoang17, @manandwa,

I agree with you, this project is taking a very long time to run in our environment and is often resulting in a dead kernel. I have reported this issue to our engineering team. Have you tried running it in your local environment?

Best,
Sahil


Hi @Sahil, the issue came from running it in my local environment (as an FYI, unlike the Python pathways, the R projects don’t provide any IDE to complete the code). I’ve raised the issue in a discussion with @casey about vectorizing the for loops. For the Python version of the project this doesn’t seem to be an issue. I’d be happy to help the engineering team with a solution.


Thanks for the reply, @Sahil.

Yes, as with @manandwa, I was running this in my local environment and it was literally taking hours. This was also a problem with the following Jeopardy project, which I completed as well. I tried a slightly better workaround using lapply, but it was still like watching paint dry.

Just to echo what I found through a Google search, the for loops need to be vectorized in some way to speed this up. Someone I know mentioned using Google Labs as a possible workaround for the limitations of your own computing system, but I really have no idea whether that works. I think @casey may have some ideas on the matter.
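
To make the vectorization idea concrete, here is a minimal sketch in base R on made-up data. It is not the project's solution code: the function names and the toy messages are hypothetical, and the last step assumes the usual additive (Laplace) smoothing formula for multinomial Naive Bayes with an illustrative alpha of 1.

```r
# Toy data, invented for illustration only.
messages <- c("win cash now", "see you at lunch", "cash prize win win", "lunch now")
labels   <- c("spam", "ham", "spam", "ham")

# Loop version: visit every word of every message and bump a counter.
count_words_loop <- function(messages, vocabulary) {
  counts <- setNames(integer(length(vocabulary)), vocabulary)
  for (msg in messages) {
    for (word in strsplit(msg, " ", fixed = TRUE)[[1]]) {
      counts[word] <- counts[word] + 1L
    }
  }
  counts
}

# Vectorized version: split all messages once and let table() do the counting.
count_words_vectorized <- function(messages, vocabulary) {
  words  <- unlist(strsplit(messages, " ", fixed = TRUE))
  counts <- table(factor(words, levels = vocabulary))
  setNames(as.integer(counts), vocabulary)
}

vocabulary  <- sort(unique(unlist(strsplit(messages, " ", fixed = TRUE))))
spam_counts <- count_words_vectorized(messages[labels == "spam"], vocabulary)

# Assumed smoothing step: P(word | spam) for the whole vocabulary in one expression.
alpha <- 1
p_word_given_spam <- (spam_counts + alpha) /
  (sum(spam_counts) + alpha * length(vocabulary))
print(round(p_word_given_spam, 3))
```

Both functions return the same counts; the second avoids the nested loop, which is where the time tends to go on a full SMS dataset.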


Hi @manandwa and @michael.hoang17. I was out of the office last week. Writing to let you know that I’ve seen this post as well as the other one. I agree that vectorization should probably be used here to expedite code running times. We’ll look into this.

Best,
-Casey


Would love to hear more about this. I reached this point of the project yesterday, and after two attempts that each ran longer than half an hour I have to shelve the project for now.


Hi All,

We have improved the solution code to make it run faster. Thank you for letting us know about it.

Best,
Sahil

Hi, here’s my take on the project. I used the original dataset.
I also used for loops to build the classifier and for the grid search, and although they certainly took quite a bit of time to run (the grid search, called cross-validation in the project, took about 2 minutes), my code ran quite well.

spam_filter/spamfilter.pdf at master · teorems/spam_filter (github.com)

Hello, help please, I am stuck! I am trying to work on this project, but I get an error message saying that the function only works on numeric data:
‘Error in FUN(X[[i]], …) : only defined on a data frame with all numeric variables’

library(tidyverse)
set.seed(1)

spam <- read_csv("spam.csv")
n <- nrow(spam)
n_training <- 0.8 * n
n_cv <- 0.1 * n
n_test <- 0.1 * n

train_indices <- sample(1:n, size = n_training, replace = FALSE)
remaining_indices <- setdiff(1:n, train_indices)
cv_indices <- remaining_indices[1:(length(remaining_indices)/2)]
test_indices <- remaining_indices[((length(remaining_indices)/2) + 1):length(remaining_indices)]
# Use the indices to create each of the datasets
spam_train <- spam[train_indices,]
spam_cv <- spam[cv_indices,]
spam_test <- spam[test_indices,]


tidy_train <- spam_train %>% 
  mutate(
    # Take the messages and remove unwanted characters
    sms = str_to_lower(sms) %>% 
      str_squish %>% 
      str_replace_all("[[:punct:]]", "") %>% 
      str_replace_all("[\u0094\u0092\u0096\n\t]", "") %>% # Unicode characters
      str_replace_all("[[:digit:]]", "")
  )

I checked the code against the solutions notebook and it matches. Is this something to do with the CSV file?

Many thanks,
L
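
For what it's worth, base R raises a message of this kind when a summary function such as sum() or max() is handed a whole data frame that contains non-numeric columns. Whether that is what happens in the code above is only a guess, since the failing line isn't shown. A minimal sketch that reproduces the message on a made-up frame (the column names label and sms are assumptions, and the exact wording can differ slightly between R versions):

```r
# Made-up two-column frame standing in for the SMS data.
toy <- data.frame(
  label = c("ham", "spam"),
  sms   = c("see you soon", "win cash now"),
  stringsAsFactors = FALSE
)

# A summary function applied to the whole data frame fails: the columns are text.
msg <- tryCatch(sum(toy), error = function(e) conditionMessage(e))
print(msg)

# Applied to a numeric vector derived from one column, the same function works.
n_words <- lengths(strsplit(toy$sms, " ", fixed = TRUE))
print(sum(n_words))

# Checking the column types shows what was actually read in.
str(toy)
```

Running str() on the object returned by read_csv() is a quick way to check whether the file was parsed into the expected columns.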