Guided Project: Building a Spam Filter with Naive Bayes (Probability)

Hi, I am working on the guided project for the spam filter using Naive Bayes in the probability section. In the project, we have to convert the SMS messages into the desired format. Here’s the mission link.


In order to do that, it’s written in the mission to run the following code:

word_counts_per_sms = {unique_word: [0] * len(training_set['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(training_set['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1

Though there is an explanation provided in the mission, I am unable to understand what exactly we are doing here.

It would be great if someone can simplify it more. Thanks!


Hi Sandesh,

I haven’t seen the mission, but I’ll try to explain in sufficient detail and hopefully I don’t repeat what the mission already says.
The goal is to create new columns, one per unique word, where each row’s value (after transformation) shows the number of times that column’s word appears in that row’s SMS.

The first line with word_counts_per_sms is a dictionary comprehension. You can recognize it by the for a in b syntax wrapped in {} with a : in the middle. Without the :, no key-value pairs are created and it becomes a set comprehension instead.
This step creates a list of zeros for each unique_word (these will become the columns later). This code is not efficient because it calls len(training_set['SMS']) once for every word in the vocabulary during the comprehension. You could compute it once outside the loop, save it as a variable, and reuse it. The zeros will later be incremented into counts as the next block iterates through the SMS column.
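To make that concrete, here is a minimal sketch of the dictionary comprehension on toy data, with the length hoisted out of the loop as suggested (the variable names and the tiny vocabulary are just illustrative, not from the mission):

```python
vocabulary = ["secret", "prize", "hello"]
sms_list = [["hello", "secret"], ["prize", "prize", "hello"]]

# Computed once here, instead of once per vocabulary word
n_sms = len(sms_list)

# One list of zeros per unique word, one zero per SMS
word_counts_per_sms = {word: [0] * n_sms for word in vocabulary}
print(word_counts_per_sms)
# {'secret': [0, 0], 'prize': [0, 0], 'hello': [0, 0]}
```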
The second block’s outer for loop uses enumerate to add a counter to the iteration; here the counter synchronizes the SMS column with the lists of zeros in word_counts_per_sms[word]. The outer loop extracts sms, a single message from the SMS column. The inner loop takes that single message and iterates through its words one by one. Every word, in this row and in all other rows, should match one of the unique_word keys previously used to create word_counts_per_sms. word_counts_per_sms[word][index] first indexes into the correct list for the word, then adds 1 to the correct position in that list using index. Each position in the list represents one row, and there are len(training_set['SMS']) rows, as defined earlier. One strange thing: I don’t see the step where the SMS in a single row is split on whitespace into its constituent words. I was expecting sms.split() to happen somewhere before the inner loop, so presumably the SMS column was already split into lists of words in an earlier step of the mission.
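Putting the two blocks together on toy data may help. This sketch assumes each SMS has already been split into a list of words (as noted above, that step presumably happened earlier in the mission); the sample messages are made up:

```python
vocabulary = ["secret", "prize", "hello"]
sms_list = [["hello", "secret"], ["prize", "prize", "hello"]]

word_counts_per_sms = {word: [0] * len(sms_list) for word in vocabulary}

# enumerate pairs each SMS with its row index, so each count lands
# in the right position of the right word's list
for index, sms in enumerate(sms_list):
    for word in sms:
        word_counts_per_sms[word][index] += 1

print(word_counts_per_sms["prize"])
# [0, 2] -> "prize" never appears in the first SMS, twice in the second
```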

Allocating memory early, as done here, prevents wasting time when the program discovers that the space allocated so far is insufficient and has to create a new array of double the size and copy the items from the old array to the new one. https://www.geeksforgeeks.org/how-do-dynamic-arrays-work/ This is a common cost of all dynamic-sized arrays (arrays that grow automatically so you don’t have to think about allocation up front). The same technique is used when coding neural networks with numpy arrays too.
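A rough pure-Python sketch of the difference (the names and sizes here are just illustrative):

```python
n_rows = 5

# Growing dynamically: as the list outgrows its capacity, Python
# occasionally reallocates a bigger array and copies the items over
grown = []
for _ in range(n_rows):
    grown.append(0)

# Preallocating: one allocation up front, then fill positions in place
preallocated = [0] * n_rows
preallocated[2] = 7  # e.g. record a count for row 2 directly
grown[2] = 7

# Both end up with the same contents; preallocation just avoids
# the intermediate reallocate-and-copy steps
assert grown == preallocated
```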

When you understand this, try CountVectorizer, which does the same thing but much faster: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html. It is used in NLP modelling tasks.


Thank you. This helped a lot!
