Spam filter for Naive Bayes - data transformation question

Hi dataquest team,
I am working on the Spam filter project and have a syntax (I guess?) question about the part where we build a dictionary of words from the SMS messages to hold the word count for each SMS.

Take this as a sample of the dataset,

import pandas as pd

#prepare a sample dataframe
sms = ['yep  by the pretty sculpture',
       'yes  princess  are you going to make me moan ',
       'welp apparently he retired', 'havent ']

df = pd.DataFrame(data = sms, columns = ['sms'])
df['sms_list'] = df['sms'].str.split()

df

#prepare a vocabulary of unique words
v = []

for row in df['sms']:
    words = row.split()
    for w in words:
        v.append(w)
            
v = list(set(v))

Ok, now I need to transform the vocabulary - which is the troublesome part.
This is basically the answer from the solutions notebook.

Working solution
word_counts_per_sms = {unique_word: [0] * len(df['sms_list']) for unique_word in v}

for sms_index, sms_words in enumerate(df['sms_list']):
    for word in sms_words:
        word_counts_per_sms[word][sms_index] += 1 

right = pd.DataFrame(word_counts_per_sms)       
right

I wrote something slightly different that, for god’s sake, gives me the same count in every column, and I swear I cannot understand why.

Wrong and I don't know why
list_of_zeroes = [0] * len(df['sms_list'])
word_counts_per_sms = {unique_word: list_of_zeroes for unique_word in v}

for sms_index, sms_words in enumerate(df['sms_list']):
    for word in sms_words:
        word_counts_per_sms[word][sms_index] += 1 

wrong = pd.DataFrame(word_counts_per_sms)
wrong

Hi @nlong,

This is happening because your code is using the same list instance for every unique_word.

When you do this:

list_of_zeroes = [0] * len(df['sms_list'])
word_counts_per_sms = {unique_word: list_of_zeroes for unique_word in v}

You are using the same instance, list_of_zeroes, for every unique_word. No copy of the list is created, so whenever you increment a value in the list for one word, the change appears for every word.

Here is a simpler example of this behavior:

my_list = [0, 0]
list_of_lists = [my_list, my_list]

list_of_lists[0][0] = 1
list_of_lists[0][1] = 2

print(list_of_lists[0])
print(list_of_lists[1])
[1, 2]
[1, 2]

As you can see, we are only explicitly changing list_of_lists[0] but list_of_lists[1] also changes. The reason is the same. We used the same list twice.
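
If you want to keep the separate list_of_zeroes variable, a minimal fix (just a sketch) is to copy the template list for every key, so each word gets its own instance:

# copy the template list for every key, so incrementing one word's
# counts does not affect the other words
list_of_zeroes = [0] * len(df['sms_list'])
word_counts_per_sms = {unique_word: list_of_zeroes.copy() for unique_word in v}

The dict comprehension in the working solution achieves the same thing, because [0] * len(df['sms_list']) builds a brand-new list on every iteration.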

Let me know if this helps. If you have any other questions, let me know.

Best regards.


Subtle!!! Thanks @Francois, this is clear now.

Out of curiosity, I was wondering if there is a better way to deliver the same end result or to process dataframes of this size.
I am not a big fan of nested loops - and there are tons of articles saying that for loops are laggards (https://towardsdatascience.com/how-to-make-your-pandas-loop-71-803-times-faster-805030df4f06 was featured in the latest dq newsletter) - yet I lack the knowledge to understand what would be a better choice to perform the same operation faster.

Yesterday, while experimenting on the whole dataframe (which is 5000+ messages x 7000+ unique words), I ran into several MemoryErrors.

It is true that, in general, Python is a slow language, and this is not just the case for for loops.

The reason loops are faster in other Python libraries is that those libraries are implemented in faster languages, like C++.

However, that is not a reason to avoid for loops altogether. If you are writing performance-critical code for a real-time application, then you will probably want to make it run as fast as possible, but I don’t think that applies to this guided project.

There is another problem here that has nothing to do with that. On screen 5 you are asked to do this transformation:

[screenshot from screen 5: each message transformed into a row of word counts, with one column per unique word in the vocabulary]

If you have 5000 messages and 7000 unique words, this will be a huge table with 35 million entries. This is why you get memory errors.

I would advise you to skip this step and use a dictionary instead where the keys are the words and the values are the word counts. I’ve implemented the project in this way and it runs in just a couple of seconds to compute everything.
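
A rough sketch of that idea on the small sample df from above (the variable names are just illustrative, not my exact implementation):

# keep a plain dict of word -> count instead of a huge table
word_counts = {}

for sms_words in df['sms_list']:
    for word in sms_words:
        word_counts[word] = word_counts.get(word, 0) + 1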


Good point, I was not super convinced about the dataframe structure, even though I find it easier to visualize / manipulate.

If I were to write a single dictionary, then I would need to use words as keys and lists / arrays of word counts as values, right?
I’ll take a look at your code!
A tweak I added was to create the dataframe not with lists of zeroes but rather with numpy arrays - the whole operation also performed much faster on my side.
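
Something along these lines (just a sketch of what I mean, reusing the sample df and vocabulary v from above):

import numpy as np

# one row per message, one column per unique word, filled with zeros
counts = np.zeros((len(df), len(v)), dtype=int)

# map each word to its column index, then fill in the counts
word_index = {word: i for i, word in enumerate(v)}
for sms_index, sms_words in enumerate(df['sms_list']):
    for word in sms_words:
        counts[sms_index, word_index[word]] += 1

word_counts_df = pd.DataFrame(counts, columns=v)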

One last question: the whole project works on one dataset, split into train and test data. However, how would this algorithm work for classifying new data? For instance, if I have to classify a new message composed only of words that do not exist in the vocabulary, how would the filter handle it?
I think that’s the whole purpose of machine learning :joy:

Regarding the dictionary:

No, the keys are words and the values are integers giving the word count. If you look at the formulas, you only need to be able to find how many times a given word occurs in the spam messages and how many times it occurs in the ham messages (of the training set).

So two dictionaries are enough: count_in_spam and count_in_ham. For example, count_in_spam['money'] should be equal to the number of times the word 'money' occurs in the spam messages of the training set.
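
A minimal sketch of building those two dictionaries, assuming the training set is a dataframe train with a 'Label' column ('spam'/'ham') and a tokenized 'sms_list' column (these names are just for illustration, not the solution notebook's code):

count_in_spam = {}
count_in_ham = {}

for label, sms_words in zip(train['Label'], train['sms_list']):
    # pick the dictionary that corresponds to this message's class
    counts = count_in_spam if label == 'spam' else count_in_ham

    for word in sms_words:
        counts[word] = counts.get(word, 0) + 1

# e.g. count_in_spam.get('money', 0) is how many times 'money'
# occurs in the spam messages of the training set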

Regarding the ML question:

Actually, you already classified new data in the guided project. Assuming that the dataset was obtained by properly gathering a lot of unrelated SMSes, the messages in the test set are unrelated to the others. Note that we do not use them at all to build our classifier, and they can contain words that do not occur in the training messages.

So in a real-life scenario, it all depends on whether or not your training set had enough data in it. If you have a good enough training set, it will work fine on new messages.

Whenever the spam filter is unable to classify a new message, you should ask the user to classify it and update the spam filter by including the new message in the training set.

You could use this idea to build a spam filter from an empty dataset. Initially, of course, it will be unable to classify any message because the vocabulary is empty. This means that new messages will have an equal probability of being spam or ham. In this case, we let the user label it.

When a message comes in, we compute the probabilities according to the current data. If they are the same, we ask the user to label the message, add it to the training set, and update the probabilities.
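
To make the decision step concrete, here is a rough sketch (p_spam, p_ham, n_spam, n_ham, n_vocab and the two count dictionaries are assumed to be computed from the current training data; this mirrors the guided project's classifier, but it is not the exact notebook code):

def classify(message, alpha=1):
    words = message.lower().split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in words:
        # Laplace-smoothed P(word | class); unseen words still get a small probability
        p_spam_given_message *= (count_in_spam.get(word, 0) + alpha) / (n_spam + alpha * n_vocab)
        p_ham_given_message *= (count_in_ham.get(word, 0) + alpha) / (n_ham + alpha * n_vocab)

    if p_spam_given_message > p_ham_given_message:
        return 'spam'
    elif p_ham_given_message > p_spam_given_message:
        return 'ham'
    else:
        return 'needs human classification'  # equal probabilities: ask the user to label it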