Building a Spam Filter with Naive Bayes: Series sum being treated as a list. Why?

Screen Link:
https://app.dataquest.io/m/433/guided-project%3A-building-a-spam-filter-with-naive-bayes/7/calculating-parameters

My Code:

SMS spam filter
In this project we'll build a spam filter for SMS messages, capable of distinguishing between spam and non-spam messages.

We'll do this using, as a reference, a set of messages previously classified by humans as spam or non-spam.

Below, we'll import the libraries needed for our analysis.

import pandas as pd
import numpy as np
import re
import json
Now, we'll load the collection into a pandas DataFrame.

collection = pd.read_csv('SMSSpamCollection', sep='\t', header=None)
Let's check the DataFrame to get to know it better.

collection.head()

It looks like the columns could be better named. Let's rename them to "label" and "sms".

collection.columns = ['label','sms']
Now, time to check that it's OK.

Everything looks fine. Let's check the size of our DataFrame.

collection.shape
(5572, 2)
We have 5,572 rows and 2 columns in this dataset. Let's check the proportion of spam and non-spam messages.

collection['label'].value_counts()
ham     4825
spam     747
Name: label, dtype: int64
total_rows = collection['label'].value_counts()[0]+collection['label'].value_counts()[1]
ham = collection['label'].value_counts()[0]/total_rows
spam = collection['label'].value_counts()[1]/total_rows
ham
0.8659368269921034
spam
0.13406317300789664
Around 87% of this dataset is non-spam ("ham"), and roughly 13% is spam.
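
As an aside, pandas can produce these proportions directly; a one-line equivalent of the computation above (just a sketch, not the code used in this notebook) is:

collection['label'].value_counts(normalize=True)  # each label's share of the total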
With the data we have, we'll have to create two subsets of data:

One for training our program
Another for testing its accuracy
We're randomly taking 80% of our set for training and 20% for later testing, keeping the original ham-to-spam ratio. We're doing this below (see also the sketch after the ratio checks).

training_set = collection.sample(n=4458, random_state=1)
test_set = collection.sample(n=1114, random_state=1)
training_set['label'].value_counts()
ham     3858
spam     600
Name: label, dtype: int64
test_set['label'].value_counts()
ham     967
spam    147
Name: label, dtype: int64
training_set['label'].value_counts()[0]/4458
0.8654104979811574
test_set['label'].value_counts()[0]/1114
0.8680430879712747
We have both sets ready, and we kept the ham-to-spam ratio.
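For reference, one common way to build two disjoint subsets with fresh 0-based indexes looks like this; it's an illustrative sketch of the same 80/20 idea, not the code that produced the outputs above:

randomized = collection.sample(frac=1, random_state=1).reset_index(drop=True)  # shuffle once
split_point = round(len(randomized) * 0.8)  # the 80% mark: 4458 rows
training_set = randomized[:split_point].reset_index(drop=True)
test_set = randomized[split_point:].reset_index(drop=True)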

Data Cleaning
We'll now clean the training dataset, removing punctuation and making everything lowercase.

training_set['sms'] = training_set['sms'].str.replace('\W', ' ')
training_set['sms'] = training_set['sms'].str.lower()
training_set.head()
label	sms
1078	ham	yep by the pretty sculpture
4028	ham	yes princess are you going to make me moan
958	ham	welp apparently he retired
4642	ham	havent
4674	ham	i forgot 2 ask ü all smth there s a card on ...
training_set.columns = ['label', 'text']
Let's create our vocabulary.

vocabulary = []
training_set['text'] = training_set['text'].str.split()
for phrase in training_set['text']:
    for word in phrase:
        vocabulary.append(word)
Now, we use a trick to remove duplicates from the vocabulary list: we transform it into a set, and then back into a list.

vocabulary = list(set(vocabulary))
vocab_len = int(len(vocabulary))
vocab_len
7783
Let's now create a table with the word counts.
The code below creates a dictionary that maps each unique word to a list of zeroes (filler values), one zero per row of the training set.

word_counts_per_sms = {unique_word: [0] * len(training_set['text']) for unique_word in vocabulary}
The code below will fill the dictionary created above.

It loops through the training set's messages together with their positional index, and for each word in a message it increments that word's counter at that position in the word_counts_per_sms dictionary.

for index, sms in enumerate(training_set['text']):
    for word in sms:
        word_counts_per_sms[word][index] += 1
Let's now check the length of the word counter:

len(word_counts_per_sms)
7783
Looks OK. We'll now transform the word counts into a pandas DataFrame for ease of use.

word_counts = pd.DataFrame(word_counts_per_sms)
training_set_clean = pd.concat([training_set, word_counts], axis=1)
training_set_clean.head()

Probabilities
We'll now calculate the probabilities we need in order to create our spam filter using the Naive Bayes algorithm. We'll calculate:

P(Spam) - Probability of an SMS being spam
P(Ham) - Probability of an SMS not being spam (such messages are labeled "ham")
N(Spam) - Number of words across all spam SMS
N(Ham) - Number of words across all ham SMS
ts_total_rows = training_set_clean.shape[0]
ts_total_rows
5341
p_spam = (training_set_clean[training_set_clean['label'] == 'spam'].shape[0])/ ts_total_rows
p_spam
0.11233851338700618
p_ham = (training_set_clean[training_set_clean['label'] == 'ham'].shape[0]) / ts_total_rows
p_ham
0.7223366410784497
spam_only = training_set_clean[training_set_clean['label'] == 'spam']
spam_only
n_spam = len(spam_only)
ham_only = training_set[training_set['label'] == 'ham']
n_ham = 0

for row in ham_only['text']:
    for word in row:
        n_ham += 1
n_ham
57237
We'll now use Naive Bayes to calculate, for every word, its probability given spam and given ham. To do this properly and avoid zero probabilities for unseen words, we'll apply Laplace smoothing with a smoothing factor equal to 1; let's call it alpha. For each word w, the smoothed estimate is P(w|Spam) = (N(w|Spam) + alpha) / (N(Spam) + N(Vocabulary)), and analogously for ham.

alpha = 1
p_w_ham = {}
p_w_spam = {}
for word in vocabulary:
    n_word_spam = spam_only[word].sum()
    p_word_spam = (n_word_spam + alpha) / (n_spam + vocab_len)

What I expected to happen:

I expected to be able to do arithmetic with the result of spam_only[word].sum(), but for some reason it is being treated as a list, not a float or an int, and I don't understand why.

What actually happened:

TypeErrorTraceback (most recent call last)
/dataquest/system/env/python3/lib/python3.4/site-packages/pandas/core/ops.py in na_op(x, y)
    675         try:
--> 676             result = expressions.evaluate(op, str_rep, x, y, **eval_kwargs)
    677         except TypeError:

/dataquest/system/env/python3/lib/python3.4/site-packages/pandas/core/computation/expressions.py in evaluate(op, op_str, a, b, use_numexpr, **eval_kwargs)
    203     if use_numexpr:
--> 204         return _evaluate(op, op_str, a, b, **eval_kwargs)
    205     return _evaluate_standard(op, op_str, a, b)

/dataquest/system/env/python3/lib/python3.4/site-packages/pandas/core/computation/expressions.py in _evaluate_standard(op, op_str, a, b, **eval_kwargs)
     63     with np.errstate(all='ignore'):
---> 64         return op(a, b)
     65 

TypeError: can only concatenate list (not "int") to list

During handling of the above exception, another exception occurred:

TypeErrorTraceback (most recent call last)
/dataquest/system/env/python3/lib/python3.4/site-packages/pandas/core/ops.py in safe_na_op(lvalues, rvalues)
    699             with np.errstate(all='ignore'):
--> 700                 return na_op(lvalues, rvalues)
    701         except Exception:

/dataquest/system/env/python3/lib/python3.4/site-packages/pandas/core/ops.py in na_op(x, y)
    685                 mask = notna(x)
--> 686                 result[mask] = op(x[mask], y)
    687             else:

TypeError: can only concatenate list (not "int") to list

During handling of the above exception, another exception occurred:

TypeErrorTraceback (most recent call last)
<ipython-input-42-234df470d3bc> in <module>()
      1 for word in vocabulary:
      2     n_word_spam = spam_only[word].sum()
----> 3     p_word_spam = (n_word_spam + alpha) / (n_spam + vocab_len)

/dataquest/system/env/python3/lib/python3.4/site-packages/pandas/core/ops.py in wrapper(left, right, name, na_op)
    737                 lvalues = lvalues.values
    738 
--> 739         result = wrap_results(safe_na_op(lvalues, rvalues))
    740         return construct_result(
    741             left,

/dataquest/system/env/python3/lib/python3.4/site-packages/pandas/core/ops.py in safe_na_op(lvalues, rvalues)
    708                 if is_object_dtype(lvalues):
    709                     return libalgos.arrmap_object(lvalues,
--> 710                                                   lambda x: op(x, rvalues))
    711             raise
    712 

pandas/_libs/algos_common_helper.pxi in pandas._libs.algos.arrmap_object()

/dataquest/system/env/python3/lib/python3.4/site-packages/pandas/core/ops.py in <lambda>(x)
    708                 if is_object_dtype(lvalues):
    709                     return libalgos.arrmap_object(lvalues,
--> 710                                                   lambda x: op(x, rvalues))
    711             raise
    712 

TypeError: can only concatenate list (not "int") to list

Hey @arturvieirasousa

  1. Please share your code the way you have here when it is small!

  2. When the code is too long, like above, please share the notebook itself; it enables other community members to help faster and better. This post explains how to do that.

  3. Coming to the issue, have you checked this part? You started with 4458 total rows in both the training set and the vocabulary dictionary (converted to a DataFrame), but the total comes to 5341 rows in the combined df!

Work on this part of the code, then try again. Let us know if you face the same issue or a different one this time.
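
If it helps, that mismatch is usually an index-alignment issue: training_set keeps its original scattered index after sample(), while the word-count DataFrame gets a fresh 0-based index, so pd.concat aligns on the union of the two indexes and pads the gaps with NaN. A minimal sketch of the fix, assuming the variable names from your notebook:

# reset the sampled rows to a fresh 0-based index so that
# pd.concat lines them up row-for-row with word_counts
training_set = training_set.reset_index(drop=True)
word_counts = pd.DataFrame(word_counts_per_sms)
training_set_clean = pd.concat([training_set, word_counts], axis=1)
print(training_set_clean.shape)  # should now report 4458 rows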
