Screen Link:
https://app.dataquest.io/m/433/guided-project%3A-building-a-spam-filter-with-naive-bayes/7/calculating-parameters
My Code:
SMS spam filter
In this project we'll build a spam filter for SMS messages, capable of distinguishing between spam and non-spam messages.
We'll use as a reference a set of messages previously classified as spam or non-spam by humans.
Below, we'll import the libraries needed for our analysis.
import pandas as pd
import numpy as np
Now, we'll load the collection into a pandas DataFrame.
collection = pd.read_csv('SMSSpamCollection', sep='\t', header=None)
Let's take a look at the DataFrame to get to know it better.
collection.head()
The columns could be better named. Let's rename them to 'label' and 'sms'.
collection.columns = ['label','sms']
Now, time to check if it's OK.
Everything looks fine. Let's check the size of our DataFrame.
collection.shape
(5572, 2)
We have 5,572 rows and 2 columns in this dataset. Let's check the proportion of spam and non-spam messages.
collection['label'].value_counts()
ham 4825
spam 747
Name: label, dtype: int64
total_rows = collection['label'].value_counts()[0]+collection['label'].value_counts()[1]
ham = collection['label'].value_counts()[0]/total_rows
spam = collection['label'].value_counts()[1]/total_rows
ham
0.8659368269921034
spam
0.13406317300789664
Around 87% of this dataset is non-spam ('ham'), and roughly 13% is spam.
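As an aside, pandas can compute these proportions directly; a minimal sketch of the shortcut, equivalent to the manual division above:

collection['label'].value_counts(normalize=True)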
With the data we have, we'll create two subsets:
One for training our program
Another for testing its efficiency
We'll randomly take 80% of the set for training and 20% for later testing, keeping the original ham/spam ratio. We're doing this below.
training_set = collection.sample(n=4458, random_state=1)
test_set = collection.sample(n=1114, random_state=1)
training_set['label'].value_counts()
ham 3858
spam 600
Name: label, dtype: int64
test_set['label'].value_counts()
ham 967
spam 147
Name: label, dtype: int64
training_set['label'].value_counts()[0]/4458
0.8654104979811574
test_set['label'].value_counts()[0]/1114
0.8680430879712747
We have both sets ready, and we kept the ham/spam ratio.
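One caveat worth flagging: calling collection.sample() twice with the same random_state draws from the same shuffled order, so the 1,114 test rows overlap the 4,458 training rows. A sketch of a disjoint 80/20 split, assuming we shuffle once and slice:

randomized = collection.sample(frac=1, random_state=1)
training_set = randomized[:4458].reset_index(drop=True)
test_set = randomized[4458:].reset_index(drop=True)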
Data Cleaning
We'll now clean the training dataset, removing punctuation and making it all lowercase.
training_set['sms'] = training_set['sms'].str.replace('\W', ' ')
training_set['sms'] = training_set['sms'].str.lower()
training_set.head()
label sms
1078 ham yep by the pretty sculpture
4028 ham yes princess are you going to make me moan
958 ham welp apparently he retired
4642 ham havent
4674 ham i forgot 2 ask ü all smth there s a card on ...
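A small portability note: on recent pandas versions (2.0 and later) str.replace no longer interprets the pattern as a regular expression by default, and a raw string avoids Python's invalid-escape warning, so the punctuation removal above would be written as:

training_set['sms'] = training_set['sms'].str.replace(r'\W', ' ', regex=True)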
training_set.columns = ['label', 'text']
Let's create our vocabulary.
vocabulary = []
training_set['text'] = training_set['text'].str.split()
for phrase in training_set['text']:
    for word in phrase:
        vocabulary.append(word)
Now, we use a trick to remove duplicates from the vocabulary list: we transform it into a set, and then back into a list.
vocabulary = list(set(vocabulary))
vocab_len = len(vocabulary)
vocab_len
7783
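Since Python sets are unordered, rebuilding the vocabulary this way can change the word order (and thus the column order of the table we build next) between runs. An optional tweak, not in the original, that makes runs reproducible:

vocabulary = sorted(set(vocabulary))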
Let's now create a table with the word count.
The code below creates a dictionary with one entry per unique word; each value is a list of zeroes (filler values), one per row of the training set.
word_counts_per_sms = {unique_word: [0] * len(training_set['text']) for unique_word in vocabulary}
The code below will fill the dictionary created above.
It loops through the training set's index and messages and, for each word in a message, increments that word's counter in the word_counts_per_sms dictionary.
for index, sms in enumerate(training_set['text']):
    for word in sms:
        word_counts_per_sms[word][index] += 1
Let's now check the length of the word counter:
len(word_counts_per_sms)
7783
Looks OK. We'll now transform the word counts into a pandas DataFrame for ease of use.
word_counts = pd.DataFrame(word_counts_per_sms)
training_set_clean = pd.concat([training_set, word_counts], axis=1)
training_set_clean.head()
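Note that pd.concat with axis=1 aligns on the index: training_set still carries the shuffled index it inherited from sample(), while word_counts has a fresh 0 to 4457 range index, so the concatenation creates extra, partially empty rows (which is why the row count below comes out larger than 4458). A sketch of the alignment fix, assuming the two frames should be glued together positionally:

training_set = training_set.reset_index(drop=True)
training_set_clean = pd.concat([training_set, word_counts], axis=1)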
Probabilities
We'll now calculate the probabilities we need in order to create our spam filter using the Naive Bayes algorithm. We'll calculate:
P(Spam) - Probability of an SMS being spam
P(Ham) - Probability of an SMS not being spam (labeled 'ham')
N(Spam) - Number of words in all spam SMS
N(Ham) - Number of words in all ham SMS
ts_total_rows = training_set_clean.shape[0]
ts_total_rows
5341
p_spam = (training_set_clean[training_set_clean['label'] == 'spam'].shape[0])/ ts_total_rows
p_spam
0.11233851338700618
p_ham = (training_set_clean[training_set_clean['label'] == 'ham'].shape[0]) / ts_total_rows
p_ham
0.7223366410784497
spam_only = training_set_clean[training_set_clean['label'] == 'spam']
spam_only
n_spam = len(spam_only)
ham_only = training_set[training_set['label'] == 'ham']
n_ham = 0
for row in ham_only['text']:
    for word in row:
        n_ham += 1
n_ham
57237
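Note that len(spam_only) counts spam messages, while the N(Spam) defined above is the number of words in all spam SMS. A sketch of a word count symmetric to the ham loop, using a hypothetical spam_msgs frame taken from training_set the same way ham_only was:

spam_msgs = training_set[training_set['label'] == 'spam']
n_spam = 0
for row in spam_msgs['text']:
    for word in row:
        n_spam += 1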
We'll now use Naive Bayes to calculate, for every word, its probability given spam and given ham. To do this properly we'll need a smoothing factor equal to 1 (Laplace smoothing). Let's call it alpha.
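For reference, in the notation used above, the smoothed per-word probability we're computing is:

P(w|Spam) = (N(w|Spam) + alpha) / (N(Spam) + alpha * N(Vocabulary))

with the analogous formula for ham.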
alpha = 1
p_w_ham = {}
p_w_spam = {}
for word in vocabulary:
    n_word_spam = spam_only[word].sum()
    p_word_spam = (n_word_spam + alpha) / (n_spam + vocab_len)
What I expected to happen:
I was expecting to do arithmetic with the spam_only[word].sum() result, but for some reason it is being treated as a list, not a float or int, and I don't get why.
What actually happened:
TypeErrorTraceback (most recent call last)
/dataquest/system/env/python3/lib/python3.4/site-packages/pandas/core/ops.py in na_op(x, y)
675 try:
--> 676 result = expressions.evaluate(op, str_rep, x, y, **eval_kwargs)
677 except TypeError:
/dataquest/system/env/python3/lib/python3.4/site-packages/pandas/core/computation/expressions.py in evaluate(op, op_str, a, b, use_numexpr, **eval_kwargs)
203 if use_numexpr:
--> 204 return _evaluate(op, op_str, a, b, **eval_kwargs)
205 return _evaluate_standard(op, op_str, a, b)
/dataquest/system/env/python3/lib/python3.4/site-packages/pandas/core/computation/expressions.py in _evaluate_standard(op, op_str, a, b, **eval_kwargs)
63 with np.errstate(all='ignore'):
---> 64 return op(a, b)
65
TypeError: can only concatenate list (not "int") to list
During handling of the above exception, another exception occurred:
TypeErrorTraceback (most recent call last)
/dataquest/system/env/python3/lib/python3.4/site-packages/pandas/core/ops.py in safe_na_op(lvalues, rvalues)
699 with np.errstate(all='ignore'):
--> 700 return na_op(lvalues, rvalues)
701 except Exception:
/dataquest/system/env/python3/lib/python3.4/site-packages/pandas/core/ops.py in na_op(x, y)
685 mask = notna(x)
--> 686 result[mask] = op(x[mask], y)
687 else:
TypeError: can only concatenate list (not "int") to list
During handling of the above exception, another exception occurred:
TypeErrorTraceback (most recent call last)
<ipython-input-42-234df470d3bc> in <module>()
1 for word in vocabulary:
2 n_word_spam = spam_only[word].sum()
----> 3 p_word_spam = (n_word_spam + alpha) / (n_spam + vocab_len)
/dataquest/system/env/python3/lib/python3.4/site-packages/pandas/core/ops.py in wrapper(left, right, name, na_op)
737 lvalues = lvalues.values
738
--> 739 result = wrap_results(safe_na_op(lvalues, rvalues))
740 return construct_result(
741 left,
/dataquest/system/env/python3/lib/python3.4/site-packages/pandas/core/ops.py in safe_na_op(lvalues, rvalues)
708 if is_object_dtype(lvalues):
709 return libalgos.arrmap_object(lvalues,
--> 710 lambda x: op(x, rvalues))
711 raise
712
pandas/_libs/algos_common_helper.pxi in pandas._libs.algos.arrmap_object()
/dataquest/system/env/python3/lib/python3.4/site-packages/pandas/core/ops.py in <lambda>(x)
708 if is_object_dtype(lvalues):
709 return libalgos.arrmap_object(lvalues,
--> 710 lambda x: op(x, rvalues))
711 raise
712
TypeError: can only concatenate list (not "int") to list