Spam filter not working properly, traced the issue to slide 5 (word_counts_per_sms)

Hello everyone.

Screen Link:

My Code:

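import pandas as pd

# Assumes `train` is a DataFrame whose 'SMS' column holds the raw message strings.
vocabulary = []

# Split each message into a list of words.
train['SMS'] = train['SMS'].str.split()

for sms in train['SMS']:
    for word in sms:
        vocabulary.append(word)

# Remove duplicates.
vocabulary = list(set(vocabulary))

# One list of per-message counts for every unique word.
word_counts_per_sms = {unique_word: [0] * len(train['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(train['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1

word_counts = pd.DataFrame(word_counts_per_sms)
word_counts.head()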


What I expected to happen:


What actually happened:

Please see the first reply, that’s the screenshot. It’s not uploading here for some reason. Also, the Kernel keeps dying every 3 minutes.

Something is wrong with this output, and the error propagates through the rest of my work, so I end up without a properly working filter. Whatever the issue is, it is here, but I can’t find it even though I have looked at it closely. I would appreciate any help.

https://app.dataquest.io/m/433/guided-project%3A-building-a-spam-filter-with-naive-bayes/8/classifying-a-new-message

image

I cannot find the problem, but in older Python versions the order of dictionary keys may vary, while in newer versions (3.7 and later) insertion order is preserved.

Judging from your screenshot, I can only guess that the column order changed, but there is no problem with the DataFrame that was created.
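
To illustrate the point about ordering, here is a minimal sketch using made-up messages (not the course dataset). The helper `count_words` is a hypothetical name that mirrors the counting loop from the question. The same counts are built with the keys in two different orders, and the comparison suggests that only the order differs, not the contents.

```python
# Toy messages, already split into words (hypothetical data, not the SMS dataset).
messages = [["free", "prize", "now"], ["see", "you", "now"]]


def count_words(messages, vocabulary):
    """Build {word: [count in message 0, count in message 1, ...]}."""
    counts = {word: [0] * len(messages) for word in vocabulary}
    for index, message in enumerate(messages):
        for word in message:
            counts[word][index] += 1
    return counts


vocabulary = sorted({word for message in messages for word in message})
forward = count_words(messages, vocabulary)
backward = count_words(messages, list(reversed(vocabulary)))

# The key order differs, but the word -> counts mapping is the same.
print(list(forward) == list(backward))  # False
print(forward == backward)              # True
```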

Hey Eashwary,

Thanks so much for your response. The issue is that it doesn’t filter the spam properly either. I will keep looking at it, but I would appreciate insight from anyone else who has had similar issues.

Cheers
Ramin

Hello!

I had a problem here as well. After investigating, I found that the for loop was iterating over the characters in the string instead of the words.

For example:
sms = 'sms like'
for word in sms:
    print(word)
output = s, m, s, , l, i, k, e
instead of = sms, like

I fixed this by doing:
for word in sms.split():
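
A small runnable version of the example above, as a sketch: it collects the pieces into lists so the difference between iterating over a string and iterating over its split words is easy to compare. The variable names `characters` and `words` are only for illustration.

```python
sms = "sms like"

# Iterating over a string yields one character at a time.
characters = [piece for piece in sms]

# Splitting first yields whole words.
words = [piece for piece in sms.split()]

print(characters)  # ['s', 'm', 's', ' ', 'l', 'i', 'k', 'e']
print(words)       # ['sms', 'like']
```

Whether this applies depends on whether the 'SMS' column still holds strings or already holds lists of words at that point in the notebook; checking `isinstance(sms, str)` inside the loop is one way to find out.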

Hello! I have the same problem! Did you manage to solve it?

I am having the same issue after converting the dictionary to the DataFrame. Still looking for ways to fix this.

Hi,

Were you able to figure out the issue?
I have been facing the same issue. I rechecked the entire code and even compared it to the solution notebook. I cannot seem to find any issue. The only difference is the table for word counts.
I have not been able to figure out the reason yet.

There really doesn’t seem to be an issue here with the table. The table only looks different because the columns are ordered differently. It’s still the same table. This happens because of how Python dictionaries order their keys.

What exactly is the issue that you are facing other than that?

Hi,

I am having trouble classifying the messages as Spam or Ham. Both trial messages get classified as ham. I cross-checked with the solution as well. The only significant difference I found is this particular table. Hence, I wanted to verify whether this was a legitimate issue.

OK. Then I would suggest creating a new question about this and, in that new post, either attaching your project’s Jupyter Notebook file (Sharing Your Guided Project in the Community) or uploading your Notebook to a GitHub repository and sharing the link to your repository.

I can then try to look at the code later and see what the issue might be.

Hi,

Sure. I will do that. Thank you so much for taking the time to help me with this.

The gentleman who originated this thread summed up the problem I am having perfectly! I have been over it many times and looked at the solution, but the problem remains. Is there a solution to this?

I notice the moderator wanted to start a new thread to solve it. I have looked and cannot find a new thread. If all the information is here, why do we need a new thread? Can we please solve it here?

Any help would be greatly appreciated. Thank you.

I also had this problem. I was able to resolve it by specifying the column names when I was forming the DataFrame.

word_count = pd.DataFrame(word_counts_per_sms, columns = vocabulary)
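
A minimal sketch of what the `columns` argument does, using hypothetical toy counts rather than the course data. One assumption that goes beyond the post: the vocabulary here is sorted, which gives the same column order on every run. The post above passes the vocabulary list as it was built.

```python
import pandas as pd

# Hypothetical toy counts; the keys are deliberately not in alphabetical order.
word_counts_per_sms = {"now": [1, 1], "free": [1, 0], "you": [0, 1]}

# Assumption: a sorted vocabulary, so the column order is reproducible.
vocabulary = sorted(word_counts_per_sms)

default_order = pd.DataFrame(word_counts_per_sms)
fixed_order = pd.DataFrame(word_counts_per_sms, columns=vocabulary)

print(list(default_order.columns))  # ['now', 'free', 'you']
print(list(fixed_order.columns))    # ['free', 'now', 'you']

# Same data either way: reordering the columns makes the two tables equal.
print(default_order[vocabulary].equals(fixed_order))  # True
```

In this sketch the `columns` argument only changes the order in which the columns appear. The counts themselves are unchanged.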

Hello everyone,

I also had a similar problem. The code appeared to be the same as in the solution (at least in critical parts), but the classifier worked incorrectly, e.g. returning ‘Ham’ for both testing messages.

The issues with this project have nothing to do with the order of columns from slide 5.

My advice:

  1. Carefully check that the replacements made using RegEx are 100% correct. It is very easy to write a pattern that does something different from what you expect.
  2. Carefully (I say, CAREFULLY) check the parts of the code where the parameters are calculated via formulas. There is a lot of copy-and-paste (CTRL + C, CTRL + V) in this project, because we have to calculate similar parameters for both Spam and Ham, so it is very easy to slip up when replacing the Spam parameters with the Ham parameters. In my case, P(wi|Ham) was calculated incorrectly because there was ‘n_spam’ and not ‘n_ham’ in the denominator. It was very hard to notice this bug.
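
One way to reduce the copy-and-paste risk described in point 2 is to compute the parameters for both classes with a single helper, sketched below. The function name `word_probabilities`, the toy counts, and the additive-smoothing formula (count + alpha) / (n_class + alpha * n_vocabulary) are assumptions for illustration. Check them against the formula your own notebook uses.

```python
def word_probabilities(word_counts, n_class, n_vocabulary, alpha=1):
    """Return {word: P(word | class)} with additive (Laplace) smoothing.

    word_counts: {word: number of times the word occurs in this class}
    n_class: total number of words in all messages of this class
    """
    denominator = n_class + alpha * n_vocabulary
    return {word: (count + alpha) / denominator
            for word, count in word_counts.items()}


# Hypothetical toy counts, not taken from the SMS dataset.
counts = {
    "spam": {"free": 3, "hello": 0},
    "ham": {"free": 1, "hello": 5},
}
n_vocabulary = 2

# Each class total is derived from that class's own counts, so the spam
# total cannot end up in the ham denominator.
parameters = {
    label: word_probabilities(class_counts, sum(class_counts.values()), n_vocabulary)
    for label, class_counts in counts.items()
}

print(parameters["spam"]["free"])  # (3 + 1) / (3 + 2) = 0.8
print(parameters["ham"]["free"])   # (1 + 1) / (6 + 2) = 0.25
```

Because the same function runs for Spam and Ham, a wrong denominator would affect both classes in the same way, and a quick check that each class's probabilities sum to 1 over the vocabulary would catch it.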

Hope this helps.