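# Split each message into a list of words and collect every word seen.
vocabulary = []
train['SMS'] = train['SMS'].str.split()
for sms in train['SMS']:
    for word in sms:
        vocabulary.append(word)
# Deduplicate; note that a set does not preserve any particular order.
vocabulary = list(set(vocabulary))
# One column per unique word, one row per message, initialised to zero.
word_counts_per_sms = {unique_word: [0] * len(train['SMS']) for unique_word in vocabulary}
for index, sms in enumerate(train['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1
word_counts = pd.DataFrame(word_counts_per_sms)
word_counts.head()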
Something is wrong with this answer, and it propagates through the rest of my work, so I end up without a properly working filter. Whatever the issue is, it is here, but I can’t find it even though I have looked at it closely. I would appreciate any help.
Thanks so much for your response. The issue is that it doesn’t filter the spam properly either. I will keep looking at it, but I would appreciate any insight from anyone else who has had similar issues.
Were you able to figure out the issue?
I have been facing the same issue. I rechecked the entire code and even compared it to the solution notebook. I cannot seem to find any issue. The only difference is the table for word counts.
I have not been able to figure out the reason yet.
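There really doesn’t seem to be an issue here with the table. The table only looks different because the columns are ordered differently. It’s still the same table. This happens because of how Python orders items: the vocabulary is built from a set, which has no guaranteed order, and the dictionary (and therefore the DataFrame columns) simply follows whatever order the vocabulary list ends up in.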
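One way to confirm this is to compare the two tables after sorting their columns. The following is a minimal sketch on made-up messages; `build_word_counts` is a hypothetical helper that mirrors the loop in the original post, not something from the project or the solution notebook.

```python
import pandas as pd

# Toy messages standing in for the real training set (made up for illustration).
messages = [["secret", "prize", "claim"], ["claim", "prize", "prize"]]


def build_word_counts(vocabulary, messages):
    """Build the word-count table for a given vocabulary ordering."""
    counts = {word: [0] * len(messages) for word in vocabulary}
    for index, sms in enumerate(messages):
        for word in sms:
            counts[word][index] += 1
    return pd.DataFrame(counts)


# Same words, different order, as can happen when the list comes from a set.
table_a = build_word_counts(["secret", "prize", "claim"], messages)
table_b = build_word_counts(["claim", "secret", "prize"], messages)

# The column order differs, so the two tables look different at a glance.
same_order = list(table_a.columns) == list(table_b.columns)
print(same_order)  # False

# After sorting the columns by name, the tables hold identical counts.
same = table_a.sort_index(axis=1).equals(table_b.sort_index(axis=1))
print(same)  # True
```

If the same comparison between your table and the solution's table comes out equal, the word counts are not the source of the misclassification.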
What exactly is the issue that you are facing other than that?
I am having trouble classifying messages as Spam or Ham. Both trial messages get classified as ham. I cross-checked with the solution as well. The only significant difference I found is this particular table, so I wanted to verify whether this was a legitimate issue.
OK. Then I would suggest creating a new question about this. In that new post, either attach your project’s Jupyter Notebook file (Sharing Your Guided Project in the Community), or upload your Notebook to a GitHub repository and share the link to your repository.
I can then try to look at the code later and see what the issue might be.
The gentleman who started this thread summed up the problem I am having perfectly! I have been over it many times and looked at the solution, but the problem remains. Is there a solution to this?
I notice the moderator wanted to start a new thread to solve it. I have looked and cannot find a new thread. If all the information is here, why do we need a new thread? Can we please solve it here?
I also had a similar problem. The code appeared to be the same as in the solution (at least in critical parts), but the classifier worked incorrectly, e.g. returning ‘Ham’ for both testing messages.
The issues with this project have nothing to do with the order of columns from slide 5.
My advice is:
Carefully check that the replacements made using RegEx are 100% correct. It is very easy to write a pattern that does something different from what you expect.
Carefully (I say, CAREFULLY) check the code parts where the parameters are calculated via formulas. There are a lot of CTRL + C, CTRL + V in this project (because we have to calculate similar parameters for both Spam and Ham), so it is very easy to fail when replacing the Spam parameters with the Ham parameters. In my case, P(wi|Ham) was calculated incorrectly because there was ‘n_spam’ and not ‘n_ham’ in the denominator. It was very hard to notice this bug.
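One way to avoid that copy-and-paste bug is to compute the parameters for both classes with a single function, so there is no `n_spam` or `n_ham` to mix up. This is only a sketch on made-up data: the column name `Label`, the values `spam` and `ham`, and the helper `word_probabilities` are assumptions for illustration, and the formula is the usual additive-smoothing estimate P(w|class) = (N_w|class + alpha) / (N_class + alpha * N_vocabulary).

```python
import pandas as pd

# Toy training set (made up); in the project this would be the cleaned
# training data joined with the word-count table.
train = pd.DataFrame({
    "Label": ["spam", "ham", "ham"],
    "prize": [2, 0, 0],
    "secret": [1, 1, 0],
    "lunch": [0, 1, 2],
})
vocabulary = ["prize", "secret", "lunch"]
alpha = 1  # Laplace smoothing


def word_probabilities(data, label, vocabulary, alpha=1):
    """Return {word: P(word | label)} using additive smoothing."""
    subset = data[data["Label"] == label]
    # Total number of words across all messages of this class.
    n_label = subset[vocabulary].sum().sum()
    denominator = n_label + alpha * len(vocabulary)
    return {word: (subset[word].sum() + alpha) / denominator for word in vocabulary}


p_spam = word_probabilities(train, "spam", vocabulary, alpha)
p_ham = word_probabilities(train, "ham", vocabulary, alpha)

# Sanity check: with the correct denominator, the probabilities over the
# whole vocabulary sum to 1 for each class. A swapped n_spam/n_ham breaks this.
spam_total = sum(p_spam.values())
ham_total = sum(p_ham.values())
print(abs(spam_total - 1) < 1e-9, abs(ham_total - 1) < 1e-9)
```

The final check is useful even if you keep separate Spam and Ham code: summing your own parameter dictionaries over the vocabulary should give 1 for each class, and a sum that is noticeably off points to a wrong denominator.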
I just tried this, and it does make the table look more like the example on the page and in the solution, but it didn’t increase my final accuracy at all.