Sharing "Building a Spam Filter with Naive Bayes"

Hello Community,

I have just completed Dataquest's guided project "Building a Spam Filter with Naive Bayes". Here is a link to the last page of the guidelines:

This was a very nice project, and I was actually quite surprised by the outcome. Here is my notebook:
BuildingSpamFilterNaiveBayes.ipynb (215.4 KB)
A link to view this in the Notebook viewer can be found at the bottom of this post.

As always, any feedback is welcome and appreciated!

Best regards,

Click here to view the Jupyter notebook file in a new tab


Hi @jasperquak

Cool project :+1: It showcases your effort to build a good narrative. The detailed explanation, plus the write-up in a semi-formal, talking-to-a-peer manner, makes it interesting too.

I won't go into many details, but have you given any thought to rounding off the probabilities before comparing them in the classify function? Should we expect similar or different results depending on the precision selected?
Based on the conclusion section, do you suppose this is the only method with which we could classify messages at the accuracy achieved? Have you explored alternative theories/methods/algorithms? (I haven't done this myself, so I'm asking out of curiosity, not as criticism :grin:)
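To make the rounding question concrete, here is a hypothetical sketch (the function signature and the probability values are made up for illustration, not taken from your notebook). Because the posteriors are often tiny and close together, rounding them before comparison can turn a clear decision into a tie:

```python
def classify(p_spam, p_ham, precision=None):
    """Compare the two posteriors, optionally rounding them first."""
    if precision is not None:
        p_spam = round(p_spam, precision)
        p_ham = round(p_ham, precision)
    if p_spam > p_ham:
        return "spam"
    elif p_ham > p_spam:
        return "ham"
    return "needs human classification"

# Two made-up posteriors that only differ in the 4th decimal place
p_spam, p_ham = 0.0014, 0.0012

print(classify(p_spam, p_ham))               # full precision -> "spam"
print(classify(p_spam, p_ham, precision=2))  # both round to 0.0 -> tie
```

So with a coarse enough precision, messages that the full-precision classifier labels confidently would instead fall into the "needs human classification" bucket.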

Your intro and the overall structure of the project are awesome and well-organized. Plus the comments for almost every code cell, summarizing what the code is doing :ok_hand:!

Please keep up your great work and thank you for sharing your project with DQ!

Hi @Rucha

Thank you for your feedback!

And let me reply to your good questions.

What I considered was something along the lines of "if p_spam and p_notspam for a message are relatively close to each other, then maybe it is too close to call", where "close to each other" would be measured as one divided by the other, I'd say.
However, when I looked at some actual numbers, I concluded that this may not work. Cell [55] shows examples (e.g. records 7 and 62) where p_spam and p_notspam are relatively close to each other, yet the classification was still correct. Meanwhile, cell [57] shows examples (e.g. records 869 and 2965) where p_spam and p_notspam differ much more, yet the classification was still incorrect. Seeing that made me dismiss this option, as it led me to expect that it would be very hard to reach a higher accuracy with such a measure. I did not analyze it deeply, though. Or maybe my thinking was incorrect?
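For concreteness, the ratio measure I had in mind could be sketched like this (the threshold and the example numbers below are made up, they are not from the notebook):

```python
def confident_classify(p_spam, p_ham, min_ratio=10.0):
    """Classify only when one posterior dominates the other by min_ratio."""
    larger, smaller = max(p_spam, p_ham), min(p_spam, p_ham)
    if smaller == 0 or larger / smaller >= min_ratio:
        return "spam" if p_spam > p_ham else "ham"
    return "too close to call"

print(confident_classify(3e-20, 1e-22))  # ratio 300 -> "spam"
print(confident_classify(2e-20, 1e-20))  # ratio 2   -> "too close to call"
```

The catch, as the cells above suggest, is that a small ratio does not reliably signal a wrong classification, so a threshold like this would mostly shrink coverage without improving accuracy.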

No, I haven't either. It's always a trade-off between taking your current project further and moving on to the next lessons and projects… I did the latter in this case.