My project on building a spam filter using Multinomial Naive Bayes

Hello DQ community, I’m sharing yet another project with you. I also got to try it out in the real world by testing it on text messages my mum received.

Here’s a link to the last mission screen:
Learn data science with Python and R projects

And here’s a link to my notebook:
SMS spam filter.ipynb (51.2 KB)


Hi @abomayesan, thanks for sharing your project with the Community :slight_smile: Your project is very easy to read and understand. And well done on trying out the algorithm on real-world data. What were your results?

A few comments from my side:

  • It’s better to import all the modules in the first code cell so that readers can see up front which ones you use in the project
  • Why is the section named “Initial Data Exploration” styled as a link?
  • It’s better not to use a full stop at the end of each section name
  • Some code comments (like # transforming the vocabulary variable from a set back to a list) are too obvious. When commenting, you should assume that the reader has basic knowledge of Python and focus only on explaining the most difficult parts of the code. For example, the comment on creating a dictionary of unique words is good, but you may also consider adding some explanation of how it works (although this is not strictly necessary)
  • Why does your clean_training_set have two additional rows (7785 instead of 7783)?
  • You may consider displaying the content of the spam_dict and ham_dict dictionaries
  • Good job using docstrings, but consider following one of the common styles, such as NumPy style
  • The same goes for code comments: they should follow a uniform style (such as a space after the hash sign, #, and starting with a capital letter)
  • The functions classify and classify_test_set are nearly identical, do you really need both of them?
  • There is an easier way to compute the percentage of correct hits: take the length of a masked DataFrame where classification == "incorrect", divide it by the length of the original DataFrame, and subtract the result from 1 (or mask on the correct label directly)
  • You say there are a few things that could improve the algorithm, but then mention only one option. Do you have any other ideas on how to improve it?
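
To make a few of the points above more concrete (a NumPy-style docstring, one classify function reused for the whole test set, and accuracy from a masked DataFrame), here is a minimal sketch. It is not taken from your notebook: the names spam_probs, ham_probs, test_set and the toy probabilities are all hypothetical, and I am assuming the per-word probabilities P(word | class) have already been computed.

```python
import math

import pandas as pd


def classify(message, p_spam, p_ham, spam_probs, ham_probs):
    """Classify a single message as spam or ham.

    Parameters
    ----------
    message : str
        Raw SMS text.
    p_spam, p_ham : float
        Prior probabilities of the two classes.
    spam_probs, ham_probs : dict
        Hypothetical lookups mapping each vocabulary word to
        P(word | spam) and P(word | ham).

    Returns
    -------
    str
        "spam" if the spam score is higher, otherwise "ham".
    """
    # Sum log-probabilities instead of multiplying raw probabilities,
    # which avoids underflow on long messages.
    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)
    for word in message.lower().split():
        # Words outside the vocabulary are skipped.
        if word in spam_probs and word in ham_probs:
            log_spam += math.log(spam_probs[word])
            log_ham += math.log(ham_probs[word])
    return "spam" if log_spam > log_ham else "ham"


# Toy, made-up probabilities purely for illustration.
spam_probs = {"win": 0.4, "prize": 0.4, "hello": 0.1, "mum": 0.1}
ham_probs = {"win": 0.1, "prize": 0.1, "hello": 0.4, "mum": 0.4}

test_set = pd.DataFrame({
    "Label": ["spam", "ham", "ham"],
    "SMS": ["Win a prize", "Hello mum", "prize win hello"],
})

# The same function handles single messages and the whole test set.
test_set["predicted"] = test_set["SMS"].apply(
    lambda sms: classify(sms, 0.5, 0.5, spam_probs, ham_probs)
)
test_set["classification"] = (
    test_set["predicted"] == test_set["Label"]
).map({True: "correct", False: "incorrect"})

# Accuracy from the masked DataFrame: share of misses, subtracted from 1.
n_incorrect = len(test_set[test_set["classification"] == "incorrect"])
accuracy = 1 - n_incorrect / len(test_set)
print(f"Accuracy: {accuracy:.2%}")
```

Because classify takes one message and returns one label, a separate classify_test_set function is not needed in this sketch; apply does the looping.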

I hope my feedback was helpful. Happy coding :slight_smile:

Thank you. Your feedback was amazing and I’ll keep learning.

As for the results of the algorithm, when I had my friends play with it, one major weakness showed up: it classified almost every promotional SMS as spam. That made me think of another way to improve it, which is to train the algorithm on more data, collected from how my friends classify messages as spam or non-spam.
