Guided project - Building a spam filter with Naive Bayes

Hi! I’m Ana and I finished my guided project about Naive Bayes. I followed the instructions and at the end I also tested it with a scikit-learn model. It’s my first time using this library and I’m unsure about its procedures. If anyone could give me feedback on it I would be very happy, and it would help a lot :smiling_face:

The link to the notebook:

@analuizallmp congrats on completing the project :handshake:. At the outset, I’ve got to mention a couple of points that set this project apart:

  • Yours is the first project where I have seen an attempt to identify the cause behind the false positive. That may partly be because I have not reviewed many other submissions for this project. I did analyse the same for this project.
  • I liked how you went ahead and used the scikit-learn library for the classification, going so far as to use hyper-parameter optimization and seeing some positive results from it. I have never used that library and was not aware of its existence.

That being said, I have a few pointers that I hope will improve your project.

Presentation Style
  • In the Implementation section you mention that the algorithm will work based on two equations. I would encourage you to expand on them in your own words. Interviewers usually notice when equations are stated without explanation, because it can come across as a lack of understanding. An explanation shows that you have worked through the equations yourself and validated your own understanding, which is appreciated.
  • I noticed that your mark-up failed to sub-script the text in the same section. You might want to take a look at it. I’m talking about this part:
where we need the constants:

    Nwi|Spam</sub> — the number of times the word wi occurs in spam messages;
    Nwi|Ham</sub> — the number of times the word wi occurs in ham messages;

Check whether you can sort this out. If you are unable to do so, let me know. I did get it to work properly on my file.
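One possible fix, assuming the notebook renders Markdown cells with LaTeX math the way Jupyter does: the stray `</sub>` suggests an HTML subscript tag without its opening `<sub>`, and writing the constants as inline math avoids the tags altogether. The symbols are the ones quoted above; treat this as a sketch, not as the way your notebook has to be written:

```latex
$N_{w_i|Spam}$: the number of times the word $w_i$ occurs in spam messages;
$N_{w_i|Ham}$: the number of times the word $w_i$ occurs in ham messages;
```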

  • Be sure to round your output, especially percentages like the result of cell[37]. Rounded values make the project easier to read and more aesthetically pleasing.
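As a rough illustration of the rounding point, here is a minimal sketch. The helper name `format_percentage` and the counts are hypothetical and not taken from your notebook:

```python
def format_percentage(fraction, decimals=2):
    """Format a fraction (e.g. 0.986) as a percentage string (e.g. '98.60%')."""
    return f"{fraction * 100:.{decimals}f}%"


# Hypothetical counts, for illustration only (not the notebook's results)
correct = 493
total = 500
accuracy = correct / total

print("Accuracy:", format_percentage(accuracy))
```

An f-string format spec only changes how the number is displayed, so the full-precision value stays available for any later calculation.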
Coding Style
  • Nothing to add here besides what I mentioned earlier regarding the use of the related libraries. That is commendable.
  • I haven’t gone too deep into the code, as you have most of the expected results.
  • I would suggest that you look for another dataset similar to this one and repeat the exercise. This will strengthen your understanding and show whether the excellent results you obtained here hold up for other datasets of a similar type.

Overall you have done a commendable job :100: and I hope to see more as you go forward. Keep going :rocket:


Thank you for your feedback, it really helps my understanding of the project! I’ll work on fixing these issues :wink: