Project Feedback: Naive Bayes Spam Filter

Hello DQ Community!

Do you have any feedback on my new project? How can I make my algorithm even more accurate?

One thing I noticed was that encoding punctuation didn’t have much impact on the algorithm’s accuracy. Did you find the same?

One thing that worked well was encoding phone numbers that were obviously attached to a business.

Let me know what you think!

https://app.dataquest.io/c/74/m/433/guided-project%3A-building-a-spam-filter-with-naive-bayes/10/next-steps

Building a Spam Filter with Naive Bayes.ipynb (1.0 MB)

@kevindarley2024 good job :smiley: :+1: on your project. You’ve deviated from the instructions and taken your own route on what should be included in the vocabulary. I’m referring to the code below:

business_phone = r"([0-9]{5}?)"
money_sign = r"([\$])"
euro_sign = r"([\£])"
semi_colon = r"([\:])"
exclamation_point = r"([\!])"
repeating_exclamation = r"([\!]{2,})"
elipsis = r"([\.]{2,})"
repeating_question = r"([\?]{2,})"
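
For anyone following along, here is a minimal sketch of how patterns like these might be used to encode a message before building the vocabulary. The token names, the order of substitution, and the `encode_message` helper are assumptions for illustration, not the notebook's actual code. Note that `£` is the pound sign and `:` is a colon, so the sketch uses neutral names for them.

```python
import re

# Hypothetical placeholder tokens; names and ordering are assumptions,
# not taken from the notebook.
ENCODINGS = [
    (re.compile(r"[0-9]{5}"), " businessphone "),       # any run of five digits
    (re.compile(r"!{2,}"), " repeatedexclamation "),    # "!!" or longer
    (re.compile(r"\?{2,}"), " repeatedquestion "),      # "??" or longer
    (re.compile(r"\.{2,}"), " ellipsis "),              # ".." or longer
    (re.compile(r"[$£]"), " currencysign "),            # dollar or pound sign
]

def encode_message(text):
    """Replace selected patterns with word tokens, then drop other punctuation."""
    for pattern, token in ENCODINGS:
        text = pattern.sub(token, text)
    text = re.sub(r"\W", " ", text)  # remaining punctuation becomes spaces
    return text.lower().split()

tokens = encode_message("WINNER!! Text 85233 to claim your £500 prize...")
print(tokens)
```

Because the digit pattern matches any five consecutive digits, it would also fire inside longer numbers, so a stricter pattern may be worth testing.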

The following are the pointers I have:

Presentation Style
  • Seeing as this project mostly consists of terminal outputs, you could format your output with color or bold text, e.g. your output for cell [21] could look like:
    Total number of events is 4458
    Check this for the same. This could help differentiate your code from your output, and could even let you move some of your comments into the output.
  • In cell [18] the output is long. You could prevent that by writing the code as
word_counts.head()

instead of

print(word_counts.head())

As the last line of a cell, this should render as a neat table. Otherwise readers have to scroll through long printed output, which can put some of them off.
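
One possible way to get bold or colored output is with ANSI escape codes, which most terminals and Jupyter render in printed output. This is only a sketch: the `highlight` helper and the variable name `total_events` are made up for illustration.

```python
# ANSI escape codes for styling printed text.
BOLD = "\033[1m"
GREEN = "\033[32m"
RESET = "\033[0m"

def highlight(label, value):
    """Return a line with the label in bold and the value in bold green."""
    return f"{BOLD}{label}{RESET} {BOLD}{GREEN}{value}{RESET}"

total_events = 4458  # figure quoted above for cell [21]
print(highlight("Total number of events is", total_events))
```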

Coding Style
  • You seem to have commented on each step to explain what you are doing, as in cell [25]. This can be dropped entirely in the final iteration (i.e. when you clean up this project and publish it). Your current comments should help during that review. In the final version you could use simple comments like *# calculate probabilities* for cell [24].
  • I feel it’s good practice to round your outputs instead of printing the non-rounded values, as in cell [11]. A simple .round() should help here. Also, since you are normalizing, it would be good to multiply the result by 100 to show percentages, because raw proportions could confuse non-technical readers.
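
A small sketch of the rounding suggestion, assuming the normalized values come from a pandas `value_counts(normalize=True)` call. The `Label` column name and the toy data are assumptions, not the notebook's contents.

```python
import pandas as pd

# Hypothetical labels standing in for the notebook's ham/spam column.
labels = pd.Series(["ham"] * 13 + ["spam"] * 2, name="Label")

proportions = labels.value_counts(normalize=True)  # raw fractions, e.g. 0.8666...
percentages = (proportions * 100).round(2)         # rounded percentages
print(percentages)
```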

Bugs/Inaccuracies
  • In the Results section you mention,

By doing this we were able to increase our algorithm’s accuracy by 1%.

But I could not find where you calculated an accuracy of 96%. This may be something I missed. I only say this because the only other accuracy I noticed was 80% in the Introduction section.
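
To make the comparison explicit, the accuracy before and after the extra encoding could be reported side by side. The sketch below uses invented toy labels and predictions, and the `accuracy` helper is hypothetical rather than the notebook's function.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that match the true labels."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

# Toy data, invented for illustration only.
labels = ["ham", "spam", "ham", "ham", "spam"]
baseline_preds = ["ham", "ham", "ham", "ham", "spam"]   # without extra encoding
tuned_preds = ["ham", "spam", "ham", "ham", "spam"]     # with extra encoding

baseline = accuracy(labels, baseline_preds)
tuned = accuracy(labels, tuned_preds)
print(f"Baseline accuracy: {baseline:.2%}")
print(f"Tuned accuracy:    {tuned:.2%}")
print(f"Change: {(tuned - baseline) * 100:+.2f} percentage points")
```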

Miscellaneous
  • You’ve come so far. I encourage you to analyze why 3% of the data was incorrectly predicted.
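
One way to start that analysis is to pull out the misclassified rows and cross-tabulate true against predicted labels. This is a sketch on an invented toy test set; the column names `Label`, `predicted`, and `SMS` are assumptions about the notebook's DataFrame.

```python
import pandas as pd

# Toy test set, invented for illustration; column names are assumptions.
test_set = pd.DataFrame({
    "Label": ["ham", "spam", "ham", "spam"],
    "predicted": ["ham", "ham", "spam", "spam"],
    "SMS": [
        "See you at 5",
        "Call 09061 now",
        "WIN a coffee with me?",
        "FREE prize!! txt 85233",
    ],
})

# Rows where the prediction disagrees with the true label.
wrong = test_set[test_set["Label"] != test_set["predicted"]]

# Rows = true label, columns = predicted label.
confusion = pd.crosstab(test_set["Label"], test_set["predicted"])

print(wrong[["Label", "predicted", "SMS"]])
print(confusion)
```

Reading the misclassified messages directly often shows whether the errors are mostly spam slipping through or ham being flagged.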

Keep going! You are blazing :fire:

jesmaxavier,

Thank you so much for taking the time to review my project and provide actionable feedback!

I think I’ll go back and add these to my project. For the bugs/inaccuracies section, I should have included the output without optimizations first, so I’ll add that in my final draft.

Thanks!
Kevin
