Guided Project: Building a Spam Filter with Naive Bayes

Hi everyone,

Wow, it seems that the topic of my other project is not super-popular in the Community! :grinning:

Anyway, it’s about a very necessary tool nowadays: a spam filter for SMS messages. It turned out to be quite an accurate filter, with 98.74% accuracy. However, making the system more complicated by taking the letter case of the words into account made it much less accurate, so I did not pursue this experiment further.
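For readers who have not seen the project, here is a minimal sketch of how such a filter can work. It is not the notebook's code: the function names, the toy messages, and the smoothing parameter `alpha` are assumptions made for illustration. It implements a multinomial Naive Bayes classifier with additive smoothing, and it lowercases every message, which corresponds to the case-insensitive variant mentioned above.

```python
import math
import re
from collections import Counter


def tokenize(message):
    # Replace non-word characters with spaces, lowercase, split on whitespace.
    return re.sub(r"\W", " ", message).lower().split()


def train(messages, labels, alpha=1):
    # Count how often each word occurs in spam and in ham messages.
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter(labels)
    for message, label in zip(messages, labels):
        word_counts[label].update(tokenize(message))
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, label_counts, vocabulary, alpha


def classify(message, model):
    word_counts, label_counts, vocabulary, alpha = model
    total_messages = sum(label_counts.values())
    scores = {}
    for label in ("spam", "ham"):
        n_words = sum(word_counts[label].values())
        # Work with log-probabilities to avoid numeric underflow.
        score = math.log(label_counts[label] / total_messages)
        for word in tokenize(message):
            if word in vocabulary:  # words never seen in training are ignored
                smoothed = (word_counts[label][word] + alpha) / (
                    n_words + alpha * len(vocabulary)
                )
                score += math.log(smoothed)
        scores[label] = score
    return max(scores, key=scores.get)


# Toy training data (hypothetical, only to make the sketch runnable).
toy_messages = [
    "WIN a free prize now",
    "Free entry, win cash!",
    "Are we meeting for lunch?",
    "See you at home tonight",
]
toy_labels = ["spam", "spam", "ham", "ham"]

model = train(toy_messages, toy_labels)
spam_guess = classify("Win FREE cash", model)
ham_guess = classify("See you at lunch", model)
```

With a real dataset, accuracy would be measured by running `classify` over a held-out test set and comparing the predictions with the true labels.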

I also investigated the very few messages that the filter classified wrongly and found some features they have in common. It seems that spam senders have a clear idea of how spam filters work, so they have also figured out ways to get around them :smirk:

Additionally, I created a word cloud to display the 100 most “spamish” words and found patterns in them. Also, throughout the project, I used pretty-printing a lot to better display numbers, tables, and sections. And for writing pieces of formulas in markdown, I found a good trick for displaying subscripts and superscripts (I used only the former). If you need to know it as well, ask me :yum: Hopefully, you’ll find these things useful for your future projects as well.
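A rough sketch of how the input for such a word cloud could be prepared, using only the standard library. The ranking criterion here (raw word frequency in spam messages) is an assumption and may differ from the one used in the notebook; the function name and toy data are also hypothetical.

```python
import re
from collections import Counter


def top_spam_words(messages, labels, n=100):
    # Count words in spam messages only, ignoring case and punctuation.
    counts = Counter()
    for message, label in zip(messages, labels):
        if label == "spam":
            counts.update(re.sub(r"\W", " ", message).lower().split())
    # Keep the n most frequent words as a {word: count} dictionary.
    return dict(counts.most_common(n))


# Toy data (hypothetical, only to make the sketch runnable).
toy_messages = ["WIN a FREE prize", "free cash, win now", "lunch at noon?"]
toy_labels = ["spam", "spam", "ham"]

frequencies = top_spam_words(toy_messages, toy_labels, n=3)
```

A dictionary like `frequencies` can then be passed to the `wordcloud` library's `WordCloud().generate_from_frequencies(...)` on a machine where that library is installed.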

Any feedback from you is very welcome. What can be improved / modified / optimized in terms of code, storytelling, project structure? Or if you find any typos, discrepancies, issues, etc., please let me know.

Thanks a lot in advance!

https://app.dataquest.io/m/433/guided-project%3A-building-a-spam-filter-with-naive-bayes/10/next-steps

Building a Spam Filter with Naive Bayes.ipynb (572.9 KB)

This is such an excellent project. I found it so motivating and, more importantly, I enjoyed your additional effort on the wrongly classified texts. Thanks so much for sharing this.

Thanks a lot for your appreciation @sulaiman2001ng1, I’m very glad that my project was useful! :heavy_heart_exclamation:

Hi!
I have a problem with both the Spam Filter project and the solution proposed here: Python and/or Pandas complain about the step
training_set['SMS'] = training_set['SMS'].str.replace('\W', ' ')
The error message (which I get both on my own attempts and on the solution notebook linked above) leaves me with the impression that either Pandas or Python has become more picky about the slicing syntax. Any ideas how to resolve this? Thanks in advance for any suggestions :wink:

Hi @atonalactuary,

Welcome to the Community!

Your code looks ok; in fact, I used the same line of code in my project :slightly_smiling_face: Let’s try the following: open a new Jupyter notebook in the same folder, copy into it all the code cells from the solution notebook up to and including the line above, and run this “draft” notebook. Does the error still appear? What exactly does that error say, by the way?

Hi @Elena_Kosourova, thanks a lot for your kind reply!

When I open the link https://app.dataquest.io/m/433/guided-project%3A-building-a-spam-filter-with-naive-bayes/10/next-steps in a new window in my web browser, I get the following error message from the code chunk with the string replacement code:

## my code:
data_train['SMS'] = data_train['SMS'].str.replace('\W', ' ')
data_train['SMS'] = data_train['SMS'].str.lower()
data_train.head()

## error message:
/dataquest/system/env/python3/lib/python3.4/site-packages/ipykernel/__main__.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':
/dataquest/system/env/python3/lib/python3.4/site-packages/ipykernel/__main__.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app

(I am using the Dataquest system from a Mac with the current macOS – but I guess that this should not make a difference here, since the Python computing is happening in the cloud anyways?)

Thanks again!

Hi @atonalactuary,

Are you working on your project on the DQ platform or on your local machine with Anaconda? To debug this error, it can be useful to create a new notebook on your local machine and paste in all the code from the solution notebook up to and including the problematic line. This way, you can easily see where the issue is.

Anyway, I strongly suspect that the issue appeared earlier, not exactly in that line. Could you please share your previous code? Maybe in the form of the draft notebook. I’ll take a look and let you know.

Hi @Elena_Kosourova, thanks again for your kind reply. There was in fact a problem in some previous code – to be precise, when doing reset_index(), I did not do drop = True. Debugging stuff like this reminds me of how much I miss the RStudio tools when exploring the brave new world of Python …

(To answer your question: I am using the DQ platform for the guided projects.)

Thanks again!
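For anyone who runs into the same SettingWithCopyWarning, below is a minimal sketch of the split-and-clean step, assuming a DataFrame with `Label` and `SMS` columns. The toy data, the 80% split, and the variable names are made up for illustration. It uses `reset_index(drop=True)` as in the fix described above and, as an extra precaution that is not part of that fix, takes an explicit `.copy()` before assigning to the column. It also passes a raw string with `regex=True`, since recent pandas versions treat the pattern in `str.replace` as a literal string by default.

```python
import pandas as pd

# Toy dataset (hypothetical, only to make the sketch runnable).
sms = pd.DataFrame(
    {
        "Label": ["ham", "spam", "ham", "spam", "ham"],
        "SMS": [
            "Hello, there!",
            "WIN a FREE prize!!!",
            "See you at 5?",
            "Call NOW: 0800-123",
            "ok :)",
        ],
    }
)

# Shuffle the rows, then keep the first 80% as the training set.
randomized = sms.sample(frac=1, random_state=1)
split = round(len(randomized) * 0.8)

# drop=True discards the old index instead of adding it as a column;
# .copy() makes the training set an independent DataFrame.
training_set = randomized[:split].reset_index(drop=True).copy()

# Replace non-word characters with spaces, then lowercase everything.
training_set["SMS"] = (
    training_set["SMS"].str.replace(r"\W", " ", regex=True).str.lower()
)
```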

Hi @atonalactuary,

That’s great that you found and fixed the issue!

By the way, I’d suggest that in the future you consider working on your guided projects on your local machine, with Anaconda. I’m saying this from my personal experience, as well as from other learners’ experience, judging by the issues mentioned here in the Community :slightly_smiling_face: Some projects are notoriously glitchy when run on the DQ platform (for example, the one on NYC schools), to the point of losing all the work done so far. So it’s much safer and faster to work in Anaconda.

Happy learning! :nerd_face:

Hi @Elena_Kosourova, it is good to learn a new skill like the word cloud from your project. I tried to do it in my project but it does not work… I used the DQ platform, and it does not allow me to import word cloud…

I also tried to insert a photo of the formula into the notebook, but it does not work either… Can you help with it? Thank you!!

Hi @candiceliu93,

I suggest using the word cloud library on your local machine with Anaconda, where it works perfectly. On the DQ platform, unfortunately, you can’t load it.

As for inserting a photo of a formula (or any other photo), I once had the same question. Here is what Wilfried suggested to me:

About inserting a photo, can you be more specific? Do I just need to copy the code below, select markdown, and then drag the photo in?

You should create a new markdown cell in your project, without writing anything in it. Then, drag and drop any screenshot or photo from your computer into it. The information about that photo’s file (i.e. the file name and extension) will appear in the markdown cell. Now, just run it.
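As an alternative to a photo, a formula can also be typed directly in a markdown cell, since Jupyter renders LaTeX placed between double dollar signs. Below is a sketch of how the usual Naive Bayes spam formulas with additive smoothing might be written. The symbol names ($N_{w_i \mid Spam}$, $N_{Spam}$, $N_{Vocabulary}$, $\alpha$) are assumptions for illustration and may differ from the notation in the project notebook.

```latex
$$
P(Spam \mid w_1, w_2, \ldots, w_n) \propto P(Spam) \cdot \prod_{i=1}^{n} P(w_i \mid Spam)
$$

$$
P(w_i \mid Spam) = \frac{N_{w_i \mid Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
$$
```

In this notation, an underscore (`_`) produces a subscript and a caret (`^`) produces a superscript.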