Build a spam filter with Naive Bayes

For this project, the dataset is divided into a train set and a test set, but before that, randomisation is done. The percentage of spam and ham in the train and test sets would be the same even if randomisation were not done. In that case, why is randomisation done at all?

Link :
https://app.dataquest.io/m/433/guided-project%3A-building-a-spam-filter-with-naive-bayes/3/letter-case-and-punctuation

Consider an alternative example.

You have a dataset of images of 2 fruits - Apple and Banana. You have a total of 10 images, 5 Bananas and 5 Apples. So, the dataset is -

Banana, Banana, Banana, Banana, Banana, Apple, Apple, Apple, Apple, Apple

You need to have a training set and a test set. The test set is to only ensure that you can, as the name suggests, test the model you trained to be able to identify if the image is of an Apple or a Banana.

You decide your training set will have 7 images, and your test set will have 3 images.

This ends up being your training set -

Banana, Banana, Banana, Banana, Banana, Apple, Apple

So, your test set would be

Apple, Apple, Apple

Do you notice the problem?

Your training set has only 2 instances of Apple, and the rest are all Bananas. Your model is likely to end up learning more about Bananas than Apples. What happens when there are 1,000 images, 500 Bananas and 500 Apples?
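The problem with a straight sequential split is easy to see in code. A minimal sketch of the fruit example above, splitting the ordered dataset without any shuffling:

```python
# Hypothetical fruit dataset, ordered by class (as in the example above)
dataset = ["Banana"] * 5 + ["Apple"] * 5

# Naive sequential split: first 7 images for training, last 3 for testing
train_set = dataset[:7]
test_set = dataset[7:]

print(train_set.count("Banana"), train_set.count("Apple"))  # 5 2
print(test_set)  # ['Apple', 'Apple', 'Apple']
```

The training set is dominated by Bananas, and the test set contains nothing but Apples — exactly the imbalance described above.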

So, when you evaluate on the test set, the model is likely to perform poorly, mislabelling Apples as Bananas, because it didn't learn enough from the data to differentiate between a Banana and an Apple.

This is a very simplified example; the outcome depends on how your dataset is ordered and how many samples of each category it contains.

But that’s where randomization can help. By randomizing, you try to ensure the above situation is unlikely to happen, especially for large datasets. If you randomize the above scenario, you might get a training set with 4 Bananas and 3 Apples. Slightly better. And with a dataset of thousands of images, the distribution of categories gets even closer to balanced.
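Shuffling before splitting is a one-line change to the sketch above. A minimal version (the seed value is an arbitrary choice, used only to make the shuffle reproducible):

```python
import random

# Same hypothetical ordered fruit dataset as before
dataset = ["Banana"] * 5 + ["Apple"] * 5

random.seed(1)           # assumption: fixed seed, purely for reproducibility
random.shuffle(dataset)  # randomize the order in place before splitting

train_set = dataset[:7]
test_set = dataset[7:]

print(train_set)
print(test_set)
```

Each class now has a chance of landing in both sets, so the training set is far less likely to be dominated by a single category.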

It’s essentially to ensure that when we train a model, one particular category doesn’t “overwhelm” it, causing the model to learn more about that one category than the other(s).

The same is true for the test set. If your test set consisted mostly of one category, you might not get a clear picture of how well your model is doing.

In this guided project, we have a lot more ham than spam. And given how the dataset is structured, it’s not as neatly ordered as the example I presented above. So, randomization might not have that much of an impact.
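For a dataset like the one in this project, the shuffle-then-split step is often done with pandas. A sketch using a tiny made-up frame (the `Label`/`SMS` column names and the 80/20 ratio are assumptions for illustration):

```python
import pandas as pd

# Hypothetical miniature version of an SMS dataset, with more ham than spam
sms = pd.DataFrame({
    "Label": ["ham", "ham", "spam", "ham", "spam",
              "ham", "ham", "spam", "ham", "ham"],
    "SMS": ["some message"] * 10,
})

# sample(frac=1) returns all rows in a random order;
# random_state makes the shuffle reproducible
randomized = sms.sample(frac=1, random_state=1).reset_index(drop=True)

# 80/20 split on the shuffled rows
split_index = round(len(randomized) * 0.8)
train = randomized[:split_index].reset_index(drop=True)
test = randomized[split_index:].reset_index(drop=True)

print(train["Label"].value_counts())
print(test["Label"].value_counts())
```

Because the rows are shuffled first, both sets tend to reflect the overall ham/spam proportions rather than whatever order the file happened to be in.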

But, in general, randomizing the dataset before dividing it into training and test sets is usually preferable. Do note that this depends on the type of dataset as well — for time-series data, for example, shuffling is usually avoided because the order itself carries information. I won’t get into more details as you might learn about this slowly, over time. Hopefully the above helps in some way.
