27-4 Naive Bayes for Sentiment Analysis

In Dataquest’s Naive Bayes for Sentiment Analysis mission, screen 4, there is a prediction function that uses Laplace smoothing:

    import re
    from collections import Counter

    def make_class_prediction(text, counts, class_prob, class_count):
        prediction = 1
        text_counts = Counter(re.split("\s+", text))
        for word in text_counts:
            # For every word in the text, we get the number of times that word occurred
            # in the reviews for a given class, add 1 to smooth the value, and divide by
            # the total number of words in the class (plus class_count, also to smooth
            # the denominator)
            # Smoothing ensures that we don't multiply the prediction by 0 if the word
            # didn't exist in the training data
            # We also smooth the denominator counts to keep things even
            prediction *= text_counts.get(word) * ((counts.get(word, 0) + 1) / (sum(counts.values()) + class_count))
        # Now we multiply by the probability of the class existing in the documents
        return prediction * class_prob

using

    print("Negative prediction: {0}".format(make_class_prediction(reviews[0][0], negative_counts, prob_negative, negative_review_count)))

My questions are

  1. When Laplace smoothing is applied in the prediction equation we add 1 (k = 1), and in the denominator it should be as if we add one word to the whole dataset. So why is negative_review_count there? That is the number of negative reviews in the reviews array, and it has nothing to do with word counts.

  2. And when we add that extra word count, shouldn't it be based on all the words, negative and positive together?

Could someone please help clarify? Thanks


Hey, Ankit.

The takeaway from my reply is: you’re absolutely right. Nice job catching these.

First question

Let’s focus on the following sentence.

In the denominator it should be as if we add one word to the whole dataset

Actually, when we apply smoothing, what we want to do is pretend that each word occurred once more than it actually did.

Let’s pretend there were ten different words and that the word like didn’t occur in the negative reviews.

Just because it didn’t come up in the negative reviews up until now, it doesn’t mean that it never will. So, to account for this we add 1 in the numerator.

To compensate for this, we pretend that each word occurred one more time than it did (why should like be privileged?). So we add 10 to the denominator (because in this imaginary example we only have ten words).
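Here is that example in code. The counts are made up just for illustration:

    # Hypothetical word counts for the negative class; "like" never occurred
    negative_counts = {"bad": 4, "boring": 3, "slow": 2}  # 9 word occurrences in total
    vocab_size = 10  # ten different words in the whole dataset

    # Without smoothing: 0 / 9 = 0, which would zero out the whole product
    unsmoothed = negative_counts.get("like", 0) / sum(negative_counts.values())

    # With Laplace smoothing (k=1): add 1 to the numerator and 10 to the denominator
    smoothed = (negative_counts.get("like", 0) + 1) / (sum(negative_counts.values()) + vocab_size)

    print(unsmoothed, smoothed)  # 0.0 and 1/19, roughly 0.053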

Second question

Right. It should be the total number of distinct words in the whole dataset (both negative and positive), not negative_review_count, nor positive_review_count.
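For concreteness, here is a sketch of how the mission's function could look with the vocabulary size in the denominator. Note that vocabulary is a hypothetical argument here: the set of all distinct words across both the positive and the negative reviews, which you would have to build yourself.

    import re
    from collections import Counter

    def make_class_prediction(text, counts, class_prob, vocabulary):
        prediction = 1
        text_counts = Counter(re.split(r"\s+", text))
        for word in text_counts:
            # Add 1 to the word's count in this class, and add the number of distinct
            # words in the whole dataset (not the number of reviews) to the denominator
            smoothed = (counts.get(word, 0) + 1) / (sum(counts.values()) + len(vocabulary))
            prediction *= text_counts.get(word) * smoothed
        # Multiply by the prior probability of the class
        return prediction * class_prob

You would then call it with something like make_class_prediction(reviews[0][0], negative_counts, prob_negative, vocabulary).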

Curiosity and intuition

Just because it didn’t come up in the negative reviews up until now, it doesn’t mean that it never will.

This is known as the Sunrise Problem. If we take a look at the past, the probability that the sun will rise tomorrow is 100%. We know this isn’t true, so to account for this we can apply Laplace smoothing and add one fictitious data point to our training data where the sun didn’t rise.
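As a quick toy calculation (the number of days is made up), here is what adding that fictitious non-rise does to the estimate:

    n = 10_000              # days on which the sun rose, with no exceptions so far
    naive = n / n           # 1.0, absolute certainty that it always rises
    smoothed = n / (n + 1)  # about 0.9999, the fictitious non-rise leaves room for doubt

    print(naive, smoothed)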
