Entropy formula - why use 2 3 5?

Hey @xuehong.liu.pdx,

As the community is moving from Slack to Discourse, you are encouraged to post your questions here in the Discourse community.

Answering your question from slack:

Hi I am at Introduction to Decision Trees step 9. “https://app.dataquest.io/m/89/introduction-to-decision-trees/9/overview-of-data-set-entropy”. Could someone explain why in the following equation “entropy = -(2/5 * math.log(2/5, 2) + 3/5 * math.log(3/5, 2))”, numbers 2, 3 and 5 were used? This makes sense for the example at step 8 where there were two 0’s and three 1’s. But for the problem at step 9, I checked, there are 24720 0’s and 7841 1’s in the income[“high_income”] column.

As I understand it, your question is about how step 8 and step 9 are linked. The only link is that both steps use the same entropy formula. Step 8 is a worked example on a small sample, and step 9 applies the formula to the full data set, so the two steps have different sample sizes.

For a detailed explanation of step 8 and step 9, see below.

For Step 8:

Why 2, 3, 5?

Given a sample of 5 rows with a high_income column:

age high_income
25 1
50 1
30 0
50 0
80 1

The example gives us a sample of high_income = [1, 1, 0, 0, 1].
There are 3 ones and 2 zeros.

We observed the following:

sample size n = 5

P(x=1) = number of 1s / sample size = 3/5

P(x=0) = number of 0s / sample size = 2/5

The problem uses log base 2 in the entropy formula, so entropy is measured in bits; with only two possible outcomes, the result then falls between 0 and 1.

a = P(x=1) = 3/5
b = P(x=0) = 2/5

Using the entropy formula:

entropy = -(a * math.log(a, 2) + b * math.log(b,2))

That’s where 2, 3, and 5 come from.
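As a quick check of the step 8 arithmetic, here is a minimal sketch that counts the labels and applies the same formula. The helper name `entropy` and the variable `step8_entropy` are my own, not part of the Dataquest exercise.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy in bits of a sequence of class labels (hypothetical helper)."""
    n = len(labels)
    counts = Counter(labels)  # e.g. {1: 3, 0: 2} for the step 8 sample
    # Sum p * log2(p) over each distinct label, then negate.
    return -sum((c / n) * math.log(c / n, 2) for c in counts.values())


# The five high_income values from the step 8 table.
high_income = [1, 1, 0, 0, 1]
step8_entropy = entropy(high_income)
print(step8_entropy)  # roughly 0.971
```

Because the function works from counts, the same code gives the same result as writing out 2/5 and 3/5 by hand.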

For Step 9:

Probability of x in high_income = number of rows where high_income equals x / total number of rows

prob = lambda x: income[income["high_income"] == x].shape[0]/income.shape[0]

Use the lambda above to compute the probabilities a and b, where a = P(x=0) and b = P(x=1):

a, b = prob(0), prob(1)

Using the entropy formula:

income_entropy = -(a * math.log(a, 2) + b * math.log(b,2))

You can recover the number of 0s and 1s by multiplying each probability by the number of rows.

print(a*income.shape[0])

0s = 24720.0

print(b*income.shape[0])

1s = 7841.0

And yes, you are correct: there are 24720 zeros and 7841 ones.
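If you want to confirm the step 9 result without loading the data set, here is a small sketch that starts from the two counts quoted in this thread. The variable names (`zeros`, `ones`, `income_entropy_check`) are my own, and the counts are taken from your message rather than recomputed from the `income` DataFrame.

```python
import math

# Counts reported in this thread for income["high_income"].
zeros, ones = 24720, 7841
total = zeros + ones  # number of rows in the data set

# Same probabilities as prob(0) and prob(1) in the step 9 solution.
a = zeros / total
b = ones / total

# Same entropy formula as step 8, only with different probabilities.
income_entropy_check = -(a * math.log(a, 2) + b * math.log(b, 2))
print(total)
print(round(income_entropy_check, 3))  # roughly 0.796
```

The formula is identical to step 8; only the counts change from 2 and 3 out of 5 to 24720 and 7841 out of 32561.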

Hope it helps! Let me know if you need further help.

Thanks for clarifying. It would be better to delete “entropy = -(2/5 * math.log(2/5, 2) + 3/5 * math.log(3/5, 2))” from the answer, since it is not part of the solution and was already in the instructions.


Hey @xuehong.liu.pdx,

We have a "solved" feature that lets you mark a reply as the correct answer, which helps future students with the same question quickly find the solution they're looking for.

Here’s an article on how to mark posts as solved - I don’t want to do this for you until I know that solution/explanation works.