Can you really take log2 of 0?

Link to mission

When computing the information gain in the “Introduction to Decision Trees” mission, the example includes a split in which none of the values match the target. So when you use the equation to calculate entropy for the 1/5 case and no rows have 0 in the target, you end up taking log2 of 0. Is this mathematically valid?

I thought that the outcome of that would be undefined, but the mission suggests that log2(0) is not undefined…
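For what it's worth, Python itself treats the expression as undefined on its own: a quick check (using only the standard `math` module) shows that `math.log2(0)` raises a domain error rather than returning a number.

```python
import math

# log2(0) is a domain error in Python, matching the "undefined" intuition
try:
    math.log2(0)
except ValueError as err:
    print("math.log2(0) raises:", err)  # math domain error
```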

I think it is 0 * log2(0) that equals 0. This is an interesting question.

Still though, is 0 * undefined mathematically valid?

This is why it is interesting. I eagerly await the perfect response to this.

Hey @ferchenkyle

Welcome to the Dataquest Community.

Yes, taking the log of zero gives you -inf (negative infinity), and that will not work in the equations.
The mission shows you how to calculate IG (Information Gain); it does not suggest you actually compute the log of zero.

You need to take care of it at implementation time, which is done on the next screen in the implementation of the calc_entropy function.

I hope this helps you.

I already understand that. My concern wasn’t getting through the mission; it is to know whether the equations used for entropy and information gain are valid. The way the calc_entropy() function in the mission works around this problem is to skip a term whenever its probability is 0, only doing the calculation if the probability is > 0:

import math

entropy = 0
# Loop through the probabilities, and add each one to the total entropy;
# terms with zero probability are skipped
for prob in probabilities:
    if prob > 0:
        entropy += prob * math.log(prob, 2)

In effect, it makes a term’s value 0 whenever its probability is 0.

I was hoping there was some background source in information theory that explains why this assumption is valid.
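To see why the workaround matches that convention, here is a small sketch (wrapping the mission’s loop in a function; the sign convention and docstring are mine): an added zero-probability outcome leaves the entropy unchanged, exactly as taking 0 * log2(0) = 0 promises.

```python
import math

def calc_entropy(probabilities):
    """Shannon entropy in bits, taking 0 * log2(0) to be 0 by skipping zero terms."""
    entropy = 0.0
    for prob in probabilities:
        if prob > 0:  # the workaround: zero-probability terms contribute 0
            entropy -= prob * math.log2(prob)
    return entropy

# A fair coin carries 1 bit of entropy; adding an impossible third
# outcome (probability 0) changes nothing:
print(calc_entropy([0.5, 0.5]))       # 1.0
print(calc_entropy([0.5, 0.5, 0.0]))  # 1.0
```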

Hi @ferchenkyle,

I guess these three links can help:

Now it’s possible that for some terms, 𝑝(𝑥)=0. In that case the value of that term would normally be undefined; however intuitively, we should be able to ignore probability zero outcomes, and mathematically, while the term is undefined, the limit when the probability approaches zero is 0. For those reasons, such terms are equated to zero.

To this end, textbooks often simply define 0 \log 0 = 0.

Another way to see this is to look at any particular event, say flipping a coin, and adding an extra event with probability 0. For example, with probability 0.5, the coin turns up heads; with probability 0.5 it turns up tails, and with probability 0 it turns into a flying purple elephant. The actual experiment has not changed, so the total entropy should not change; this means the entropy added by the elephant should indeed be 0.

This is why we indeed define the Shannon entropy of events that have zero probability to be 0.
What is the Shannon entropy of a zero probability event?

In the case of P(x_i) = 0 for some i, the value of the corresponding summand 0 \log_b(0) is taken to be 0, which is consistent with the limit:

\lim_{p\to 0^{+}} p\log(p) = 0.

Entropy (information theory)

Ignore the zero probabilities, and carry on summation using the same equation.
Alternative to Shannon’s entropy when probability equal to zero
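A quick numerical check (a plain-Python sketch of my own) makes that limit concrete: p * log2(p) shrinks toward 0 as p approaches 0 from above, which is the whole justification for the 0 * log2(0) = 0 convention.

```python
import math

# p * log2(p) approaches 0 as p -> 0+, consistent with the convention
for p in [0.1, 0.01, 0.001, 1e-6, 1e-12]:
    print(f"p = {p:g}:  p * log2(p) = {p * math.log2(p):.6g}")
```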


It absolutely isn’t valid. It is mathematically incorrect.

Expressions involving invalid “entities” are invalid. So why does this work?

The reason is that the symbol (rather, the string of symbols) 0\cdot \log_2\left(0\right) takes on a meaning different from the one you would expect.

Even though it is meaningless to multiply zero with something that doesn’t exist, it is true that \lim \limits_{x\to 0}\left(x\cdot \log_2\left(x\right)\right) equals 0. Therefore, purely to use convenient notation, we define 0\cdot \log_2\left(0\right) as 0.

It may look silly to write 0\cdot \log_2\left(0\right) instead of 0, and it is. The notation is convenient because it is easier to write \displaystyle \sum\limits_{v\in A}\frac{|T_{v}|}{|T|} \cdot \text{Entropy}(T_{v}) with the understanding that when \vert T_v\vert is zero, we mean what I explained in the paragraphs above.

Handling this case separately would make the notation heavier.
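To make that concrete, here is a small sketch of the weighted-entropy sum above (the function names `entropy` and `information_gain` are mine, not from the mission), where empty splits, i.e. |T_v| = 0, are simply skipped, which is the same convention:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, with 0 * log2(0) taken as 0."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, splits):
    """IG = Entropy(T) - sum over v of |T_v|/|T| * Entropy(T_v).
    Empty splits would contribute 0 to the sum, so they are simply skipped."""
    n = len(labels)
    weighted = sum(len(s) / n * entropy(s) for s in splits if s)
    return entropy(labels) - weighted

# A 4/1 split into two pure branches recovers the full parent entropy:
labels = [0, 0, 0, 0, 1]
print(information_gain(labels, [[0, 0, 0, 0], [1]]))  # ~0.7219
```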


Thanks, I see it can be understood using L’Hôpital’s rule.