# Trouble understanding Information Gain formula

Hello everyone,

I’m struggling a little to understand the formula for Information Gain that is described in this screen.

The explanation says:

“To compute it, we first calculate the entropy for T.
Then, for each unique value v in the variable A, we compute the number of rows in which A takes on the value v, and divide it by the total number of rows. Next, we multiply the results by the entropy of the rows where A is v.
We add all of these subset entropies together, then subtract from the overall entropy to get information gain.”

The section of the explanation that I highlighted keeps mentioning A but all I see in the formula is T.
I’m sure the formula is right but I just can’t make sense of it quite yet.

Any help will be very much appreciated.
Cheers!

A is mentioned right above that paragraph:

“as well as a given variable we want to split on (A).”

In the example on that same page, age is that variable A.

“for each unique value v in the variable A”

So, that’s going to be 25, 30, 50, 80.

“we compute the number of rows in which A takes on the value v”

Only 50 appears twice in age. So, for 25, 30, 50, 80, this step gives us the number of rows with each value: 1, 1, 2, 1.

“and divide it by the total number of rows.”

The total number of rows is 5. So, 1, 1, 2, 1 becomes 0.2, 0.2, 0.4, 0.2.

“Next, we multiply the results by the entropy of the rows where A is v.”

We calculate the entropy of the rows corresponding to each unique age, and multiply each entropy by the corresponding weight obtained above (0.2, 0.2, 0.4, 0.2).

“We add all of these subset entropies together,”

After that, we just sum those final values up.
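For anyone who wants to check these steps numerically, here is a minimal sketch in pandas. It assumes the five-row example table from this thread (ages 25, 50, 30, 50, 80 with a high_income target); the helper name `entropy` is my own and does not come from the lesson.

```python
import numpy as np
import pandas as pd

# Example table assumed from this thread: age is the variable A,
# high_income is the target.
df = pd.DataFrame({
    "age": [25, 50, 30, 50, 80],
    "high_income": [1, 1, 0, 0, 1],
})

def entropy(column):
    """Base-2 entropy of a column of class labels (hypothetical helper)."""
    # value_counts only returns observed labels, so no probability is 0 here.
    probs = column.value_counts(normalize=True)
    return float(-(probs * np.log2(probs)).sum())

# Steps 1 and 2: rows per unique age, divided by the total number of rows.
weights = df["age"].value_counts(normalize=True).sort_index()

# Step 3: entropy of the target within the rows for each unique age.
subset_entropies = df.groupby("age")["high_income"].apply(entropy)

# Step 4: multiply each weight by its subset entropy and add everything up.
weighted_sum = float((weights * subset_entropies).sum())
```

Only the age 50 subset mixes both target values, so it is the only one that contributes a non-zero term to the sum.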

The steps above are what the summation part of the formula is doing.

Can you please shed some light on how we arrived at the calculation, specifically 2/4 & 1/5?


@the_doctor I’m pretty sure I understand the formula explanation with no problem, your comment above solidifies it. But that does lead to my question on the formula itself. First, I’m no expert on mathematical notations. I’m confused with the part that describes the weight in the formula – |Tv|/|T|, it’s not very intuitive to use the notation for target T here. To my understanding, it seems that |Av|/|A| would be more appropriate. I would really appreciate some clarification on this one. Thanks ahead!

I myself haven’t gone through all of that content yet. My answer above was based off of a relatively quick look.

So, I can’t quite comment on why such a notation was used as of now. I do recall feeling the same, though. I felt the notation was not that suitable for this, as you also point out. But once I actually go through that content I can be more sure. Unfortunately, that might take a long time. Perhaps best to ask a separate question and someone from DQ can shed some light on it.


Can someone please explain how you get the 2/4 * log2(2/4) part? The info gain formula notation is confusing, since it uses T(subV)/T and T(subV) to represent the A variable. Also, the community moderator didn’t really explain how we got the 2/4 part that was asked about earlier in this thread. I wish you guys could explain this more clearly.

I’ll try to explain how to work through the notation. First, let’s set up our data:

>>> import pandas as pd
>>> data = {
... "age": [25, 50, 30, 50, 80],
... "high_income": [1, 1, 0, 0, 1],
... "split_age": [0, 0, 0, 0, 1]
... }
>>> df = pd.DataFrame(data)
>>> df
   age  high_income  split_age
0   25            1          0
1   50            1          0
2   30            0          0
3   50            0          0
4   80            1          1


We will be looking at the formula. . .

\displaystyle IG(T,A) = \text{Entropy}(T)-\sum \limits_{v\in A} \left(\dfrac{|T_{v}|}{|T|} \cdot \text{Entropy}(T_{v})\right)

where A represents the (unique) values in split_age. In other words, A = \{0, 1\}. Thus, the summation. . .

\sum \limits_{v\in A} \left(\dfrac{|T_{v}|}{|T|} \cdot \text{Entropy}(T_{v})\right)

. . . can be unpacked as follows:

\begin{align} \sum \limits_{v\in A} \left(\dfrac{|T_{v}|}{|T|} \cdot \text{Entropy}(T_{v})\right) &= \sum \limits_{v\in \{0, 1\}} \left(\dfrac{|T_{v}|}{|T|} \cdot \text{Entropy}(T_{v})\right)\\ &= \dfrac{|T_{0}|}{|T|} \text{Entropy}(T_{0}) + \dfrac{|T_{1}|}{|T|} \text{Entropy}(T_{1}) \tag 1 \end{align}

Let’s dig into the notation again. The symbol T denotes the whole dataset, while the symbol T_v represents the rows for the splitting value v that comes from split_age. In particular, and more explicitly, we have that:

• T_0 is the set of rows for which split_age equals 0:

>>> df[df["split_age"] == 0]
   age  high_income  split_age
0   25            1          0
1   50            1          0
2   30            0          0
3   50            0          0

• T_1 is the set of rows for which split_age equals 1:

>>> df[df["split_age"] == 1]
   age  high_income  split_age
4   80            1          1


The vertical bars with a set in-between them denote the number of elements of that set. Therefore we have:

• \left\vert T\right\vert = 5
• \left\vert T_0\right\vert = 4
• \left\vert T_1\right\vert = 1

Replacing back in (1) we obtain:

\dfrac{4}{5} \text{Entropy}(T_{0}) + \dfrac{1}{5} \text{Entropy}(T_{1}) \tag 2
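The two weights in (2) can be double-checked with a short sketch like the one below. It rebuilds the same table so that it runs on its own; the variable names (`subset_sizes`, `weights`) are illustrative choices, not anything from the lesson.

```python
import pandas as pd

# Same example table as above.
df = pd.DataFrame({
    "age": [25, 50, 30, 50, 80],
    "high_income": [1, 1, 0, 0, 1],
    "split_age": [0, 0, 0, 0, 1],
})

# |T|: total number of rows.
total_rows = len(df)

# |T_v|: number of rows for each splitting value v in split_age.
subset_sizes = df["split_age"].value_counts().sort_index()

# |T_v| / |T|: the weight that multiplies Entropy(T_v) in the summation.
weights = subset_sizes / total_rows
```

Here `subset_sizes` holds 4 rows for v = 0 and 1 row for v = 1, so `weights` holds 4/5 and 1/5.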

Now let’s compute the remaining terms, namely \text{Entropy}(T_{0}) and \text{Entropy}(T_{1}).

In a previous screen, we can read that the entropy is given by \displaystyle -\sum \limits_{i=1}^{c} {\mathrm{P}(x_i) \log_2 \left(\mathrm{P}(x_i)\right)} where:

• x_1, \ldots, x_c are the unique values in our target variable (high_income), and
• c is the number of unique values in our target column.

N.B.: If \mathrm{P}(x_i) above is 0, then \log_2 \left(\mathrm{P}(x_i)\right) is undefined. In this case, the term \mathrm{P}(x_i) \log_2 \left(\mathrm{P}(x_i)\right) is replaced by 0.

We thus have c=2, x_1 = 0 and x_2 = 1.

Finally, P(x_i) is the ratio between the number of times x_i occurs in the high_income column in S (where S is either T_0 or T_1) and the number of elements in S.

For T_0 we have P(x_1) = \dfrac{2}{4} and P(x_2) = \dfrac{2}{4}. Consequently:

\begin{align} \text{Entropy}(T_{0}) &= -\sum \limits_{i=1}^{2} {\mathrm{P}(x_i) \log_2 \left(\mathrm{P}(x_i)\right)}\\ &= - \left(\color{blue}{\left(\dfrac{2}{4}\log_2\left(\dfrac{2}{4}\right)\right)} \color{black}{+} \color{brown}{\left(\dfrac{2}{4}\log_2\left(\dfrac{2}{4}\right)\right)}\right) \end{align}

For T_1 we have P(x_1) = 0 and P(x_2) = 1, thus:

\begin{align} \text{Entropy}(T_{1}) &= -\sum \limits_{i=1}^{2} {\mathrm{P}(x_i) \log_2 \left(\mathrm{P}(x_i)\right)}\\ &= - \left(\color{blue}{0} \color{black}{+} \color{brown}{\left(1\cdot \log_2\left(1\right)\right)}\right) \end{align}
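As a sanity check on the two entropies, here is a small plain-Python sketch. The function name `entropy_from_probs` is hypothetical; it takes the class probabilities directly and applies the convention from the N.B. above, treating a term with probability 0 as 0.

```python
import math

def entropy_from_probs(probs):
    """Base-2 entropy of a list of class probabilities (hypothetical helper)."""
    total = 0.0
    for p in probs:
        # Convention: a class with probability 0 contributes nothing,
        # because log2(0) is undefined.
        if p > 0:
            total -= p * math.log2(p)
    return total

# T_0: two rows with high_income == 0 and two with high_income == 1.
entropy_t0 = entropy_from_probs([2 / 4, 2 / 4])

# T_1: a single row with high_income == 1.
entropy_t1 = entropy_from_probs([0, 1])
```

With these inputs `entropy_t0` is 1.0 (an even split is as uncertain as two classes can be) and `entropy_t1` is 0.0 (a single class carries no uncertainty).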

Replacing back in (2) we obtain:

-\dfrac{4}{5} \left(\color{blue}{\left(\dfrac{2}{4}\log_2\left(\dfrac{2}{4}\right)\right)} \color{black}{+} \color{brown}{\left(\dfrac{2}{4}\log_2\left(\dfrac{2}{4}\right)\right)}\right) - \dfrac{1}{5} \left(\color{blue}{0} \color{black}{+} \color{brown}{\left(1\cdot \log_2\left(1\right)\right)}\right)

The rest is just computations.
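To carry the computation through to the end, the sketch below evaluates the whole formula for split_age on the example table. The function names `entropy` and `information_gain` are my own, and this is only one possible way to write it, not necessarily how the lesson implements it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, 50, 30, 50, 80],
    "high_income": [1, 1, 0, 0, 1],
    "split_age": [0, 0, 0, 0, 1],
})

def entropy(column):
    """Base-2 entropy of a column of class labels (hypothetical helper)."""
    # value_counts only returns observed labels, so log2 never sees 0.
    probs = column.value_counts(normalize=True)
    return float(-(probs * np.log2(probs)).sum())

def information_gain(data, target, split):
    """IG(T, A) = Entropy(T) - sum over v of |T_v| / |T| * Entropy(T_v)."""
    weighted_sum = 0.0
    for _, subset in data.groupby(split):
        weight = len(subset) / len(data)  # |T_v| / |T|
        weighted_sum += weight * entropy(subset[target])
    return entropy(data[target]) - weighted_sum

gain = information_gain(df, "high_income", "split_age")
```

The weighted sum is 4/5 * 1 + 1/5 * 0 = 0.8 and Entropy(T) is about 0.971, so `gain` comes out at roughly 0.171.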


I hope this post helps.

I was a bit confused with this one as well. After mulling over it for a while, I’m wondering if the wording should be:

“To compute it, we first calculate the entropy for T.
Then, for each unique value v in the variable A, we compute the number of rows in which A takes on the value v, and divide it by the total number of rows. Next, we multiply the results by the entropy **for T**, of the rows where A is v.
We add all of these subset entropies together, then subtract from the overall entropy to get information gain.”

Where the text in bold is where I have made modifications.

For me that makes a bit more sense, but I admit, I could be completely wrong as I have little knowledge on the semantics for mathematical notation.


Hi @peakelaw,

Could you please share your feedback with the Content & Product teams of Dataquest? Just click the ? button in the upper-right corner of any screen of the Dataquest learning platform, select Share Feedback, fill in the form, and send it. Thanks!
