Information Gain - Confusion in Calculation

Can someone please shed some light on how we arrived at the calculation, specifically 2/4 and 1/5?

Screen Link: https://app.dataquest.io/m/89/introduction-to-decision-trees/10/information-gain

Details on the formula itself can be found in my response in the thread "Trouble understanding Information Gain formula".

First, let’s look at 1/5. The content says:

for each unique value v in the variable A, we compute the number of rows in which A takes on the value v, and divide it by the total number of rows.

This is what $\frac{|T_v|}{|T|}$ is.

What is A?

  • split_age

How many unique values are there in A?

  • 2; 0 and 1.

What is the number of rows in which A takes on the value 1?

  • 1.

What is the total number of rows?

  • 5.

So,

compute the number of rows in which A takes on the value v, and divide it by the total number of rows.

is 1/5.
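
As a minimal sketch of the counting step above: the list below is an assumption built only from the counts given in this thread (four rows where `split_age` is 0 and one where it is 1), and the row order is illustrative.

```python
from collections import Counter

# Assumed data: only the counts (four 0s, one 1) come from the thread;
# the row order is made up for illustration.
split_age = [0, 0, 0, 0, 1]

total_rows = len(split_age)      # |T|
counts = Counter(split_age)      # |T_v| for each unique value v

# |T_v| / |T| for each unique value v in A
weights = {v: n / total_rows for v, n in counts.items()}
print(weights)  # {0: 0.8, 1: 0.2}
```

The weight for v = 1 is 0.2, which is the 1/5 in the calculation.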

Coming to 2/4.

This can be a bit confusing. But remember that we are calculating the entropy for T_v where v is a unique value in A.

Previously, we calculated the entropy for T using every row, so the probabilities were those of each unique value across the whole of T. That’s how we get the 0.97. This was also covered in Step 8 of this Mission.

But Entropy(T_v) means we only look at the rows of T where A takes the value v (here, 0 or 1) and compute the entropy of that subset.

So, when v is 0 in A:

How many total rows in high_income (this is our T) correspond to rows in A with value 0?

  • 4.

How many of those 4 rows in high_income have the value 0?

  • 2.

How many of those 4 rows in high_income have the value 1?

  • 2.

So, what’s the probability of the value being 0 given the above?

  • 2/4

What’s the probability of the value being 1 given the above?

  • 2/4
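
Plugging those two probabilities into the entropy formula gives the entropy of this subset. This is a worked sketch; writing the subset as $T_0$ is my own shorthand for $T_v$ with v = 0.

$$
\text{Entropy}(T_0) = -\left(\frac{2}{4}\log_2\frac{2}{4} + \frac{2}{4}\log_2\frac{2}{4}\right) = -\left(-\frac{1}{2} - \frac{1}{2}\right) = 1
$$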

You can similarly calculate the entropy of T for when v in A is 1, which should be fairly simple since only one row is involved.
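
To check the whole calculation end to end, here is a rough sketch in plain Python. The helper names `entropy` and `information_gain` are hypothetical, not from the Mission. The data is also an assumption: the thread only tells us that the four rows with `split_age` 0 hold two 0s and two 1s in `high_income`, so the fifth `high_income` value (1 here) and the row order are guesses. The result is the same whichever value the fifth row takes.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(target, split):
    """Entropy(T) minus the weighted entropy of each subset T_v."""
    total = len(target)
    gain = entropy(target)
    for v in set(split):
        # T_v: the target values on rows where the split column equals v
        subset = [t for t, s in zip(target, split) if s == v]
        gain -= (len(subset) / total) * entropy(subset)
    return gain

# Assumed rows, consistent with the counts in this thread.
split_age   = [0, 0, 0, 0, 1]
high_income = [1, 1, 0, 0, 1]

print(round(entropy(high_income), 2))                      # 0.97
print(round(information_gain(high_income, split_age), 2))  # 0.17
```

The subset for v = 1 has a single row, so its entropy is 0 and only the v = 0 subset (entropy 1, weight 4/5) reduces the gain.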
