# Question on Introduction to Decision Trees

Hi, I am at step 10 of Introduction to Decision Trees, https://app.dataquest.io/m/89/introduction-to-decision-trees/10/information-gain. As shown in the attached picture, I am having trouble figuring out how the numbers in the example came about. split_age contains four 0s and one 1, but the part that calculates Tv/T uses 2/4, for example. Could someone explain how these two values are determined?

Thanks

Hey @xuehong.liu.pdx,

The example gives this text:

Let's say we wanted to split this data set based on age. First, we calculate the median age, which is `50`. Then, we assign any row with a value less than or equal to the median age the value `0` (in a new column named `split_age`), and the other rows `1`.

Here is what the text is doing:

```
median_age = income["age"].median()

# Rows at or below the median go to the left split, the rest to the right split
left_split = income[income["age"] <= median_age]
right_split = income[income["age"] > median_age]

# Equivalent view: add a binary split_age column to the full dataframe,
# 0 for age <= median (left split) and 1 for age > median (right split)
income["split_age"] = (income["age"] > median_age).astype(int)
```
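
To see the split concretely, here is a minimal sketch in plain Python lists rather than a dataframe. The ages are hypothetical values I picked so that the median is 50; only the high_income values and the resulting split sizes come from the example.

```python
from statistics import median

# Hypothetical ages (assumed, not taken from the lesson), chosen so that the
# median is 50. The high_income values match the ones used in this example.
ages = [25, 50, 30, 50, 80]
high_income = [1, 1, 0, 0, 1]

median_age = median(ages)

# 0 = age at or below the median (left split), 1 = age above it (right split)
split_age = [1 if age > median_age else 0 for age in ages]

# high_income values that land in each split
L = [y for y, s in zip(high_income, split_age) if s == 0]
R = [y for y, s in zip(high_income, split_age) if s == 1]

print(split_age)  # four 0s and one 1
print(L, R)       # four high_income values on the left, one on the right
```

So split_age only tells you which side a row falls on. The 2/4 values come from counting high_income inside the left side.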

Here left_split has two 1s and two 0s in the high_income column,
and right_split has one 1 and no 0s in the high_income column.

Let L be the sample of high_income values in left_split.
Let R be the sample of high_income values in right_split.
Let N be the combined sample, L + R.

Then we have,

L = [0, 0, 1, 1]
R = [1]
N = [0, 0, 1, 1, 1]

L has a sample size of 4, with 2 zeros and 2 ones.
R has a sample size of 1, with 0 zeros and 1 one.
N has a sample size of 5, with 2 zeros and 3 ones.

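Probability of x in a split = number of rows in the split with high_income equal to x / total number of rows in the split

```
# split is either left_split or right_split
prob = lambda x, split: split[split["high_income"] == x].shape[0] / split.shape[0]
```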

Using the lambda function, compute the probabilities a, b, c, and d,

where a = P(x=0 in L) and b = P(x=1 in L),
and where c = P(x=0 in R) and d = P(x=1 in R):

a, b = P(x=0 in L) = 2/4, P(x=1 in L) = 2/4
c, d = P(x=0 in R) = 0/1 = 0, P(x=1 in R) = 1/1 = 1
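
The same four probabilities can be checked with plain lists. This is just a sketch: the two-argument `prob` helper below is my own illustration, not the lesson's code.

```python
def prob(x, sample):
    """Fraction of values in `sample` that are equal to x."""
    return sample.count(x) / len(sample)

L = [0, 0, 1, 1]  # high_income values in the left split
R = [1]           # high_income values in the right split

a, b = prob(0, L), prob(1, L)
c, d = prob(0, R), prob(1, R)

print(a, b)  # 0.5 0.5
print(c, d)  # 0.0 1.0
```

Note that the denominator is the size of the split (4 or 1), not the size of the whole data set (5).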

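Using the entropy formula for the left and right splits (a term with probability 0 contributes 0, since 0 * log(0) is taken to be 0):

```
import math

left_split_income_entropy = -(a * math.log(a, 2) + b * math.log(b, 2))
# c = 0 here, so its term is dropped; math.log(0, 2) would raise a ValueError
right_split_income_entropy = -(d * math.log(d, 2))
```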

Substitute a = 2/4, b = 2/4, c = 0, and d = 1:

```
left_split_income_entropy = -(2/4 * math.log(2/4, 2) + 2/4 * math.log(2/4, 2))  # = 1
# The c = 0 term contributes 0 and is left out
right_split_income_entropy = -(1 * math.log(1, 2))  # = 0
```
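
One caveat when coding this up: `math.log(0, 2)` raises a ValueError, so a probability of 0 has to be skipped rather than evaluated. A small sketch of an entropy helper (the name `entropy` is my own choice) that sidesteps the problem by only looking at values that actually occur in the sample:

```python
import math
from collections import Counter

def entropy(sample):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(sample)
    # Counter only holds labels that are present, so p is never 0
    probs = [count / total for count in Counter(sample).values()]
    return sum(-p * math.log2(p) for p in probs)

left_split_income_entropy = entropy([0, 0, 1, 1])
right_split_income_entropy = entropy([1])

print(left_split_income_entropy)   # 1.0
print(right_split_income_entropy)  # 0.0
```

An even 50/50 split is as uncertain as a binary label can get (entropy 1), while a split with a single class has no uncertainty at all (entropy 0).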

To compute T_v/T, we need the probability of each split, that is, the fraction of all rows that falls into it:

Probability of a split = number of elements in the split / total number of elements in the population

Let x be probability of left split, P( L ).
Let y be probability of right split, P( R ).

From observation, L has a sample size of 4 and R has a sample size of 1. Thus, we have:

Sample size of N = Sample size of L + Sample Size of R = 5
x = P( L ) = Sample size of L / Sample size of N = 4/5
y = P( R ) = Sample size of R / Sample size of N = 1/5
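
As a quick check of these two weights, here is a sketch using `fractions.Fraction` so the results stay exact:

```python
from fractions import Fraction

L = [0, 0, 1, 1]  # high_income values in the left split
R = [1]           # high_income values in the right split
N = L + R         # all high_income values

# Weight of each split: rows in the split / rows in the whole data set
x = Fraction(len(L), len(N))
y = Fraction(len(R), len(N))

print(x, y)  # 4/5 1/5
```

The weights always add up to 1, because every row ends up in exactly one split.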

Then we can compute the information gain, IG, by plugging these values into the formula:

IG = total entropy - (P( L ) * entropy( L ) + P( R ) * entropy( R ))

where total entropy = entropy( N ), the entropy of high_income before the split,
where entropy( L ) = left_split_income_entropy,
and where entropy( R ) = right_split_income_entropy.
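
Putting the pieces together, here is a rough end-to-end sketch on plain lists. The helper names `entropy` and `information_gain` are my own, not the lesson's.

```python
import math

def entropy(sample):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(sample)
    # set(sample) only yields labels that occur, so p is never 0
    probs = [sample.count(value) / total for value in set(sample)]
    return sum(-p * math.log2(p) for p in probs)

def information_gain(parent, splits):
    """Entropy of the parent minus the size-weighted entropy of the splits."""
    weighted = sum(len(s) / len(parent) * entropy(s) for s in splits)
    return entropy(parent) - weighted

L = [0, 0, 1, 1]
R = [1]
N = L + R

total_entropy = entropy(N)
ig = information_gain(N, [L, R])

print(round(total_entropy, 3))  # 0.971
print(round(ig, 3))             # 0.171
```

By hand, that is 0.971 - (4/5 * 1 + 1/5 * 0) = 0.171.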

Hope it helps.

Best,
Alvin.

Ah, thank you so much, alvinctk, for taking the time to explain. I finally figured out what's happening here.

Best,
Xuehong

You are welcome. I guess my solution isn't well liked.

Hi, I am at step 5 of "Building a Decision Tree", https://app.dataquest.io/m/90/building-a-decision-tree/5/creating-a-simple-recursive-algorithm. I am confused by the answer after the data is split at best_column (see below). I expected the nested calls to ID3 to be on left_split and then on right_split, rather than on [left_split, right_split]. The split seems to be an unnecessary step if it is directly followed by "for split in [left_split, right_split]:".

Thanks

```
left_split = data[data[best_column] <= column_median]
right_split = data[data[best_column] > column_median]

for split in [left_split, right_split]:
    id3(split, target, columns)
```

Hey @xuehong.liu.pdx,

Sorry, I won't be able to answer, as I will be focusing my time on studying for a Google interview.

I will leave the post open for others to respond.

If no one responds to your post, you can email the Dataquest team for help at [email protected].

Best,
Alvin.

@alvinctk OK, thanks and good luck!
Xuehong

Excellent explanation with a good example.