Question on Introduction to Decision Trees

Hi, I am at Introduction to Decision Trees, step 10. As shown in the attached picture, I am having trouble figuring out where the numbers in the example come from. split_age contains four 0s and one 1, but the part that calculates Tv/T uses 2/4, for example. Could someone explain how these two values are determined?


Hey @xuehong.liu.pdx,

The example text says:

Let’s say we wanted to split this data set based on age. First, we calculate the median age, which is 50. Then, we assign any row with a value less than or equal to the median age the value 0 (in a new column named split_age), and the other rows 1.

Here is what the text is doing:

median_age = income["age"].median()

# Rows at or below the median go left; rows above it go right
left_split = income[income["age"] <= median_age]
right_split = income[income["age"] > median_age]

# Create the binary split_age column:
# 0 where age <= median, 1 where age > median
income["split_age"] = (income["age"] > median_age).astype(int)

Here left_split has 2 ones and 2 zeros in the high_income column,
and right_split has 1 one and 0 zeros in the high_income column.
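To make the counts concrete, here is a minimal sketch in plain Python (no pandas, so it runs on its own). The ages are hypothetical values I made up so that the median is 50; only the high_income pattern matches the example.

```python
from statistics import median

# Hypothetical rows: the ages are invented for illustration,
# only the high_income values follow the example in this thread.
ages = [25, 50, 30, 50, 80]
high_income = [0, 0, 1, 1, 1]

median_age = median(ages)

# 0 where age <= median, 1 where age > median
split_age = [0 if age <= median_age else 1 for age in ages]

# high_income values that land in each side of the split
L = [y for y, s in zip(high_income, split_age) if s == 0]
R = [y for y, s in zip(high_income, split_age) if s == 1]

print(split_age)  # four 0s and one 1
print(L, R)
```

So split_age has four 0s and one 1, while the 2/4 comes from looking at high_income inside the left split only.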

Let L be the sample of high_income in left_split.
Let R be the sample of high_income in right_split.
Let N be the sample of L + R.

Then we have,

L = [0, 0, 1, 1]
R = [1]
N = [0, 0, 1, 1, 1]

Sample size of L is 4 and has 2 zeros and 2 ones.
Sample size of R is 1 and has 0 zeros and 1 one.
Sample size of N is 5 and has 2 zeros and 3 ones.

Probability of x in a split = number of rows with high_income == x in that split / number of rows in that split

prob = lambda split, x: split[split["high_income"] == x].shape[0] / split.shape[0]

Use this lambda function to compute the probabilities a, b, c, and d,

where a = P(x=0 in L) and b = P(x=1 in L),
and where c = P(x=0 in R) and d = P(x=1 in R):

a, b = prob(left_split, 0), prob(left_split, 1)    # 2/4, 2/4
c, d = prob(right_split, 0), prob(right_split, 1)  # 0/1 = 0, 1/1 = 1

Using the entropy formula for the left and right splits:

left_split_income_entropy = -(a * math.log(a, 2) + b * math.log(b, 2))
right_split_income_entropy = -(c * math.log(c, 2) + d * math.log(d, 2))

Substitute a = 2/4, b = 2/4, c = 0, and d = 1:

left_split_income_entropy = -(2/4 * math.log(2/4, 2) + 2/4 * math.log(2/4, 2))  # = 1
right_split_income_entropy = -(0 * math.log(0, 2) + 1 * math.log(1, 2))  # = 0, treating 0 * log(0) as 0

Note that math.log(0, 2) raises a ValueError in Python, so in real code the c = 0 term has to be skipped. By convention it contributes 0 to the entropy.
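A minimal sketch of an entropy helper that skips zero-probability terms (the 0 * log(0) = 0 convention), so both expressions can actually be evaluated. The function name entropy is my own, not something from the lesson.

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits; terms with p == 0 are skipped (they contribute 0)."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

# a = 2/4, b = 2/4 for the left split; c = 0, d = 1 for the right split
left_split_income_entropy = entropy([2/4, 2/4])
right_split_income_entropy = entropy([0, 1])

print(left_split_income_entropy, right_split_income_entropy)
```

The left split is an even mix of 0s and 1s, so its entropy is 1 bit. The right split contains a single class, so its entropy is 0.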

To compute T_V/T, we need the probability of each split, that is, the fraction of all rows that falls into it:

Probability of a split = number of elements in the split / total number of elements in the population

Let x be probability of left split, P( L ).
Let y be probability of right split, P( R ).

From observation, L has a sample size of 4 and R has a sample size of 1. Thus, we have:

Sample size of N = Sample size of L + Sample Size of R = 5
x = P( L ) = Sample size of L / Sample size of N = 4/5
y = P( R ) = Sample size of R / Sample size of N = 1/5

Then we can compute the information gain, IG, by plugging these values into the formula:

IG = total entropy - (P( L ) * entropy( L ) + P( R ) * entropy( R ))

where total entropy is the entropy of high_income over all of N,
where entropy( L ) = left_split_income_entropy,
and where entropy( R ) = right_split_income_entropy
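Putting the pieces together, here is a small self-contained sketch that computes IG for the L, R, and N lists from this example. The helper name entropy is hypothetical and works on raw label lists rather than on the income dataframe.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total)
               for n in Counter(labels).values())

L = [0, 0, 1, 1]   # high_income values in left_split
R = [1]            # high_income values in right_split
N = L + R          # all high_income values

p_left = len(L) / len(N)    # T_V/T for the left split, 4/5
p_right = len(R) / len(N)   # T_V/T for the right split, 1/5

weighted_entropy = p_left * entropy(L) + p_right * entropy(R)
information_gain = entropy(N) - weighted_entropy

print(round(information_gain, 4))
```

With these numbers the total entropy is about 0.971, the weighted entropy of the splits is 4/5 * 1 + 1/5 * 0 = 0.8, and IG comes out to about 0.171.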

Hope it helps.



Ah, thank you so much, alvinctk, for taking the time to explain. I finally figured out what’s happening here.


You are welcome. I guess my solution isn’t well liked. :smile:


Hi, I am at step 5 of “Building a Decision Tree”. I am confused by the answer after splitting the data at best_column (see below). I expected the nested calls to ID3 to be made on left_split and then on right_split, rather than on [left_split, right_split]. The split seems to be an unnecessary step if it is directly followed by "for split in [left_split, right_split]:".


left_split = data[data[best_column] <= column_median]
right_split = data[data[best_column] > column_median]

for split in [left_split, right_split]:
    id3(split, target, columns)

Hey @xuehong.liu.pdx,

Sorry, I won’t be able to answer, as I will be focusing my time on studying for a Google interview.

I will leave the post open for others to respond.

If no one responds to your post, you can email the Dataquest team for help at [email protected].


@alvinctk OK, thanks and good luck!


Excellent explanation with a good example.