Hey @xuehong.liu.pdx,
From the example, the text says:

"Let's say we wanted to split this data set based on age. First, we calculate the median age, which is 50. Then, we assign any row with a value less than or equal to the median age the value 0 (in a new column named split_age), and the other rows 1."
Here is what the text is doing:
median_age = income["age"].median()
left_split = income[income["age"] <= median_age]
right_split = income[income["age"] > median_age]

# Suppose we have a split dataframe df
# (df can be either left_split or right_split).
# Create a new column split_age as a binary label based on the median age:
df["split_age"] = 0
df.loc[df["age"] > median_age, "split_age"] = 1
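As a concrete sketch, here is a hypothetical 5-row income DataFrame (the ages and high_income values are made up, chosen only so the split matches the counts below) run through the split:

```python
import pandas as pd

# Hypothetical toy data: 4 rows at or below the median age, 1 row above it.
income = pd.DataFrame({
    "age":         [30, 40, 50, 50, 60],
    "high_income": [0,  0,  1,  1,  1],
})

median_age = income["age"].median()               # 50.0
left_split = income[income["age"] <= median_age]
right_split = income[income["age"] > median_age]

print(left_split["high_income"].tolist())         # [0, 0, 1, 1]
print(right_split["high_income"].tolist())        # [1]
```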
where left_split has 2 ones and 2 zeros in the high_income column,
and right_split has 1 one and 0 zeros in the high_income column.
Let L be the sample of high_income in left_split.
Let R be the sample of high_income in right_split.
Let N be the combined sample of L and R.
Then we have,
L = [0, 0, 1, 1]
R = [1]
N = [0, 0, 1, 1, 1]
Sample size of L is 4, with 2 zeros and 2 ones.
Sample size of R is 1, with 0 zeros and 1 one.
Sample size of N is 5, with 2 zeros and 3 ones.
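These counts can be checked directly with plain Python lists:

```python
L = [0, 0, 1, 1]
R = [1]
N = L + R  # combine the two samples

print(len(L), L.count(0), L.count(1))  # 4 2 2
print(len(R), R.count(0), R.count(1))  # 1 0 1
print(len(N), N.count(0), N.count(1))  # 5 2 3
```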
Probability of x in a split = number of rows where high_income == x / total rows in the split
prob = lambda df, x: df[df["high_income"] == x].shape[0] / df.shape[0]
Using the lambda function to compute probabilities a, b, c, and d,
where a = P(x=0 in L) and b = P(x=1 in L),
and where c = P(x=0 in R) and d = P(x=1 in R):
a, b = prob(left_split, 0), prob(left_split, 1) = 2/4, 2/4
c, d = prob(right_split, 0), prob(right_split, 1) = 0/1 = 0, 1/1 = 1
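The same probabilities can be checked with a list-based version of the lambda (counting occurrences in the sample instead of filtering a DataFrame):

```python
# The split sample is passed in explicitly as a list of high_income values.
prob = lambda sample, x: sample.count(x) / len(sample)

L = [0, 0, 1, 1]   # high_income values in left_split
R = [1]            # high_income values in right_split

a, b = prob(L, 0), prob(L, 1)   # 0.5, 0.5
c, d = prob(R, 0), prob(R, 1)   # 0.0, 1.0
```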
Using the entropy formula for the left and right splits:
left_split_income_entropy = -(a * math.log(a, 2) + b * math.log(b, 2))
right_split_income_entropy = -(c * math.log(c, 2) + d * math.log(d, 2))
Substituting a = 2/4, b = 2/4, c = 0, and d = 1:
left_split_income_entropy = -(2/4 * math.log(2/4, 2) + 2/4 * math.log(2/4, 2)) = 1
right_split_income_entropy = -(0 + 1 * math.log(1, 2)) = 0
(Note: math.log(0, 2) raises an error, so by convention the 0 * log2(0) term is taken to be 0.)
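A sketch of this calculation; since math.log(0, 2) raises a ValueError, the helper below skips zero probabilities, which implements the 0 * log2(0) = 0 convention:

```python
import math

def entropy(probs):
    # Shannon entropy in bits; zero probabilities contribute nothing.
    return -sum(p * math.log(p, 2) for p in probs if p > 0)

a, b = 2/4, 2/4   # left split: P(0), P(1)
c, d = 0.0, 1.0   # right split: P(0), P(1)

left_split_income_entropy = entropy([a, b])    # 1.0
right_split_income_entropy = entropy([c, d])   # 0.0 (a pure split)
```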
In order to compute the weight T_v / T (the fraction of rows that land in each branch), we have to compute the probability of the left and right splits:
Probability of a split = number of elements in the split / total number of elements in the population
Let x be probability of left split, P( L ).
Let y be probability of right split, P( R ).
From observation, L has a sample size of 4 and R has a sample size of 1. Thus, we have:
Sample size of N = Sample size of L + Sample size of R = 5
x = P( L ) = Sample size of L / Sample size of N = 4/5
y = P( R ) = Sample size of R / Sample size of N = 1/5
Then we can compute the Information Gain, IG, by plugging into the formula:
IG = total entropy - (P(L) * entropy( L ) + P( R ) * entropy( R ))
where entropy( L ) = left_split_income_entropy
and where entropy( R ) = right_split_income_entropy
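Putting it all together with the numbers from the worked example above (total entropy is computed over N = [0, 0, 1, 1, 1]):

```python
import math

def entropy(probs):
    # Shannon entropy in bits; zero probabilities contribute nothing.
    return -sum(p * math.log(p, 2) for p in probs if p > 0)

# Total entropy over N: 2 zeros and 3 ones out of 5 rows.
total_entropy = entropy([2/5, 3/5])

# Split entropies and split probabilities from the steps above.
left_entropy, right_entropy = entropy([2/4, 2/4]), entropy([0, 1])
p_left, p_right = 4/5, 1/5

ig = total_entropy - (p_left * left_entropy + p_right * right_entropy)
print(round(ig, 3))  # 0.171
```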
Hope this helps.
Best,
Alvin.