Hey @xuehong.liu.pdx,

From the example, the text is given:

> Let's say we wanted to split this data set based on age. First, we calculate the median age, which is `50`. Then, we assign any row with a value less than or equal to the median age the value `0` (in a new column named `split_age`), and the other rows `1`.

Here is what the text is doing:

```
median_age = income["age"].median()
left_split = income[income["age"] <= median_age]
right_split = income[income["age"] > median_age]

# Suppose we have a split dataframe df
# (df can be either left_split or right_split).
# Create the binary split_age column from the median age:
# 1 where age is above the median, 0 otherwise.
df["split_age"] = (df["age"] > median_age).astype(int)
```

where `left_split` has 2 ones and 2 zeros in the `high_income` column,

and `right_split` has 1 one and 0 zeros in the `high_income` column.

Let L be the sample of high_income in left_split.

Let R be the sample of high_income in right_split.

Let N be the combined sample of L and R.

Then we have,

L = [0, 0, 1, 1]

R = [1]

N = [0, 0, 1, 1, 1]

Sample size of L is 4 and has 2 zeros and 2 ones.

Sample size of R is 1 and has 0 zeros and 1 one.

Sample size of N is 5 and has 2 zeros and 3 ones.
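These counts can be checked with a quick sketch using plain Python lists, filled with the values quoted above:

```python
# Toy high_income samples taken from the text above
L = [0, 0, 1, 1]   # left split
R = [1]            # right split
N = L + R          # combined sample

print(len(L), L.count(0), L.count(1))  # 4 2 2
print(len(R), R.count(0), R.count(1))  # 1 0 1
print(len(N), N.count(0), N.count(1))  # 5 2 3
```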

Probability of a value x in a split = number of rows with `high_income` equal to x / total rows in the split:

```
prob = lambda df, x: df[df["high_income"] == x].shape[0] / df.shape[0]
```

Using the lambda function to compute the probabilities a, b, c, and d,

where a = P(x=0 in L) and b = P(x=1 in L),

and where c = P(x=0 in R) and d = P(x=1 in R):

a = prob(left_split, 0) = 2/4 and b = prob(left_split, 1) = 2/4

c = prob(right_split, 0) = 0/1 = 0 and d = prob(right_split, 1) = 1/1 = 1
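The same numbers can be reproduced with plain lists instead of dataframes (a sketch; this list-based `prob` helper is my own, not from the lesson):

```python
# List-based version of the probability helper (for illustration)
prob = lambda sample, x: sample.count(x) / len(sample)

L = [0, 0, 1, 1]
R = [1]

a, b = prob(L, 0), prob(L, 1)  # 0.5, 0.5
c, d = prob(R, 0), prob(R, 1)  # 0.0, 1.0
```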

Using the entropy formula for the left and right splits. One caveat: `math.log(0, 2)` raises a `ValueError`, so a zero-probability term has to be dropped, using the convention 0 * log2(0) = 0:

```
left_split_income_entropy = -(a * math.log(a, 2) + b * math.log(b, 2))
# c = 0, so the c * log2(c) term contributes 0 and is dropped
right_split_income_entropy = -(d * math.log(d, 2))
```

Substituting a = 2/4, b = 2/4, c = 0, and d = 1:

```
left_split_income_entropy = -(2/4 * math.log(2/4, 2) + 2/4 * math.log(2/4, 2))  # = 1.0
right_split_income_entropy = -(1 * math.log(1, 2))                              # = 0.0
```
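Wrapping this in a small helper (my own sketch, not from the lesson) keeps the zero-probability guard in one place:

```python
import math

def entropy(probs):
    """Shannon entropy of a probability distribution, in bits."""
    # Skip zero probabilities: by convention 0 * log2(0) = 0,
    # and math.log(0, 2) would raise a ValueError.
    return -sum(p * math.log(p, 2) for p in probs if p > 0)

left_split_income_entropy = entropy([2/4, 2/4])   # 1.0
right_split_income_entropy = entropy([0, 1])      # 0.0 (a pure split)
```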

In order to compute T_v/T, we have to compute the probability of landing in either the left or the right split:

Probability of a split = number of elements in the split / total number of elements in the population.

Let x be probability of left split, P( L ).

Let y be probability of right split, P( R ).

From observation, L has a sample size of 4 and R has a sample size of 1. Thus, we have:

Sample size of N = Sample size of L + Sample Size of R = 5

x = P( L ) = Sample size of L / Sample size of N = 4/5

y = P( R ) = Sample size of R / Sample size of N = 1/5

Then we can compute the information gain, IG, by plugging into the formula:

IG = total entropy - (P(L) * entropy( L ) + P( R ) * entropy( R ))

where entropy( L ) = left_split_income_entropy

and where entropy( R ) = right_split_income_entropy
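Putting it all together as a self-contained sketch with the toy samples from above (the `entropy` helper here is my own, not from the lesson):

```python
import math

def entropy(sample):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(sample)
    probs = [sample.count(v) / n for v in set(sample)]
    # Skip zero probabilities (0 * log2(0) = 0 by convention)
    return -sum(p * math.log(p, 2) for p in probs if p > 0)

L = [0, 0, 1, 1]   # high_income values in left_split
R = [1]            # high_income values in right_split
N = L + R          # combined sample

IG = entropy(N) - (len(L) / len(N) * entropy(L)
                   + len(R) / len(N) * entropy(R))
print(round(IG, 4))  # prints 0.171
```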

Hope it helps.

Best,

Alvin.