I’m learning about decision trees. To pick the feature to split on, we choose the one with the highest information gain. The formula for information gain is:

$$IG(T, A) = Entropy(T) - \sum_{v \in A} \frac{|T_v|}{|T|} \cdot Entropy(T_v)$$
As explained in the DQ course:
We’re computing information gain (IG) for a given target variable (T), as well as a given variable we want to split on (A).
To compute it, we first calculate the entropy for T. Then, for each unique value v in the variable A, we compute the number of rows in which A takes on the value v, and divide it by the total number of rows. Next, we multiply the results by the entropy of the rows where A is v. We add all of these subset entropies together, then subtract from the overall entropy to get information gain.
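To check my reading of those steps, here is a minimal Python sketch. It is my own illustration, not code from the course: the function names `entropy` and `information_gain`, the base-2 logarithm, and the toy data are all assumptions.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy of a sequence of labels (base 2 is an assumption)."""
    total = len(values)
    counts = Counter(values)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def information_gain(target, feature):
    """Information gain from splitting `target` on `feature`.

    Both arguments are equal-length sequences, one entry per row.
    """
    total = len(target)
    weighted = 0.0
    for v in set(feature):
        # Target values taken from the rows where the feature equals v.
        t_v = [t for t, a in zip(target, feature) if a == v]
        # Weight = share of rows where the feature equals v.
        weighted += (len(t_v) / total) * entropy(t_v)
    return entropy(target) - weighted


# Hypothetical toy data: four rows, a binary target, two candidate features.
target = [1, 1, 0, 0]
perfect = ["a", "a", "b", "b"]  # separates the target completely
useless = ["a", "b", "a", "b"]  # tells us nothing about the target

print(information_gain(target, perfect))
print(information_gain(target, useless))
```

In this sketch, `t_v` holds the target values from the rows where the feature equals `v`, so `len(t_v) / total` is the fraction of rows described in the quoted explanation.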
Here comes my question:
I’m no expert on mathematical notation, and I’m confused by the weight term in the formula, |Tv|/|T|. Using the notation for the target T here doesn’t seem intuitive. To my understanding, |Av|/|A| would be more appropriate.
I would really appreciate some clarification on this one. Thanks in advance!