# 94-7: Problem with calc_information_gain

My Code:

``````
def find_best_column(data, target_name, columns):
    information_gains = []

    cols = numpy.random.choice(columns, 2)

    for col in cols:
        information_gain = calc_information_gain(data, col, "high_income")
        information_gains.append(information_gain)

    # Find the name of the column with the highest gain
    highest_gain_index = information_gains.index(max(information_gains))
    highest_gain = columns[highest_gain_index]
    print('Splitting on:', highest_gain)
    print('\nBelow median, left')
    print('---------------')
    print(data[data[highest_gain] <= data[highest_gain].median()])
    print('\nAbove median, right')
    print('---------------')
    print(data[data[highest_gain] > data[highest_gain].median()])
    return highest_gain
``````
``````
def test(data, target, columns):
    unique_targets = pandas.unique(data[target])
    nodes.append(len(nodes) + 1)
    tree["number"] = nodes[-1]

    if len(unique_targets) == 1:
        if 0 in unique_targets:
            tree["label"] = 0
        elif 1 in unique_targets:
            tree["label"] = 1
        return

    print('ROOT')
    bc = find_best_column(data, target, columns)
    left_br = data[data[bc] <= data[bc].median()]
    right_br = data[data[bc] > data[bc].median()]

    print('\n\nLEFT BRANCH')
    lbc = find_best_column(left_br, target, columns)
    print('\n\nRIGHT BRANCH')
    rbc = find_best_column(right_br, target, columns)

test(data, "high_income", ["employment", "age", "marital_status"])
``````

What I expected to happen:
Modifying `find_best_column()` as instructed by the exercise (adding `cols = numpy.random.choice(columns, 2)` and looping through `cols`) should run successfully and produce the correct output for this task.

What actually happened:
First the algorithm splits the dataset on the “employment” column. In the right branch it then determines that employment also gives the greatest information gain at this node. Every row in this subset has the same employment value, so the split produces one dataframe with 3 rows (all with an employment category of 5) and one empty dataframe. In the next round of splits, this empty dataframe is passed into `calc_information_gain()`, which divides by the number of rows in the dataframe and raises a division by zero error.

Running the code in the above snippets will print the data at each split to illustrate the issue.
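The empty split can also be reproduced in isolation. The sketch below uses a made-up three-row dataframe (illustrative values, not the exercise data) in which every row has the same employment value. The final division only stands in for the kind of per-row proportion `calc_information_gain()` is assumed to compute; it is not the exercise's actual implementation.

```python
import pandas

# Hypothetical subset: every row already has employment == 5
subset = pandas.DataFrame({
    "employment": [5, 5, 5],
    "high_income": [0, 1, 1],
})

# Same median split as in the snippets above
median = subset["employment"].median()
left = subset[subset["employment"] <= median]
right = subset[subset["employment"] > median]

# All rows land on the left; the right side is empty
print(left.shape[0], right.shape[0])

# Assumed stand-in for a proportion computed over the rows of a split:
# on the empty side this is 0 / 0 with plain Python integers
try:
    prob = right[right["high_income"] == 1].shape[0] / right.shape[0]
except ZeroDivisionError:
    prob = None
    print("division by zero on the empty split")
```

Because the median of a constant column equals that constant, no row can be strictly greater than it, so the right-hand side is always empty in this situation.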

The error can be avoided by including this in the `id3()` definition:

``````
def id3(data, target, columns, tree):
    ...
    tree["median"] = column_median

    unique_values = pandas.unique(data[best_column])
    if len(unique_values) == 1:
        tree["label"] = numpy.round(data[target].mean())
        return

    left_split = data[data[best_column] <= column_median]
    ...
``````

However, this produces a result which doesn’t match the expected output for the exercise.

Is there a way to avoid this error and produce the correct output which I have missed?

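You’ve probably figured it out by now, but I think the problem is on this line:
`highest_gain = columns[highest_gain_index]`

The `information_gains` list contains the gains for the randomly selected columns in `cols`, so `highest_gain_index` is a position in `cols`, not in `columns`. Indexing `columns` with it can return a column that was never evaluated.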
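A small sketch of the mismatch, with assumptions stated up front: `cols` is hard-coded here in place of `numpy.random.choice(columns, 2)` so the result is reproducible, and the gain values are invented for illustration.

```python
columns = ["employment", "age", "marital_status"]

# Fixed stand-in for numpy.random.choice(columns, 2)
cols = ["marital_status", "employment"]

# Made-up gains, one per entry in cols, in the same order
information_gains = [0.20, 0.05]

highest_gain_index = information_gains.index(max(information_gains))

# Indexing the full list picks whatever sits at that position in columns
from_columns = columns[highest_gain_index]

# Indexing the sampled list picks the column the gain was computed for
from_cols = cols[highest_gain_index]

print(from_columns)
print(from_cols)
```

With these values, the best gain belongs to `marital_status`, but indexing `columns` selects `employment` instead. Changing the line to `highest_gain = cols[highest_gain_index]` keeps the index and the list it refers to consistent.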