def find_best_column(data, target_name, columns):
    information_gains = []
    cols = numpy.random.choice(columns, 2)
    for col in cols:
        information_gain = calc_information_gain(data, col, "high_income")
        information_gains.append(information_gain)

    # Find the name of the column with the highest gain
    highest_gain_index = information_gains.index(max(information_gains))
    highest_gain = columns[highest_gain_index]

    # Print both sides of the median split for debugging
    print('Splitting on:', highest_gain)
    print('\nBelow median, left')
    print('---------------')
    print(data[data[highest_gain] <= data[highest_gain].median()])
    print('\nAbove median, right')
    print('---------------')
    print(data[data[highest_gain] > data[highest_gain].median()])
    return highest_gain

def test(data, target, columns):
    unique_targets = pandas.unique(data[target])
    nodes.append(len(nodes) + 1)
    tree["number"] = nodes[-1]

    if len(unique_targets) == 1:
        if 0 in unique_targets:
            tree["label"] = 0
        elif 1 in unique_targets:
            tree["label"] = 1
        return

    print('ROOT')
    bc = find_best_column(data, target, columns)
    left_br = data[data[bc] <= data[bc].median()]
    right_br = data[data[bc] > data[bc].median()]

    print('\n\nLEFT BRANCH')
    lbc = find_best_column(left_br, target, columns)
    print('\n\nRIGHT BRANCH')
    rbc = find_best_column(right_br, target, columns)

test(data, "high_income", ["employment", "age", "marital_status"])
What I expected to happen:
find_best_column(), modified as instructed by the exercise (adding cols = numpy.random.choice(columns, 2) and looping through cols), should run successfully and produce the correct output for this task.
What actually happened:
First the algorithm splits the dataset on the "employment" column. Then, in the right branch, it determines that employment gives the greatest information gain at this node too. Every row in this subset has the same employment value, so the split produces one dataframe with 3 rows (all with an employment category of 5) and one empty dataframe. In the next round of splits, the empty dataframe is passed into calc_information_gain(), which divides by the number of rows in the dataframe and so raises a division by zero error.
Running the code in the above snippets will print the data at each split to illustrate the issue.
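To make the failure concrete, here is a minimal sketch of the same situation. The three rows are made up for illustration (they are not the exercise data), and the final division only stands in for whatever calc_information_gain() does with the row count.

```python
import pandas

# Hypothetical rows standing in for the right branch: every row has employment == 5.
right_branch = pandas.DataFrame({
    "employment": [5, 5, 5],
    "age": [25, 40, 52],
    "high_income": [0, 1, 1],
})

# The median of a constant column equals that constant,
# so "<= median" keeps every row and "> median" keeps none.
median = right_branch["employment"].median()
left_split = right_branch[right_branch["employment"] <= median]
right_split = right_branch[right_branch["employment"] > median]

print(left_split.shape[0])   # 3
print(right_split.shape[0])  # 0

# Any ratio that divides by the row count of the empty split fails.
try:
    proportion = 0 / right_split.shape[0]
except ZeroDivisionError as error:
    print("division by zero:", error)
```

A median split on a column with a single unique value always sends every row to the left, which is why the right dataframe comes out empty here.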
The error can be avoided by including this check in id3():
def id3(data, target, columns, tree):
    ...
    tree["median"] = column_median

    # Stop splitting when the chosen column has only one unique value
    unique_values = pandas.unique(data[best_column])
    if len(unique_values) == 1:
        tree["label"] = numpy.round(data[target].mean())
        return

    left_split = data[data[best_column] <= column_median]
    ...
However, this produces a result which doesn’t match the expected output for the exercise.
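For reference, this is roughly what that check does to a node, pulled out into a standalone helper. The helper name constant_column_label and the sample rows are hypothetical, written only for this illustration.

```python
import numpy
import pandas

def constant_column_label(data, target, column):
    """Return the rounded mean of `target` if `column` has a single unique value, else None."""
    if len(pandas.unique(data[column])) == 1:
        # Same labelling rule as the guard: round the mean of the target column.
        return numpy.round(data[target].mean())
    return None

# Hypothetical subset in which every row has employment == 5.
subset = pandas.DataFrame({
    "employment": [5, 5, 5],
    "high_income": [0, 1, 1],
})

label = constant_column_label(subset, "high_income", "employment")
print(label)  # 1.0
```

With this check the node becomes a leaf labelled with the rounded mean of the target, so the tree gets a leaf that the exercise's reference solution may not have, which could explain the mismatch. Note also that numpy.round rounds exact halves to the nearest even value, so a mean of exactly 0.5 gives a label of 0.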
Is there a way I have missed to avoid this error and still produce the expected output?