94-7: Problem with calc_information_gain


Screen Link:
https://app.dataquest.io/m/94/introduction-to-random-forests/7/selecting-random-features

My Code:

def find_best_column(data, target_name, columns):
    information_gains = []
    
    cols = numpy.random.choice(columns, 2)
    
    for col in cols:        
        information_gain = calc_information_gain(data, col, target_name)
        information_gains.append(information_gain)

    # Find the name of the column with the highest gain
    highest_gain_index = information_gains.index(max(information_gains))
    highest_gain = columns[highest_gain_index]
    print('Splitting on:', highest_gain)
    print('\nBelow median, left')
    print('---------------')
    print(data[data[highest_gain] <= data[highest_gain].median()])
    print('\nAbove median, right')
    print('---------------')
    print(data[data[highest_gain] > data[highest_gain].median()])
    return highest_gain

def test(data, target, columns):
    unique_targets = pandas.unique(data[target])
    nodes.append(len(nodes) + 1)
    tree["number"] = nodes[-1]
    
    if len(unique_targets) == 1:
        if 0 in unique_targets:
            tree["label"] = 0
        elif 1 in unique_targets:
            tree["label"] = 1
        return
    
    print('ROOT')
    bc = find_best_column(data, target, columns)
    left_br = data[data[bc] <= data[bc].median()]
    right_br = data[data[bc] > data[bc].median()]
    
    print('\n\nLEFT BRANCH')
    lbc = find_best_column(left_br, target, columns)
    print('\n\nRIGHT BRANCH')
    rbc = find_best_column(right_br, target, columns)

test(data, "high_income", ["employment", "age", "marital_status"])

What I expected to happen:
Modifying find_best_column() as instructed by the exercise (adding cols = numpy.random.choice(columns, 2) and looping through cols) should run successfully and produce the correct output for this task.

What actually happened:
First the algorithm splits the dataset on the “employment” column. In the right branch it then determines that employment gives the greatest information gain at this node too. Every row in this subset has the same employment value, so the split produces one dataframe with 3 rows (all with an employment category of 5) and one empty dataframe. In the next round of splits, the empty dataframe is passed into calc_information_gain(), which divides by the number of rows in the dataframe and raises a division by zero error.
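The empty split can be reproduced in isolation. This is a minimal sketch: the three-row dataframe is a made-up stand-in for the subset described above, not the actual exercise data.

```python
import pandas

# Hypothetical subset: every row has employment == 5 (values invented for illustration).
subset = pandas.DataFrame({
    "employment": [5, 5, 5],
    "high_income": [0, 1, 0],
})

median = subset["employment"].median()

# Same median split used in the exercise code.
left = subset[subset["employment"] <= median]
right = subset[subset["employment"] > median]

# All rows land on the left, so the right side has no rows at all.
print(left.shape[0], right.shape[0])
```

Any later step that divides by the row count of the right-hand dataframe will then divide by zero.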

Running the snippet above prints the data at each split, which illustrates the issue.

The error can be avoided by including this in the id3() definition:

def id3(data, target, columns, tree):
...
    tree["median"] = column_median
    
    unique_values = pandas.unique(data[best_column])
    if len(unique_values) == 1:
        tree["label"] = numpy.round(data[target].mean())
        return
    
    left_split = data[data[best_column] <= column_median]
...

However, this produces a result which doesn’t match the expected output for the exercise.

Is there a way to avoid this error and produce the correct output which I have missed?

Hi @william.arnott,

You’ve probably figured it out by now, but I think the problem is on this line:
highest_gain = columns[highest_gain_index]

I think it should read:
highest_gain = cols[highest_gain_index]
since the information_gains list contains the gains for the randomly selected columns in cols.
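As a rough sketch of that change, the function could look like the block below. The calc_entropy and calc_information_gain helpers and the small dataframe are stand-ins written for this example, on the assumption that they behave like the ones from the exercise; they are not copied from it.

```python
import math

import numpy
import pandas


def calc_entropy(column):
    # Stand-in helper (assumed): entropy of a column of non-negative integer labels.
    counts = numpy.bincount(column)
    probabilities = counts / len(column)
    entropy = 0
    for prob in probabilities:
        if prob > 0:
            entropy += prob * math.log(prob, 2)
    return -entropy


def calc_information_gain(data, split_name, target_name):
    # Stand-in helper (assumed): information gain of a median split on split_name.
    original_entropy = calc_entropy(data[target_name])
    median = data[split_name].median()
    left = data[data[split_name] <= median]
    right = data[data[split_name] > median]
    to_subtract = 0
    for part in [left, right]:
        prob = part.shape[0] / data.shape[0]
        to_subtract += prob * calc_entropy(part[target_name])
    return original_entropy - to_subtract


def find_best_column(data, target_name, columns):
    # Randomly pick two candidate columns (with replacement, the numpy default).
    cols = numpy.random.choice(columns, 2)
    information_gains = [calc_information_gain(data, col, target_name) for col in cols]
    highest_gain_index = information_gains.index(max(information_gains))
    # Index into cols, not columns: the gains list lines up with cols.
    return cols[highest_gain_index]


# Made-up data for illustration only.
data = pandas.DataFrame({
    "signal": [1, 2, 3, 4, 5, 6],
    "noise": [1, 6, 2, 5, 3, 4],
    "noise2": [2, 2, 3, 3, 2, 3],
    "high_income": [0, 0, 0, 1, 1, 1],
})
columns = ["signal", "noise", "noise2"]

best = find_best_column(data, "high_income", columns)
print("Splitting on:", best)
```

With the original indexing, the returned name could only ever be one of the first two entries of columns, whether or not those columns had been sampled.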

I hope this helps.

Happy learning,
Ivelina
