Working With Missing Data (Last Lesson): Struggling to understand the logic in step 8

Hi everyone,

I’m on the last lesson of the course Working With Missing Data where we have to impute the missing data with the most frequent value among all columns.

In step 8, we use the following logic for each pair of vehicle/cause columns:

  1. For values where the vehicle is null and the cause is non-null, set the vehicle to Unspecified.
  2. For values where the cause is null and the vehicle is non-null, set the cause to Unspecified.
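
As a rough sketch, the two rules above could be written in pandas like this. The column names vehicle_1 and cause_vehicle_1 and the toy rows are made up for illustration, so the course's actual code may look different.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are assumptions, not the course's exact ones.
df = pd.DataFrame({
    "vehicle_1": ["Sedan", np.nan, "Taxi", np.nan],
    "cause_vehicle_1": ["Unspecified", "Backing Unsafely", np.nan, np.nan],
})

v_col, c_col = "vehicle_1", "cause_vehicle_1"

# Build both masks before changing anything, so rule 1 cannot affect rule 2.
v_missing = df[v_col].isnull() & df[c_col].notnull()  # rule 1
c_missing = df[c_col].isnull() & df[v_col].notnull()  # rule 2

# Replace only the masked rows with the placeholder category.
df[v_col] = df[v_col].mask(v_missing, "Unspecified")
df[c_col] = df[c_col].mask(c_missing, "Unspecified")

print(df)
```

Rows where both columns are null stay null, since neither rule applies to them.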

Shouldn’t we replace the vehicle with Sedan instead of Unspecified in rule 1? Sedan is the most frequent vehicle:

print(top10_vehicles)

Sedan                                  33133
Station Wagon/Sport Utility Vehicle    26124
PASSENGER VEHICLE                      16026
SPORT UTILITY / STATION WAGON          12356
Taxi                                    3482
Pick-up Truck                           2373
TAXI                                    1892
Box Truck                               1659
Bike                                    1190
Bus                                     1162
dtype: int64

For reference, here are the top 10 most frequent causes of incidents:

print(top_10_causes)

Unspecified                       57481
Driver Inattention/Distraction    17650
Following Too Closely              6567
Failure to Yield Right-of-Way      4566
Passing or Lane Usage Improper     3260
Passing Too Closely                3045
Backing Unsafely                   3001
Other Vehicular                    2523
Unsafe Lane Changing               2372
Turning Improperly                 1590
dtype: int64

I’d appreciate it if someone could explain this a bit further.

Thank you!

@m.awon

You can impute missing values in a categorical feature with the most frequently occurring value.

You can also substitute missing values with a new category as shown in this example.

Train your model on both datasets and see which one performs better.
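
To make the two options concrete, here is a minimal sketch on a made-up Series (the values are illustrative, not taken from the course dataset). It fills the same missing entries once with the most frequent value and once with a new Unspecified category.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing values.
vehicles = pd.Series(["Sedan", "Sedan", "Taxi", np.nan, np.nan, "Bus"])

# Option A: impute with the most frequent value (mode ignores NaN by default).
most_frequent = vehicles.mode()[0]
filled_mode = vehicles.fillna(most_frequent)

# Option B: impute with a new placeholder category.
filled_placeholder = vehicles.fillna("Unspecified")

print(filled_mode.value_counts())
print(filled_placeholder.value_counts())
```

Each filled Series could then go into its own copy of the dataset for the comparison suggested above.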

The reasoning is clarified in the content:

The top “cause” is an “Unspecified” placeholder. This is useful instead of a null value as it makes the distinction between a value that is missing because there were only a certain number of vehicles in the collision versus one that is because the contributing cause for a particular vehicle is unknown.

The vehicles columns don’t have an equivalent, but we can still use the same technique.

The choice to follow the same approach for vehicles as that for the cause is an intentional one.

You can certainly try to replace it with Sedan, but the question becomes whether that’s a more reasonable assumption than sticking with Unspecified. If we don’t know the type of vehicle, is it better to assume that it’s a sedan?

Maybe if we had more data or information, we could say so. Since we don’t, it’s really about what makes more sense given what we want to achieve. As the other reply mentions, you can and should definitely try it both ways.

But be aware of the biases your choice might introduce into the model. If a dataset included specific vehicle model numbers instead and we filled the gaps with the most frequent one, we could end up with a model that predicts that specific model number causes more accidents, even if that isn’t the case.
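
Here is a small sketch of that effect, using made-up counts rather than the course data. It compares the share of the top category before and after filling the missing values with it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 4 Sedans, 3 Taxis, 3 missing values.
vehicles = pd.Series(["Sedan"] * 4 + ["Taxi"] * 3 + [np.nan] * 3)

# Share of Sedan among the known values only (NaN is excluded by default).
observed_share = vehicles.value_counts(normalize=True)["Sedan"]

# Share of Sedan after imputing every missing value with the mode.
imputed = vehicles.fillna(vehicles.mode()[0])
imputed_share = imputed.value_counts(normalize=True)["Sedan"]

print(f"observed: {observed_share:.2f}, after imputing: {imputed_share:.2f}")
```

In this toy case Sedan goes from 4 of 7 known rows to 7 of 10 rows overall, so anything trained on the imputed column sees Sedan more often than the observed data supports.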

This unlocks everything. Thanks.

Let’s say the counts were close, e.g. Sedan (55,000) and Unspecified (57,481). In that case, would it be wise to assign Sedan to the missing values in the vehicle column?