I’m on the last lesson of the course Working With Missing Data where we have to impute the missing data with the most frequent value among all columns.
In step 8, we use the following logic for each pair of vehicle/cause columns:
For values where the vehicle is null and the cause is non-null, set the vehicle to Unspecified.
For values where the cause is null and the vehicle is non-null, set the cause to Unspecified.
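The two rules above could be sketched in pandas roughly as follows. This is only an illustration on a tiny made-up frame: the column names `vehicle_1` and `cause_vehicle_1` and the helper `fill_unspecified` are assumptions for the example, not necessarily what the lesson uses.

```python
import numpy as np
import pandas as pd

# Made-up sample data; the column names are assumed for illustration.
mvc = pd.DataFrame({
    "vehicle_1": ["Sedan", np.nan, "Taxi", np.nan],
    "cause_vehicle_1": ["Unspecified", "Backing Unsafely", np.nan, np.nan],
})

def fill_unspecified(df, vehicle_col, cause_col):
    """Hypothetical helper: apply both rules to one vehicle/cause pair."""
    out = df.copy()
    # Compute both masks before changing anything, so rule 1 cannot affect rule 2.
    vehicle_missing = out[vehicle_col].isnull() & out[cause_col].notnull()
    cause_missing = out[cause_col].isnull() & out[vehicle_col].notnull()
    # Replace only the masked cells with the placeholder.
    out[vehicle_col] = out[vehicle_col].mask(vehicle_missing, "Unspecified")
    out[cause_col] = out[cause_col].mask(cause_missing, "Unspecified")
    return out

filled = fill_unspecified(mvc, "vehicle_1", "cause_vehicle_1")
print(filled)
```

Rows where both the vehicle and the cause are null are left untouched, since neither rule applies to them.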
Aren’t we supposed to replace the vehicle with Sedan instead of Unspecified in rule #1? After all, it is the most frequent vehicle:
Station Wagon/Sport Utility Vehicle 26124
PASSENGER VEHICLE 16026
SPORT UTILITY / STATION WAGON 12356
Pick-up Truck 2373
Box Truck 1659
For reference, here are the top 10 most frequent causes of incident:
Driver Inattention/Distraction 17650
Following Too Closely 6567
Failure to Yield Right-of-Way 4566
Passing or Lane Usage Improper 3260
Passing Too Closely 3045
Backing Unsafely 3001
Other Vehicular 2523
Unsafe Lane Changing 2372
Turning Improperly 1590
I’d appreciate it if someone could explain this a bit further.
The top “cause” is an “Unspecified” placeholder. It is more useful than a null value because it distinguishes between two cases: a value that is missing because only a certain number of vehicles were involved in the collision, and one that is missing because the contributing cause for a particular vehicle is unknown.
The vehicle columns don’t have an equivalent placeholder, but we can still use the same technique.
The choice to follow the same approach for vehicles as for the cause is an intentional one.
You can certainly try replacing it with Sedan, but the question becomes whether that’s a more reasonable assumption than sticking with Unspecified. If we don’t know the type of vehicle, is it better to assume that it’s a sedan?
Maybe if we had more data or information, we could say so. Since we don’t, it’s really about what makes more sense given what we want to achieve. As the other reply mentions, you can and should definitely try it both ways.
But be aware of the biases your choice might introduce into the model. If a dataset included specific vehicle model numbers instead and we imputed the most frequent value, we could end up with a model that predicts that specific model number causes more accidents, even if that is not the case.
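To make the bias concern concrete, here is a small sketch comparing the two choices on made-up data. The values in `vehicles` are invented for illustration and are not taken from the course dataset.

```python
import numpy as np
import pandas as pd

# Invented sample: four known vehicle types and three missing ones.
vehicles = pd.Series(
    ["Sedan", "Sedan", "Taxi", "Box Truck", np.nan, np.nan, np.nan]
)

# Option A: impute with the most frequent value (mode ignores NaN by default).
most_frequent = vehicles.mode()[0]
mode_filled = vehicles.fillna(most_frequent)

# Option B: keep the unknowns visible with a placeholder.
placeholder_filled = vehicles.fillna("Unspecified")

# Count how often the most frequent value appears under each option.
count_mode = (mode_filled == most_frequent).sum()
count_placeholder = (placeholder_filled == most_frequent).sum()
print(most_frequent, count_mode, count_placeholder)
```

With mode imputation, every unknown vehicle is counted as the most frequent type, so that type looks far more common than the observed data supports. With the placeholder, the observed counts stay as they were and the unknowns remain identifiable.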