I’m on the last lesson of the course Working With Missing Data, where we have to impute the missing data with the most frequent value among all columns.
In step 8, we use the following logic for each pair of vehicle/cause columns:
For values where the vehicle is null and the cause is non-null, set the vehicle to Unspecified.
For values where the cause is null and the vehicle is non-null, set the cause to Unspecified.
Aren’t we supposed to replace the vehicle with Sedan instead of Unspecified for logic #1? After all, Sedan is the most frequent vehicle:
print(top10_vehicles)
Sedan 33133
Station Wagon/Sport Utility Vehicle 26124
PASSENGER VEHICLE 16026
SPORT UTILITY / STATION WAGON 12356
Taxi 3482
Pick-up Truck 2373
TAXI 1892
Box Truck 1659
Bike 1190
Bus 1162
dtype: int64
For reference, here are the top 10 most frequent causes of incidents:
print(top_10_causes)
Unspecified 57481
Driver Inattention/Distraction 17650
Following Too Closely 6567
Failure to Yield Right-of-Way 4566
Passing or Lane Usage Improper 3260
Passing Too Closely 3045
Backing Unsafely 3001
Other Vehicular 2523
Unsafe Lane Changing 2372
Turning Improperly 1590
dtype: int64
I’ll appreciate it if someone can explain it a bit further.
The top “cause” is an “Unspecified” placeholder. This is more useful than a null value because it distinguishes between a value that is missing because only a certain number of vehicles were involved in the collision and one that is missing because the contributing cause for a particular vehicle is unknown.
The vehicles columns don’t have an equivalent, but we can still use the same technique.
The choice to follow the same approach for vehicles as that for the cause is an intentional one.
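For what it’s worth, here is a minimal sketch of the step 8 logic for a single vehicle/cause pair. The column names `vehicle_1` and `cause_vehicle_1` and the toy data are my own assumptions for illustration, not necessarily what the lesson uses:

```python
import numpy as np
import pandas as pd

# Toy data; the column names are assumed for illustration only.
df = pd.DataFrame({
    "vehicle_1": ["Sedan", np.nan, "Taxi", np.nan],
    "cause_vehicle_1": ["Unspecified", "Backing Unsafely", np.nan, np.nan],
})

v_col, c_col = "vehicle_1", "cause_vehicle_1"

# Build both masks before modifying anything, so the second rule
# is not affected by values filled in by the first.
v_missing = df[v_col].isnull() & df[c_col].notnull()
c_missing = df[c_col].isnull() & df[v_col].notnull()

# Fill only the half of the pair that is missing.
df[v_col] = df[v_col].mask(v_missing, "Unspecified")
df[c_col] = df[c_col].mask(c_missing, "Unspecified")

print(df)
```

Rows where both the vehicle and the cause are null stay null, which preserves the "no such vehicle in this collision" case.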
You can certainly try to replace it with Sedan, but the question becomes whether that’s a more reasonable assumption than sticking with Unspecified. If we don’t know the type of vehicle, is it better to assume that it’s a sedan?
Maybe if we had more data or information, we could say so. Since we don’t, it’s really about what makes more sense given what we want to achieve. As the other reply mentions, you can and should definitely try it both ways.
But be aware of the biases your choice might introduce into the model. If a dataset included specific model numbers for vehicles instead and we chose the most frequent value, we could end up with a model that predicts that specific model number causes more accidents, even if that is not the case.
Let’s say we had values that were close to equivalent, e.g. Sedan (55,000) and Unspecified (57,481). In that case, would it be wise to assign most of the missing values as Sedan for the vehicle column?
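If you want to see the effect for yourself, here is a rough sketch comparing the two choices on a made-up series (the values and variable names are hypothetical, not taken from the course dataset):

```python
import numpy as np
import pandas as pd

# Made-up vehicle column: 5 known values, 5 missing.
vehicles = pd.Series(
    ["Sedan", "Sedan", "Sedan", "Taxi", "Bus"] + [np.nan] * 5
)

# Option A: impute with the most frequent observed value.
most_frequent = vehicles.mode()[0]
with_mode = vehicles.fillna(most_frequent)

# Option B: impute with a placeholder category.
with_placeholder = vehicles.fillna("Unspecified")

# Share of Sedan among the values we actually observed.
sedan_share_observed = (vehicles == "Sedan").sum() / vehicles.notnull().sum()
# Share of Sedan over all rows after each imputation.
sedan_share_mode = (with_mode == "Sedan").mean()
sedan_share_placeholder = (with_placeholder == "Sedan").mean()

print(sedan_share_observed, sedan_share_mode, sedan_share_placeholder)
```

With the most frequent value, every unknown row is counted as a Sedan, so Sedan's share of all rows grows well beyond what was observed. With the placeholder, the unknown rows stay visible as their own category.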