Hi DQ Community,
I’m working on step 8, “Filling Unknown Values with a Placeholder,” in the Working with Missing Data mission. My question is about the logic behind using Unspecified to impute null values. I realize that pedagogically this makes sense for teaching the syntax and methods of this mission, but I’m wondering whether the reasoning would hold up in practice. If this were a real-life problem instead of a programming lesson, would it make sense to interpret Unspecified the way we’re interpreting it?
As I understand it, the lesson assumes that Unspecified means a vehicle was accounted for in an accident, but its cause could not be determined by the officer on the scene, so the officer entered Unspecified as the cause. I’m not sure one can necessarily say this is the case. If it were, then each Unspecified in a cause column would have an entry in its corresponding vehicle column (e.g., cause_vehicle_2 has Unspecified and vehicle_2 has Sedan), but there are some rows for which that’s not the case. I’ll use vehicle_2 and cause_vehicle_2 as an example:
CODE 1:
import pandas as pd

# Make a DataFrame, `vc`, of just the vehicle and cause data
# (Hopefully, I'm creating this DataFrame correctly)
# `vehicle` and `cause` are DataFrames defined before this snippet
vc = pd.concat([vehicle, cause], axis=1)
vc
OUTPUT 1:
CODE 2:
# Select cases where `vehicle_2` is null
v2nullc2 = vc.loc[vc['vehicle_2'].isnull(), ['vehicle_2', 'cause_vehicle_2']].copy()
v2nullc2
OUTPUT 2:
We see that there are cases where vehicle_2 has a null entry (NaN) and cause_vehicle_2 is Unspecified, which makes me question whether we can accurately say that Unspecified in the cause column means the vehicle was accounted for but the officer couldn’t identify the cause and therefore entered Unspecified.
CODE 3:
# Select cases where 'cause_vehicle_2' is null
v2c2null = vc.loc[vc['cause_vehicle_2'].isnull(), ['vehicle_2', 'cause_vehicle_2']].copy()
v2c2null
OUTPUT 3a:
OUTPUT 3b:
There are cases like row indices 56 and 172 where there’s an entry under vehicle_2 (Taxi and Sedan, respectively) and NaN as the corresponding cause_vehicle_2 entry. This seems reasonable because I could imagine that a vehicle is accounted for while its cause was not entered or went missing for some other reason.
CODE 4:
# Select cases where 'cause_vehicle_2' is 'Unspecified'
v2c2unsp = vc.loc[vc['cause_vehicle_2'] == 'Unspecified', ['vehicle_2', 'cause_vehicle_2']].copy()
v2c2unsp
OUTPUT 4:
We can see that there are instances (e.g., row indices 0 and 13) where Unspecified is in the cause_vehicle_2 column and NaN is in the vehicle_2 column, which leads me to question whether we can say that, for these instances, a vehicle was accounted for but its contributing cause is unknown.
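To summarize the three checks above in one place, here is a rough sketch that cross-tabulates whether vehicle_2 is recorded against the kind of entry in cause_vehicle_2. The small vc DataFrame below is made-up toy data standing in for the mission's dataset, and the names vehicle_status, cause_status, summary, and contradictions are just my own labels, so treat this as an illustration of the idea rather than the mission's method.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the mission's data; these values are invented for illustration.
vc = pd.DataFrame({
    "vehicle_2": [np.nan, "Sedan", "Taxi", np.nan, "Sedan"],
    "cause_vehicle_2": ["Unspecified", "Unspecified", np.nan, np.nan, "Driver Inattention"],
})

# Label each row by whether a vehicle is recorded.
vehicle_status = (
    vc["vehicle_2"]
    .isnull()
    .map({True: "vehicle missing", False: "vehicle recorded"})
    .rename("vehicle_2 entry")
)

# Label each row by the kind of cause entry it has.
cause_status = pd.Series(
    np.select(
        [vc["cause_vehicle_2"].isnull(), vc["cause_vehicle_2"] == "Unspecified"],
        ["cause missing", "Unspecified"],
        default="other cause",
    ),
    index=vc.index,
    name="cause_vehicle_2 entry",
)

# Count every combination of the two labels.
summary = pd.crosstab(vehicle_status, cause_status)
print(summary)

# Rows that conflict with "Unspecified means a recorded vehicle with an unknown cause":
# the cause is Unspecified, yet no vehicle is recorded.
contradictions = vc[(vc["cause_vehicle_2"] == "Unspecified") & vc["vehicle_2"].isnull()]
print(len(contradictions))
```

On the real data, a nonzero count in the "vehicle missing" / "Unspecified" cell would be the pattern I’m asking about.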
Hopefully this makes sense. Please let me know if I’m misunderstanding the logic behind the exercise.
Once again, I recognize the value of this logic for the purposes of teaching the lesson, but my question is more about how sound this would be in practice. Are we making a valid assumption in our interpretation?
Thanks
PS: Sub-question: Have I been using .copy() correctly in the above code? Should I just always use .copy() when assigning some part of a pandas object to a new name?
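To make the sub-question concrete, here is a minimal sketch of what I understand .copy() to guarantee: changes made to the copied subset should not show up in the original DataFrame. The data is made up for illustration, and subset is a hypothetical name, not something from the mission.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
vc = pd.DataFrame({
    "vehicle_2": [np.nan, "Sedan", "Taxi"],
    "cause_vehicle_2": ["Unspecified", np.nan, np.nan],
})

# Select a subset and take an explicit copy, as in the code above.
subset = vc.loc[vc["cause_vehicle_2"].isnull(), ["vehicle_2", "cause_vehicle_2"]].copy()

# Fill the placeholder in the copy only.
subset["cause_vehicle_2"] = subset["cause_vehicle_2"].fillna("Unspecified")

# The copy has changed...
print(subset["cause_vehicle_2"].tolist())

# ...while the original still has its null values.
print(vc["cause_vehicle_2"].isnull().sum())
```

What I’m unsure about is whether the explicit .copy() is actually needed after a .loc selection like this one, or whether it is just a harmless safeguard.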