Is our interpretation of the imputation value (`Unspecified`) valid?

Hi DQ Community,

I’m working on step 8, “Filling Unknown Values with a Placeholder,” in the Working with Missing Data mission. My question is about the logic behind using Unspecified to impute null values. I realize that pedagogically this makes sense for teaching the syntax and methods of this mission, but I’m wondering whether the reasoning would hold up in practice. If this were a real-life problem instead of a programming lesson, would it make sense to interpret Unspecified the way we’re interpreting it?

As I understand it, the lesson assumes that Unspecified means a vehicle was accounted for in an accident, but its cause could not be determined by the officer on the scene, so the officer entered Unspecified as the cause. I’m not sure one can necessarily say this is the case. If it were, then each Unspecified in a cause column would have an entry in its corresponding vehicle column (e.g., cause_vehicle_2 has Unspecified and vehicle_2 has Sedan), but there are some rows for which that’s not true. I’ll use vehicle_2 and cause_vehicle_2 as an example:

CODE 1:

# Make a DataFrame, `vc` of just the vehicle and cause data
# (Hopefully, I'm creating this DataFrame correctly)
vc = pd.concat([vehicle, cause], axis=1)
vc

OUTPUT 1:

CODE 2:

# Select cases where `vehicle_2` is null
v2nullc2 = vc.loc[vc['vehicle_2'].isnull(), ['vehicle_2', 'cause_vehicle_2']].copy()
v2nullc2

OUTPUT 2:

We see that there are cases where vehicle_2 has a null entry (NaN) and cause_vehicle_2 is Unspecified, which causes me to question whether we can accurately say that Unspecified in the cause column means that the vehicle was accounted for but the officer couldn’t identify the cause and, therefore, entered Unspecified.

CODE 3:

# Select cases where 'cause_vehicle_2' is null
v2c2null = vc.loc[vc['cause_vehicle_2'].isnull(), ['vehicle_2', 'cause_vehicle_2']].copy()
v2c2null

OUTPUT 3a:

OUTPUT 3b:

There are cases like row indices 56 and 172 where there’s an entry under vehicle_2 (Taxi and Sedan, respectively) and NaN as the corresponding cause_vehicle_2 entry. This seems reasonable because I could imagine that a vehicle is accounted for while its cause was not entered or went missing for some other reason.

CODE 4:

# Select cases where 'cause_vehicle_2' is 'Unspecified'
v2c2unsp = vc.loc[vc['cause_vehicle_2'] == 'Unspecified', ['vehicle_2', 'cause_vehicle_2']].copy()
v2c2unsp

OUTPUT 4:

We can see that there are instances (e.g., row indices 0 and 13) where Unspecified is in the cause_vehicle_2 column and NaN is in the vehicle_2 column, which leads me to question whether we can say that, for these instances, a vehicle was accounted for but its contributing cause is unknown.
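The checks above can be condensed into a single cross-tabulation. This is only a minimal sketch: `vc_toy` is a hypothetical six-row stand-in for the real `vc` DataFrame, and the labels such as "vehicle missing" are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the mission's `vc` DataFrame
vc_toy = pd.DataFrame({
    "vehicle_2": [np.nan, "Sedan", "Taxi", np.nan, "Sedan", np.nan],
    "cause_vehicle_2": ["Unspecified", "Unspecified", np.nan, np.nan,
                        "Driver Inattention", "Unspecified"],
})

# Label each vehicle entry as missing or present
vehicle_state = vc_toy["vehicle_2"].isnull().map(
    {True: "vehicle missing", False: "vehicle present"}
)

# Label each cause as missing, 'Unspecified', or any other recorded cause
cause_state = vc_toy["cause_vehicle_2"].fillna("cause missing")
cause_state = cause_state.where(
    cause_state.isin(["cause missing", "Unspecified"]), "other cause"
)

# Count how often each vehicle/cause combination occurs
summary = pd.crosstab(vehicle_state, cause_state)
print(summary)
```

On the real data, the cell where the vehicle is missing and the cause is Unspecified would show how many rows conflict with the lesson's interpretation.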

Hopefully this makes sense. Please let me know if I’m misunderstanding the logic behind the exercise.

Once again, I recognize the value of this logic for the purposes of teaching the lesson, but my question is more about how sound this would be in practice. Are we making a valid assumption in our interpretation?

Thanks

PS: Sub-question: Have I been using .copy() correctly in the above code? Should I just always use .copy() when assigning some part of a Pandas object to a new name?

Hey, Quinones. Nice job digging into this.

What if the vehicle couldn’t be determined so they left it blank?

I think your objection is legitimate, but I think it’s also legitimate to do what was done in this mission, regardless of the educational value.

One possible way to resolve this is to analyze the different scenarios and see where each takes you; it could be that the results aren’t that different. Or it could be that some institutional decision determines which scenario to choose.
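As a rough sketch of what comparing scenarios might look like, the snippet below counts Unspecified and null causes under three imputation choices. The data (`toy`) and the scenario names are hypothetical, and the "fill only where a vehicle is recorded" rule is an assumption about how one might encode the lesson's interpretation, not necessarily the mission's exact code.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the real analysis would use the mission's DataFrame
toy = pd.DataFrame({
    "vehicle_2": [np.nan, "Sedan", "Taxi", np.nan, "Sedan", np.nan],
    "cause_vehicle_2": ["Unspecified", "Unspecified", np.nan, np.nan,
                        "Driver Inattention", "Unspecified"],
})
cause = toy["cause_vehicle_2"]

# Scenario 1: leave missing causes as they are
as_is = cause

# Scenario 2: fill a missing cause only when a vehicle is recorded in that row
fill_mask = cause.isnull() & toy["vehicle_2"].notnull()
fill_if_vehicle = cause.mask(fill_mask, "Unspecified")

# Scenario 3: fill every missing cause with the placeholder
fill_all = cause.fillna("Unspecified")

# Summarize each scenario by its number of 'Unspecified' and null entries
scenarios = {"as_is": as_is, "fill_if_vehicle": fill_if_vehicle, "fill_all": fill_all}
comparison = pd.DataFrame({
    name: {"unspecified": int(s.eq("Unspecified").sum()), "null": int(s.isnull().sum())}
    for name, s in scenarios.items()
}).T
print(comparison)
```

If the downstream results (for example, the ranking of the most common causes) barely change across the rows of `comparison`, the choice of interpretation matters less.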

If you want to make sure you don’t modify the values in the original dataframe, you should use DataFrame.copy. It’s unnecessary here because you’re not changing anything; you’re just exploring.

Hey, Bruno, thanks for your response. It makes sense. I just want to clarify one thing about using .copy(). Even though I’m exploring and not trying to change the original dataframe vc, I am assigning the selection to a new dataframe, v2nullc2, so I thought this was a case where .copy() was advised. Here’s the code in question:

# Select cases where `vehicle_2` is null
v2nullc2 = vc.loc[vc['vehicle_2'].isnull(), ['vehicle_2', 'cause_vehicle_2']].copy()

Also, doesn’t this do something like chained indexing which could lead to problems if .copy() is not used?

Thanks again.

You’re assigning it to a variable (v2nullc2) that is otherwise independent of the dataframe. If you were assigning it to an existing column in the dataframe, then there could be issues (you’d be modifying the dataset).
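To illustrate the second case, here is a small sketch of assigning onto the dataframe itself. `vc_toy` and `snapshot` are hypothetical names and the data is made up; the point is only that a single `.loc` assignment writes into the dataframe it is called on.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for `vc`
vc_toy = pd.DataFrame({
    "vehicle_2": [np.nan, "Sedan", "Taxi", np.nan],
    "cause_vehicle_2": ["Unspecified", "Unspecified", np.nan, np.nan],
})
snapshot = vc_toy.copy()  # independent copy kept for comparison

# Rows where a vehicle is recorded but its cause is missing
mask = vc_toy["cause_vehicle_2"].isnull() & vc_toy["vehicle_2"].notnull()

# One .loc call selects and assigns in a single step,
# so the write lands on vc_toy itself
vc_toy.loc[mask, "cause_vehicle_2"] = "Unspecified"

changed = not vc_toy.equals(snapshot)
print(changed)
```

Here the dataset really is modified, which is the situation where some care is needed.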

Regarding chained indexing, I stand by what I said: if you’re not modifying the dataset, there are no issues.


Edit: @quinones If you were to modify v2nullc2 later, then you’d have to be careful if you didn’t copy the object.
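A minimal sketch of that later-modification case, using hypothetical toy data (`vc_toy`, `subset`, and the "Unknown" fill value are made up for illustration). With `.copy()`, the subset is explicitly independent, so editing it leaves the original untouched. Without `.copy()`, depending on the pandas version, the same edit may emit a `SettingWithCopyWarning`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for `vc`
vc_toy = pd.DataFrame({
    "vehicle_2": [np.nan, "Sedan", "Taxi", np.nan, "Sedan", np.nan],
    "cause_vehicle_2": ["Unspecified", "Unspecified", np.nan, np.nan,
                        "Driver Inattention", "Unspecified"],
})

# Take an explicit copy of the selection before modifying it
subset = vc_toy.loc[vc_toy["vehicle_2"].isnull(),
                    ["vehicle_2", "cause_vehicle_2"]].copy()

# Modify the subset only
subset["vehicle_2"] = subset["vehicle_2"].fillna("Unknown")

# The subset has no nulls left in vehicle_2; the original still has all of its nulls
print(subset["vehicle_2"].isnull().sum(), vc_toy["vehicle_2"].isnull().sum())
```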

OK, now it makes a lot more sense! The issue arises not when assigning to a new variable (or, perhaps more fundamentally, one that’s independent of the dataframe, as you said) but when assigning onto the dataframe itself, which modifies it. It seems really obvious now. Thanks again for your patience and clarification!


Sorry, I realised later that I could have been more explicit. Glad you got it!