# Is our interpretation of the imputation value (`Unspecified`) valid?

Hi DQ Community,

I’m working on step 8, “Filling Unknown Values with a Placeholder,” in the Working with Missing Data mission. My question is about the logic behind using `Unspecified` to impute null values. I realize that pedagogically this makes sense for teaching the syntax and methods of this mission, but I’m wondering whether the reasoning would hold up in practice. If this were a real-life problem instead of a programming lesson, would it make sense to interpret `Unspecified` the way we’re interpreting it?

As I understand it, the lesson assumes that `Unspecified` means a vehicle was accounted for in an accident, but its cause could not be determined by the officer on the scene, so the officer entered `Unspecified` as the cause. I’m not sure one can necessarily say this is the case. If it were, then each `Unspecified` in a cause column would have an entry in its corresponding vehicle column (e.g., `cause_vehicle_2` has `Unspecified` and `vehicle_2` has `Sedan`), but there are some rows for which that’s not true. I’ll use `vehicle_2` and `cause_vehicle_2` as an example:

CODE 1:

```python
import pandas as pd

# `vehicle` and `cause` are assumed to be defined earlier in the mission
# Make a DataFrame, `vc`, of just the vehicle and cause data
# (Hopefully, I'm creating this DataFrame correctly)
vc = pd.concat([vehicle, cause], axis=1)
vc
```

OUTPUT 1:

CODE 2:

```python
# Select cases where `vehicle_2` is null
v2nullc2 = vc.loc[vc['vehicle_2'].isnull(), ['vehicle_2', 'cause_vehicle_2']].copy()
v2nullc2
```

OUTPUT 2:

We see that there are cases where `vehicle_2` has a null entry (`NaN`) and `cause_vehicle_2` is `Unspecified`, which causes me to question whether we can accurately say that `Unspecified` in the cause column means that the vehicle was accounted for but the officer couldn’t identify the cause and, therefore, entered `Unspecified`.

CODE 3:

```python
# Select cases where 'cause_vehicle_2' is null
v2c2null = vc.loc[vc['cause_vehicle_2'].isnull(), ['vehicle_2', 'cause_vehicle_2']].copy()
v2c2null
```

OUTPUT 3a:

OUTPUT 3b:

There are cases like row indices `56` and `172` where there’s an entry under `vehicle_2` (`Taxi` and `Sedan`, respectively) and `NaN` as the corresponding `cause_vehicle_2` entry. This seems reasonable because I could imagine that a vehicle is accounted for while its cause was not entered or went missing for some other reason.

CODE 4:

``````# Select cases where 'cause_vehicle_2' is 'Unspecified'
v2c2unsp = vc.loc[vc['cause_vehicle_2'] == 'Unspecified', ['vehicle_2', 'cause_vehicle_2']].copy()
v2c2unsp
``````

OUTPUT 4:

We can see that there are instances (e.g., row indices `0` and `13`) where `Unspecified` is in the `cause_vehicle_2` column and `NaN` is in the `vehicle_2` column, which leads me to question whether we can say that for these instances a vehicle was accounted for but its contributing cause is unknown.
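
To see how common each combination is, rather than eyeballing the filtered frames, something like the cross-tabulation below might help. This is only a sketch: `vc_demo` is a made-up stand-in for `vc` with illustrative values, and the labels `vehicle_state` and `cause_state` are names I introduced for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `vc`; these values are illustrative, not real data.
vc_demo = pd.DataFrame({
    "vehicle_2": ["Sedan", np.nan, np.nan, "Taxi", np.nan, "Sedan"],
    "cause_vehicle_2": ["Unspecified", "Unspecified", np.nan, np.nan,
                        "Unspecified", "Driver Inattention"],
})

# Label each row by whether a vehicle was recorded.
vehicle_state = pd.Series(
    np.where(vc_demo["vehicle_2"].isnull(), "vehicle missing", "vehicle recorded"),
    index=vc_demo.index,
    name="vehicle_state",
)

# Label each cause as missing, 'Unspecified', or some other recorded cause.
cause_state = pd.Series(
    np.where(
        vc_demo["cause_vehicle_2"].isnull(),
        "missing",
        np.where(vc_demo["cause_vehicle_2"] == "Unspecified",
                 "Unspecified", "other cause"),
    ),
    index=vc_demo.index,
    name="cause_state",
)

# Count every combination of vehicle state and cause state.
combo_counts = pd.crosstab(vehicle_state, cause_state)
print(combo_counts)
```

On the real data, the cell for a missing vehicle with an `Unspecified` cause would show how many rows conflict with the lesson's reading.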

Hopefully this makes sense. Please let me know if I’m misunderstanding the logic behind the exercise.

Once again, I recognize the value of this logic for the purposes of teaching the lesson, but my question is more about how sound this would be in practice. Are we making a valid assumption in our interpretation?

Thanks

PS: Sub-question: Have I been using `.copy()` correctly in the above code? Should I just always use `.copy()` when assigning some part of a Pandas object to a new name?

Hey, Quinones. Nice job digging into this.

What if the vehicle couldn’t be determined so they left it blank?

I think your objection is legitimate, but I think it’s also legitimate to do what was done in this mission, regardless of the educational value.

One possible way to resolve this is to analyze different scenarios and see where each takes you. It could be that the results aren’t that much different. Or it could be that there is some institutional decision that determines which scenario to choose.
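
As a rough sketch of what comparing scenarios could look like, the snippet below counts implied second vehicles under two readings. Everything here is an assumption for illustration: `demo` is a made-up sample, "scenario A" is my paraphrase of the reading discussed in this thread (any recorded cause, including `Unspecified`, implies a vehicle), and "scenario B" is a stricter alternative that ignores `Unspecified` causes with no vehicle.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; these values are illustrative, not real data.
demo = pd.DataFrame({
    "vehicle_2": ["Sedan", np.nan, np.nan, "Taxi", np.nan],
    "cause_vehicle_2": ["Unspecified", "Unspecified", np.nan, np.nan,
                        "Following Too Closely"],
})

vehicle_known = demo["vehicle_2"].notnull()
cause_known = demo["cause_vehicle_2"].notnull()

# Scenario A: a recorded cause (even 'Unspecified') implies a vehicle,
# and a recorded vehicle implies an undetermined cause.
scenario_a = demo.copy()
scenario_a.loc[~vehicle_known & cause_known, "vehicle_2"] = "Unspecified"
scenario_a.loc[vehicle_known & ~cause_known, "cause_vehicle_2"] = "Unspecified"
second_vehicles_a = int(scenario_a["vehicle_2"].notnull().sum())

# Scenario B: an 'Unspecified' cause with no vehicle is not treated as
# evidence of a vehicle; only a recorded vehicle or a specific cause counts.
specific_cause = cause_known & (demo["cause_vehicle_2"] != "Unspecified")
second_vehicles_b = int((vehicle_known | specific_cause).sum())

print("Scenario A:", second_vehicles_a)
print("Scenario B:", second_vehicles_b)
```

If the two counts are close on the real data, the choice of interpretation probably doesn't matter much for the analysis.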

If you want to make sure you don’t modify the values in the original dataframe, you should use `DataFrame.copy`. It’s unnecessary here because you’re not changing anything; you’re just exploring.

Hey, Bruno, thanks for your response. It makes sense. Just to clarify one thing about using `.copy()`: even though I’m exploring and not trying to change the original dataframe `vc`, I am assigning the result of my selection to a new dataframe, `v2nullc2`, so I thought this was a case where `.copy()` was advised. Here’s an example of the code:

```python
# Select cases where `vehicle_2` is null
v2nullc2 = vc.loc[vc['vehicle_2'].isnull(), ['vehicle_2', 'cause_vehicle_2']].copy()
```

Also, doesn’t this do something like chained indexing which could lead to problems if `.copy()` is not used?

Thanks again.

You’re assigning it to a variable (`v2nullc2`) that is otherwise independent of the dataframe. If you were assigning it to an existing column in the dataframe, then there could be issues (you’d be modifying the dataset).

Regarding chained indexing, I stand by what I said: if you’re not modifying the dataset, there are no issues.

Edit: @quinones If you were to modify `v2nullc2` later, then you’d have to be careful if you didn’t copy the object.
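
To make that last point concrete, here is a minimal sketch of the case where the copy does matter: the subset is modified later. `vc_demo` is a made-up stand-in for `vc`. With the explicit `.copy()`, the subset is independent of the original, so the later assignment can't touch `vc_demo`, and in pandas versions that emit `SettingWithCopyWarning` it shouldn't trigger one.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `vc`; these values are illustrative, not real data.
vc_demo = pd.DataFrame({
    "vehicle_2": [np.nan, "Sedan", np.nan],
    "cause_vehicle_2": ["Unspecified", np.nan, np.nan],
})

# Explicit copy: `subset` is independent of `vc_demo`.
subset = vc_demo.loc[
    vc_demo["vehicle_2"].isnull(), ["vehicle_2", "cause_vehicle_2"]
].copy()

# Modify the subset later; the original dataframe is left untouched.
subset["vehicle_2"] = subset["vehicle_2"].fillna("Unspecified")

print(subset)
print(vc_demo["vehicle_2"].isnull().sum())  # nulls remaining in the original
```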

OK, now it makes a lot more sense! The issue arises not when assigning to a new variable (or, perhaps more fundamentally, one that’s independent of the dataframe, as you said), but when assigning onto the dataframe being modified. It seems really obvious now. Thanks again for your patience and clarification!

Sorry, I realised later that I could have been more explicit. Glad you got it!