Confusing Explanation

Screen Link: https://app.dataquest.io/m/370/working-with-missing-data/8/filling-unknown-values-with-a-placeholder

The top “cause” is an “Unspecified” placeholder. This is useful instead of a null value as it makes the distinction between a value that is missing because there were only a certain number of vehicles in the collision versus one that is because the contributing cause for a particular vehicle is unknown.

This statement doesn’t make any sense to me. Can someone please explain it in easy terms?

You are printing the top 10 causes of the accidents. The first entry is the count for the cause Unspecified. As the name suggests, the cause of the accident was not specified in that many cases.
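As a rough sketch of how such a top-10 list can be produced, here is a small example on toy data. The DataFrame name `mvc`, the column names, and the values are assumptions for illustration, not the real dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the collisions data (names and values are assumed).
mvc = pd.DataFrame({
    "cause_vehicle_1": ["Unspecified", "Driver Inattention",
                        "Unspecified", "Backing Unsafely"],
    "cause_vehicle_2": ["Unspecified", np.nan, np.nan, "Unspecified"],
})

cause_cols = ["cause_vehicle_1", "cause_vehicle_2"]

# Put every cause column into one long Series, then count each value.
# value_counts() ignores NaN by default and sorts by count, descending.
all_causes = pd.concat([mvc[col] for col in cause_cols], ignore_index=True)
top_causes = all_causes.value_counts().head(10)
print(top_causes)
```

In this toy data Unspecified comes out on top because it appears four times, while NaN is not counted at all.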

Whenever you have inspected data before, you have probably come across NaN values. You might have seen, for example, that a column contains 1234 NaN values.

The quoted sentence says that the placeholder Unspecified has been used instead of NaN, which normally represents null or missing values, because the placeholder is more useful.

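Since we don’t know the cause of the accident, labeling it Unspecified is better than leaving the value missing. A null in a cause column can simply mean there was no such vehicle in the collision, whereas Unspecified means a vehicle was involved but its contributing cause is unknown. Unspecified causes can therefore still be useful to the analysis, whereas missing values might, for example, be removed entirely.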
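To make the difference concrete, here is a minimal sketch that counts the two kinds of "missing" separately. The column names and rows are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three collisions, looking at the second vehicle slot.
mvc = pd.DataFrame({
    "vehicle_2": ["Sedan", np.nan, "Taxi"],
    "cause_vehicle_2": ["Unspecified", np.nan, "Following Too Closely"],
})

# Row 0: a second vehicle exists, but its cause is unknown -> "Unspecified"
# Row 1: there was no second vehicle at all -> NaN
unknown_cause = (mvc["cause_vehicle_2"] == "Unspecified").sum()
no_value = mvc["cause_vehicle_2"].isnull().sum()

print("cause unknown:", unknown_cause)
print("no value at all:", no_value)
```

If both cases were stored as NaN, the two counts could not be told apart.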

It’s an example of how you can handle missing data differently depending on the data you have and its context.

That makes sense now.

The next paragraph states that:

The vehicles columns don’t have an equivalent, but we can still use the same technique. Here’s the logic we’ll need to do for each pair of vehicle/cause columns:

  1. For values where the vehicle is null and the cause is non-null, set the vehicle to Unspecified.
  2. For values where the cause is null and the vehicle is not-null, set the cause to Unspecified.
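The two rules above could be sketched roughly like this. This is only one possible way to write it, not necessarily the course's solution; the `vehicle_N` / `cause_vehicle_N` column names and the toy rows are assumptions.

```python
import numpy as np
import pandas as pd

# Toy data with two vehicle/cause pairs (names and values are assumed).
mvc = pd.DataFrame({
    "vehicle_1": ["Sedan", np.nan, "Taxi"],
    "cause_vehicle_1": ["Unspecified", "Driver Inattention", np.nan],
    "vehicle_2": [np.nan, "Bus", np.nan],
    "cause_vehicle_2": [np.nan, np.nan, "Backing Unsafely"],
})

for i in (1, 2):
    v_col = f"vehicle_{i}"
    c_col = f"cause_vehicle_{i}"

    # Build both masks first, so filling one column does not affect the other.
    v_missing = mvc[v_col].isnull() & mvc[c_col].notnull()  # rule 1
    c_missing = mvc[c_col].isnull() & mvc[v_col].notnull()  # rule 2

    # Replace only the masked cells; rows where both are null stay null.
    mvc[v_col] = mvc[v_col].mask(v_missing, "Unspecified")
    mvc[c_col] = mvc[c_col].mask(c_missing, "Unspecified")

print(mvc)
```

Rows where both the vehicle and the cause are null are left untouched, since those most likely mean there was no such vehicle in the collision.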

What equivalent is being referred to here?

Unlike the cause columns, the vehicle columns don’t have Unspecified as a value.

I think I get it now. Could you please answer another question of mine: what is null correlation?