Confusing Explanation

This doesn’t have to do with code, but rather with an explanation that I’m having trouble following from this screen:

https://app.dataquest.io/m/137/data-cleaning-walkthrough%3A-combining-the-data/11/left-right-inner-and-outer-joins

At the bottom of the explanation on this screen we see this, noting that an INNER JOIN would remove necessary data:

“This means that we may need to use different merge strategies with different data sets. Some of the data sets have a lot of missing DBN values. This makes a left join more appropriate, because we don’t want to lose too many rows when we merge. If we did an inner join, we would lose the data for many high schools”

But then we get the following explanation that for another df, using an INNER JOIN would preserve information that we are trying to keep:

“Some data sets have DBN values that are almost identical to those in sat_results. Those data sets also have information we need to keep. Most of our analysis would be impossible if a significant number of rows was missing from demographics, for example. Therefore, we’ll do an inner join to avoid missing data in these columns.”

Why would an INNER JOIN ignore information in one case, but not in the other?

Hi @Willyjgolden,

I think I can help explain,

For the first paragraph it is saying:

  1. Since one of those two data sets has a lot of missing DBN values, an Inner join would result in many rows being lost.
  2. A left join would preserve the data in the rows that have missing DBNs but would result in null values in the DBN column
  3. They are making the judgement that we would rather deal with null values than loose that many rows.

For the second paragraph, it is saying:

  1. For these two data sets, an inner join would mean that the resulting dataset would not have any null values in the DBN column which is desirable.
  2. Since the DBNs are nearly identical, the amount of rows that would be dropped by using inner join would be very small.
  3. In this case they are making the judgement that is is worth loosing those rows in order to have no null values.

So in the first paragraph it is referring to “lost data” (aka rows being dropped) if inner join was used; the second paragraph is referring to “avoiding missing data” (aka avoiding null values) if left join was used.

Basically its always a trade off between null values and lost data (unless the joining column matches perfectly for both datasets).

1 Like