Contrasting the Dataframe.merge() merge types

I am currently working on the Data Cleaning Project Walkthrough: Combining the Data mission, and we are reviewing the df.merge() method. A couple of questions come up for me:

  1. On Step 11, the reading says that a left join is more appropriate when we don’t want to lose too many rows. However, if we are concerned about losing rows, shouldn’t we do an outer join? In practice, do data analysts use left joins more often than outer joins? How do we know which one to use?

  2. On Step 13, we are instructed to use an inner join, because the datasets we are merging contain information that is more valuable to our analysis than in the previous steps. If we are concerned about losing valuable information, why would we use an inner join and risk this data being dropped? Why not use a left or outer join to make sure that too many rows aren't dropped?


You have the same two questions I asked a year ago. I don't remember them being answered, and the course instructions weren't updated.

  1. Yes, an outer join is the most inclusive option. Which one to use depends on the goal of the analysis, but I would say inner/left/right joins are more common than outer joins. With the latter three types, a LEFT JOIN can generate null values in every column of the right table, for rows whose joining-column value has no match in the right table's joining column. Most examples don't show that the id from the right table is NULL too (because it's a practically useless piece of info), but I demonstrate that below.
!pip install pandasql

import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'val': [10, 20, 20]})
df2 = pd.DataFrame({'id': [2, 3, 4], 'val': [100, 200, 200]})


from pandasql import sqldf


q = """
    SELECT * 
    FROM df1
    LEFT JOIN df2 
    ON df1.id = df2.id
    """
pysqldf = lambda q: sqldf(q, globals())
pysqldf(q)

After generating the nulls, you can count them to see how many rows are in the left table but not the right.
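The same check can also be done natively in pandas, without pandasql, using merge's indicator parameter (a sketch reusing the df1/df2 frames from above):

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'val': [10, 20, 20]})
df2 = pd.DataFrame({'id': [2, 3, 4], 'val': [100, 200, 200]})

# how='left' keeps every row of df1; indicator=True adds a _merge column
# marking each row as 'both' or 'left_only'
merged = df1.merge(df2, on='id', how='left', indicator=True)

# count rows of df1 with no matching id in df2
left_only = (merged['_merge'] == 'left_only').sum()
print(left_only)  # 1 (id 1 has no match in df2)
```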

  2. That part must be wrong. An inner join loses the most information, while an outer join preserves all of it.
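To see the difference in row counts concretely, here is a small sketch with two made-up frames that share some ids:

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'a': ['x', 'y', 'z']})
df2 = pd.DataFrame({'id': [2, 3, 4], 'b': ['p', 'q', 'r']})

inner = df1.merge(df2, on='id', how='inner')  # only matching ids 2 and 3 survive
outer = df1.merge(df2, on='id', how='outer')  # all ids 1 through 4 are kept

print(len(inner), len(outer))  # 2 4
```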

Hi Han/Others,

I have the same questions as @znarkin .

I couldn’t understand your explanation clearly.

Could you please explain in more detail why the LEFT and INNER joins were chosen as the best merge strategies over an OUTER join, when it's obvious that OUTER retains all the observations compared to LEFT and INNER?

Here are the outcomes of each of the joins:

LEFT Join results: [image]

INNER Join results: [image]

OUTER Join results: [image]

So, when the OUTER join is the most inclusive merge strategy to pick, why were the LEFT and INNER joins chosen in this exercise instead, allowing the loss of one observation/row (DBN 55)??

Please guide.

Thanx in advance!!

Whenever we perform a JOIN operation, we need two tables (DataFrames) and a condition defined by the ON clause.

Let's call these two tables the LEFT TABLE and the RIGHT TABLE.

For an INNER JOIN, a row is returned only when some row in the LEFT TABLE and some row in the RIGHT TABLE match on the ON clause. Each matched row contains columns from the LEFT TABLE and columns from the RIGHT TABLE. If no match is found on the ON clause, nothing is returned.

For a LEFT JOIN, EVERYTHING in the LEFT TABLE is kept, and each row of the LEFT TABLE includes the following:

  • NULL values for the RIGHT TABLE's columns if no match is found on the ON clause

  • The respective RIGHT TABLE columns' values if a match is found on the ON clause.

    • For a SQL query, the respective RIGHT TABLE columns are those indicated in the SELECT clause.

For a RIGHT JOIN, EVERYTHING in the RIGHT TABLE is kept, and each row of the RIGHT TABLE includes the following:

  • NULL values for the LEFT TABLE's columns if no match is found on the ON clause

  • The respective LEFT TABLE columns' values if a match is found on the ON clause.

    • For a SQL query, the respective LEFT TABLE columns are those indicated in the SELECT clause.

An OUTER JOIN includes EVERYTHING in the LEFT TABLE and the RIGHT TABLE.

  • OUTER JOIN here means FULL OUTER JOIN.

  • Match on the ON clause

    • Each matched row has columns from both the LEFT TABLE and the RIGHT TABLE

    • This is similar to an INNER JOIN.

  • No match on the ON clause

    • Each non-matched row of the LEFT TABLE gets NULL values for the RIGHT TABLE's columns

    • Each non-matched row of the RIGHT TABLE gets NULL values for the LEFT TABLE's columns
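In pandas, how='outer' gives exactly this behaviour: unmatched rows from either side are kept, with NaN filling the other side's columns (a sketch with made-up frames):

```python
import pandas as pd

left = pd.DataFrame({'id': [1, 2], 'l_val': ['a', 'b']})
right = pd.DataFrame({'id': [2, 3], 'r_val': ['c', 'd']})

# full outer join: all ids from both frames appear in the result
outer = left.merge(right, on='id', how='outer')

# id 1 has no match in right -> r_val is NaN
# id 3 has no match in left  -> l_val is NaN
print(outer)
```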

NULL values in pandas are represented as NaN.

To determine which JOIN to use, work backwards from the final table with the columns you need. This depends on the question being asked.

In order to answer a particular question, you may need to do a JOIN operation. Most of the time, the JOIN is either a LEFT JOIN or a RIGHT JOIN; use an INNER JOIN only when you need matching data from both tables.

If the JOIN you need is a LEFT JOIN or a RIGHT JOIN, set the table with the most columns you need as the LEFT TABLE or the RIGHT TABLE. If you choose the RIGHT TABLE, do a RIGHT JOIN; if you choose the LEFT TABLE, do a LEFT JOIN.
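Concretely, a LEFT JOIN and a RIGHT JOIN are mirror images of each other in pandas, so the choice mostly comes down to which frame you put first (a sketch with made-up frames):

```python
import pandas as pd

a = pd.DataFrame({'id': [1, 2], 'x': ['p', 'q']})
b = pd.DataFrame({'id': [2, 3], 'y': ['r', 's']})

left = a.merge(b, on='id', how='left')    # keeps every row of a
right = b.merge(a, on='id', how='right')  # also keeps every row of a

print(len(left), len(right))  # 2 2
```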

My guess is that the LEFT JOIN sets up the columns/values we are interested in.

(As described above, a LEFT JOIN keeps everything in the LEFT TABLE, filling in NULL values for the RIGHT TABLE's columns wherever no match is found on the ON clause.)

Then the INNER JOIN simply retrieves the matched rows from the other table.

Summary:

  1. Use a LEFT JOIN or RIGHT JOIN to set up the rows and columns needed to answer the question. Only use this set of rows to answer the question.

  2. Then use an INNER JOIN to retrieve any other necessary data, pulling additional columns from the other table.

  3. An OUTER JOIN may include columns and rows that are not required to answer the question; the issue is the extra rows you don't need.

  4. Depending on the question, a combination of left/right/inner joins is more likely to be used.
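The left-then-inner workflow above can be sketched with hypothetical stand-ins for the mission's datasets (the names and values here are made up for illustration):

```python
import pandas as pd

# 'base' plays the role of sat_results; 'sparse' has gaps, 'complete' does not
base = pd.DataFrame({'DBN': ['01M292', '01M448', '01M450'],
                     'sat': [1100, 1200, 1300]})
sparse = pd.DataFrame({'DBN': ['01M292', '01M448'], 'ap': [5, 7]})
complete = pd.DataFrame({'DBN': ['01M292', '01M448', '01M450'],
                         'size': [30, 25, 28]})

# Step 1: LEFT JOIN keeps every base row, even where 'sparse' has no match
combined = base.merge(sparse, on='DBN', how='left')

# Step 2: INNER JOIN pulls in the richer dataset's columns
combined = combined.merge(complete, on='DBN', how='inner')

print(len(combined))  # 3 -- no base rows were lost
```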

Hey Alvin,

Thanx for the in depth explanation.

Sorry to bother you again.

To be precise, my doubt was actually about the choice of data sets used with the mentioned merge strategies, rather than about merges and joins in general.

Sorry for the unclear question.
Will try to express my doubt more clearly this time!!

I am stuck in the below exercise:

Link : Learn data science with Python and R projects

Mission : Data Cleaning Walkthrough: Combining the Data

Data sets & their dimensions: [image]

Step 12. Performing the Left Joins

Doubt : It's mentioned in the lesson that, as the data sets ap_2010 & graduation have MANY missing DBN values, we use a LEFT JOIN and merge the sat_results data set with them.

I get the concept of using a LEFT JOIN.
But there seem to be no missing DBN values in the data sets ap_2010 & graduation.

I cross-checked it using the below code:
data["ap_2010"]["DBN"].isnull().sum()
data["graduation"]["DBN"].isnull().sum()

And both of them return 0, indicating that there are no missing DBN values in either data set.

So, when that's the case, can we pick any of the above-mentioned data sets and apply the LEFT join?
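One possible reading (just a guess on my part): "missing DBN values" may mean DBNs that appear in sat_results but are absent from ap_2010 or graduation, rather than NaNs inside the DBN column itself. That can be checked with set logic (made-up DBNs for illustration):

```python
import pandas as pd

# hypothetical stand-ins for the mission's data
sat = pd.DataFrame({'DBN': ['01M292', '01M448', '01M450']})
ap = pd.DataFrame({'DBN': ['01M292', '01M448']})

# DBNs present in sat_results but missing from ap_2010
missing_from_ap = set(sat['DBN']) - set(ap['DBN'])
print(missing_from_ap)  # {'01M450'}
```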

Step 13. Performing the Inner Joins

Doubt : Again, it's mentioned that, as the data sets class_size, demographics, survey, and hs_directory have FEWER missing DBN values, we use an INNER JOIN.

Even in this case, there are no missing DBN values in the specified data sets.

I cross-checked it too, using the below code:
data["class_size"]["DBN"].isnull().sum()
data["demographics"]["DBN"].isnull().sum()
data["survey"]["DBN"].isnull().sum()
data["hs_directory"]["DBN"].isnull().sum()

And each of them returns 0, indicating that there are no missing DBN values in these data sets.

So, what's up with the strategy of using an INNER JOIN here?

Am I missing the actual meaning of the statement "missing DBN values in a data set"??

Is the number of missing values in the common key column (used to merge two data sets, DBN in this case) a deciding factor in finalizing the JOIN type??

I mean, is it like :

i. If the data sets have MORE missing values in the common key column (DBN), we use a LEFT JOIN on those data sets.

ii. If the data sets have FEWER missing values in the common key column (DBN), we use an INNER JOIN on those data sets.
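If that rule holds, the deciding quantity would be how many of sat_results' DBNs are absent from the other data set, which isin can count (a sketch with made-up DBNs):

```python
import pandas as pd

sat = pd.DataFrame({'DBN': ['01M292', '01M448', '01M450', '02M047']})
ap = pd.DataFrame({'DBN': ['01M292', '01M448']})                 # many absent
class_size = pd.DataFrame({'DBN': ['01M292', '01M448',
                                   '01M450', '02M047']})         # none absent

# count sat_results DBNs with no counterpart in each data set
absent_ap = (~sat['DBN'].isin(ap['DBN'])).sum()          # 2 -> left join is safer
absent_cs = (~sat['DBN'].isin(class_size['DBN'])).sum()  # 0 -> inner join loses nothing
print(absent_ap, absent_cs)
```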

Is it so??

Am I even thinking along the right lines, or am I headed in a completely wrong direction??

Please guide!!