Hey Alvin,
Thanx for the in depth explanation.
Sorry to bother you again.
To be precise my doubt was actually about the choice of data sets used to perform the mentioned MERGE strategies, rather than about MERGE
& JOINS
.
Sorry for the unclear question.
Will try to express my doubt more clearly this time!!
Am stuck in the below exercise :
Link :
Learn data science with Python and R projects
Mission : Data Cleaning Walkthrough: Combining the Data
Data sets & their dimensions:

Step 12. Performing the Left Joins
Doubt :
Its mentioned in the lesson that - As the data sets ap_2010
& graduation
have MANY
missing DBN
values, so we use LEFT JOIN
, & merge the sat_results
data set with them.
I get the concept of using LEFT JOIN
.
But there seem no Missing DBN
values in the data sets ap_2010
& graduation
.
I cross checked it using the below code :
data["ap_2010]["DBN"].duplicated().sum()
data["graduation"]["DBN"].duplicated().sum()
And both of them return 0
- indicating that there’re no Missing DBN
values in both the data sets.
So, when that’s the case - can we use any data set randomly from the above mentioned datasets & apply the LEFT
Join?
Step 13. Performing the Inner Joins
Doubt :
Again, its mentioned that - As the data sets class_size, demographics, survey, hs_directory
have FEWER
Missing DBN
values, we use INNER JOIN
.
Even in this case there are no Missing DBN
values in the specified data sets.
I cross checked it too using the below code :
data["class_size"]["DBN"].duplicated().sum()
data["demographics"]["DBN"].duplicated().sum()
data["survey"]["DBN"].duplicated().sum()
data["hs_directory"]["DBN"].duplicated().sum()
And each of them return 0
- indicating that there’re no Missing DBN
values in the data sets.
So, whats up with the strategy of using INNER JOIN
again?
Am I missing the actual meaning of the statement Missing "DBN" values in a data set
??
Are the Number of Missing Values
in the common key column (used to MERGE 2 data sets - DBN
in this case) - a deciding factor in finalizing the JOIN
type??
I mean, is it like :
i. If the data sets have MORE
Missing Values in the common key column (DBN)
- we use LEFT JOIN
on those data sets.
ii. If the data sets have FEWER
Missing Values in the common key column (DBN)
- we use INNER JOIN
on those data sets.
Is it so??
Am I even thinking right or am I in a completely wrong direction??
Please guide!!