Why the order matters when combining the datasets?

Hello everyone,

I’m on the second lesson of Data Cleaning Project Walkthrough where we have to implement different techniques to combine the datasets of NYC high schools.

Here’s the reference link where we have to merge multiple datasets with combined dataset.

In the instructions we emphasize keeping the following order while merging the datasets:

  • First, we merge the class_size dataset with the combined dataset
  • Then demographics with combined
  • Then survey with combined
  • And in the end, it is hs_directory with combined

I need to grasp the basic idea behind this logic.
I’ll appreciate your time.
Thank you!

Here is the screenshot referring to my question:

In the database, there are different tables. Each table has its own set of data.

We ask ourselves some question about “class size survey demographics”. In order to answer these question, you need to figure out what tables are required.

Working backwards, we figured out the main data table required, that is combined.

The final data needs additional data from

  • class_size,
  • demographics,
  • survey
  • hs_directory

Using LEFT JOIN because we want everything from combined data set. (Or RIGHT JOIN is also possible but you need to change the syntax).

Description below does not have the correct syntax but are used to explain the logic behind the merges.

A = (combined LEFT JOIN class_size)

Using LEFT JOIN because we want everything from combined table that may contains matched row from class_size or non-matched rows as NULL.

B = ( A LEFT JOIN demographics )

Keep building from A. Similar logic, because we want everything from previously build table A that may contains matched row from demographics or non-matched rows as NULL.

C = (B LEFT JOIN survey )

Keep building from B. Similar logic, because we want everything from previously build table B that may contains matched row from survey or non-matched rows as NULL.

D = (C LEFT JOIN hs_directory )

Keep building from C. Similar logic, because we want everything from previously build table C that may contains matched row from hs_directory or non-matched rows as NULL.

Why order matters?

The first merge matters the most because we determined combined table to build our data from. That is (combined LEFT JOIN other_table). other_table can be any table, but need to satisfy the JOIN ON condition where both tables performing the join has similar column(s).

The first merge matters the most because we determined combined table to build our data from.

The final results contains exact the number rows in combined table.

Thank you for the detailed explanation. My concern was, what if I change the order of the tables from (class_size, demographics, survey, hs_directory) to (hs_directory, survey, class_size, demographics) to merge with the combined, would that change the final result? And you’ve talked about it at the end of your explanation which clears my confusion that is (combined LEFT JOIN can be any table).

Thank you!

1 Like