# Why the order matters when combining the datasets?

Hello everyone,

I’m on the second lesson of Data Cleaning Project Walkthrough where we have to implement different techniques to combine the datasets of NYC high schools.

Here’s the reference link where we have to merge multiple datasets with `combined` dataset.

In the instructions we emphasize keeping the following order while merging the datasets:

• First, we merge the `class_size` dataset with the `combined` dataset
• Then `demographics` with `combined`
• Then `survey` with `combined`
• And in the end, it is `hs_directory` with `combined`

I need to grasp the basic idea behind this logic.
Thank you!

Here is the screenshot referring to my question:

In the database, there are different tables. Each table has its own set of data.

We ask ourselves some question about “class size survey demographics”. In order to answer these question, you need to figure out what tables are required.

Working backwards, we figured out the main data table required, that is `combined`.

The final data needs additional data from

• `class_size`,
• `demographics`,
• `survey`
• `hs_directory`

Using `LEFT JOIN` because we want everything from `combined` data set. (Or `RIGHT JOIN` is also possible but you need to change the syntax).

Description below does not have the correct syntax but are used to explain the logic behind the merges.

A = (`combined LEFT JOIN class_size`)

Using `LEFT JOIN` because we want everything from `combined` table that may contains matched row from `class_size` or non-matched rows as `NULL`.

B = ( `A LEFT JOIN demographics` )

Keep building from `A`. Similar logic, because we want everything from previously build table `A` that may contains matched row from `demographics` or non-matched rows as `NULL`.

C = (`B LEFT JOIN survey` )

Keep building from `B`. Similar logic, because we want everything from previously build table `B` that may contains matched row from `survey` or non-matched rows as `NULL`.

D = (`C LEFT JOIN hs_directory` )

Keep building from `C`. Similar logic, because we want everything from previously build table `C` that may contains matched row from `hs_directory` or non-matched rows as `NULL`.

Why order matters?

The first merge matters the most because we determined `combined` table to build our data from. That is (`combined` `LEFT JOIN` `other_table`). `other_table` can be any table, but need to satisfy the `JOIN ON` condition where both tables performing the join has similar column(s).

The first merge matters the most because we determined `combined` table to build our data from.

The final results contains exact the number rows in `combined` table.

Thank you for the detailed explanation. My concern was, what if I change the order of the tables from (`class_size`, `demographics`, `survey`, `hs_directory`) to (`hs_directory`, `survey`, `class_size`, `demographics`) to merge with the `combined`, would that change the final result? And you’ve talked about it at the end of your explanation which clears my confusion that is (`combined` `LEFT JOIN` `can be any table`).

Thank you!

1 Like