Performing the Joins

Regarding the merges of the data sets, I would like to know whether there is a logic behind performing the joins in a specific order.

For example, in the 137-13 screen, we first perform inner joins on class_size, followed by demographics, survey and hs_directory.

A quick inspection of the shape of each of these gives the following results:

(583, 8)

(1509, 38)

(1702, 23)

(435, 67)

It seems to me that neither the number of columns nor the number of rows is taken into account when performing the inner joins. The same goes for the left joins.

How should we prioritise the data sets when we do a merge?

Hi again Vallentin,

On that screen, you can put print(combined.shape) after each merge to observe how the shape of the combined dataset changes. Each time, the number of columns will grow (logically), while the number of rows will most probably decrease. Since we’re talking about inner joins here, the new number of rows in the combined dataset will not reflect the number of rows in either of the 2 datasets being merged. Instead, provided that each key appears only once in each dataset, it will be equal to the number of keys that are present in both datasets. In our case, the keys are in the column DBN. Hence, with each merge, we’re finding the intersection of the keys (DBN) of the 2 datasets being merged.
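To make the intersection idea concrete, here is a minimal sketch. The three small dataframes and their values are made up for illustration (they only borrow the names used in the course), and DBN is assumed to be unique within each of them.

```python
import pandas as pd

# Hypothetical toy data, NOT the real course datasets; DBN is the merge key.
sat_results = pd.DataFrame({"DBN": ["A", "B", "C", "D"],
                            "sat_score": [1200, 1100, 1300, 1250]})
class_size = pd.DataFrame({"DBN": ["A", "B", "C", "E"],
                           "avg_class_size": [25, 28, 22, 30]})
demographics = pd.DataFrame({"DBN": ["A", "B", "F"],
                             "total_enrollment": [400, 650, 500]})

combined = sat_results
for name, df in [("class_size", class_size), ("demographics", demographics)]:
    # An inner merge keeps only the DBN values present in both dataframes.
    combined = combined.merge(df, on="DBN", how="inner")
    print(name, combined.shape)
```

With this toy data, the row count drops from 4 to 3 to 2 while the column count grows from 2 to 3 to 4: only the DBN values A and B appear in all three dataframes.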

Hi @Elena_Kosourova,

The order of the merges, however, only influences the order of the columns in the final dataset, right?

After all, the number of rows will be the same whatever the order of the merges.
So why does the instruction tell us to be sure to follow the exact order given? Is this purely for answer-checking purposes?

Hi Léon,

Is this purely for answer checking purposes?

Not exactly. While it’s true that the order of merges determines the order of the columns in the final dataset, it also determines something else, and it all depends on the how parameter. This parameter can take the following values: ‘left’, ‘right’, ‘inner’, ‘outer’, or ‘cross’. So if, for example, we have 2 dataframes to be merged and we assign to how the value of ‘left’, the merge will take all the keys (meaning that it will preserve all the rows) from the left dataframe. If, on the same 2 dataframes, we assign ‘right’ to the how parameter, the keys will be preserved from the right dataframe, and this can give a different number of rows.

Well, in the task from that screen, we actually need an ‘inner’ merge for each pair of dataframes, so in this particular case, the order of merging really was only a matter of answer checking :slightly_smiling_face: However, for real-world tasks, if you have to implement left or right merges, the order matters.
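The difference described above can be checked on a small example. This is only a sketch: the two dataframes, their names (left_df, right_df) and their values are hypothetical, and DBN is assumed to be unique in each.

```python
import pandas as pd

# Hypothetical toy data; DBN is the merge key and is unique in each dataframe.
left_df = pd.DataFrame({"DBN": ["A", "B", "C"], "x": [1, 2, 3]})
right_df = pd.DataFrame({"DBN": ["B", "C", "D", "E"], "y": [20, 30, 40, 50]})

# 'left' preserves every key of left_df, 'right' every key of right_df.
left_join = left_df.merge(right_df, on="DBN", how="left")
right_join = left_df.merge(right_df, on="DBN", how="right")

# For 'inner', swapping the two dataframes changes only the column order.
inner_ab = left_df.merge(right_df, on="DBN", how="inner")
inner_ba = right_df.merge(left_df, on="DBN", how="inner")

print(left_join.shape, right_join.shape)
print(inner_ab.shape, inner_ba.shape)
print(list(inner_ab.columns), list(inner_ba.columns))
```

Here the left merge keeps 3 rows and the right merge keeps 4, whereas both inner merges keep the same 2 rows (B and C) and differ only in whether x or y comes first.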

Hi @Elena_Kosourova ,

I was indeed referring to this particular case, where only inner joins are used.
But thanks for the thorough explanation!