Regarding the merging of the data sets, I would like to know whether there is a logic to performing those joins in a specific order.
For example, in the 137-13 screen, we first perform inner joins on class_size, followed by
A very quick inspection of each of their shapes gives the following results:
It seems to me that the number of columns or rows does not account for anything when performing the inner joins. Same situation with the
How should we prioritise the data sets when we do a merge?
Hi again Vallentin,
On that screen, you can put print(combined.shape) after each merge to observe how the shape of the combined dataset changes. Each time, the number of columns will grow (logically), while the number of rows will most probably decrease. Since we’re talking about inner joins here, the new number of rows in the combined dataset will not reflect the number of rows in either of the 2 datasets being merged. Instead, provided each key appears only once per dataset, it will be equal to the number of keys that are the same in both datasets. In our case, the keys are in the column DBN. Hence, with each merge, we’re finding the intersection of the keys (DBN) in the two datasets being merged.
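To make the point about row counts concrete, here is a minimal sketch on toy data. Only the DBN column and the class_size name come from the screen; sat_results, the other column names, and all the values are made up for illustration, and each DBN is assumed to appear only once per dataframe.

```python
import pandas as pd

# Hypothetical toy data: only the DBN key column mirrors the screen.
class_size = pd.DataFrame({
    "DBN": ["01M001", "01M002", "01M003", "01M004"],
    "avg_class_size": [22.0, 25.5, 19.0, 27.0],
})
sat_results = pd.DataFrame({
    "DBN": ["01M002", "01M003", "01M005"],
    "sat_score": [1200, 1350, 1100],
})

# Inner join: keep only the rows whose DBN exists in both dataframes.
combined = class_size.merge(sat_results, on="DBN", how="inner")
print(combined.shape)  # (2, 3)

# With unique keys, the row count equals the size of the key intersection.
shared_keys = set(class_size["DBN"]) & set(sat_results["DBN"])
print(len(shared_keys))  # 2
```

Neither input has 2 rows (they have 4 and 3), yet the result does, because only 2 DBN values are shared.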
The order of the merges, however, only influences the order of the columns in the final dataset, right?
The number of rows will be the same for every possible order of merges.
So why do the instructions tell us to be sure to follow the exact order? Is this purely for answer-checking purposes?
Not exactly. While it’s true that the order of merges determines the order of the columns in the final dataset, it also determines something else, and it all depends on the how parameter. This parameter can take the following values: ‘left’, ‘right’, ‘inner’, ‘outer’, or ‘cross’. So if, for example, we have 2 dataframes to be merged and we assign to how the value of ‘left’, it will take all the keys (meaning that it will preserve all the rows) from the left dataframe. If, on the same 2 dataframes, we assign ‘right’ to the how parameter, the keys will be preserved from the right dataframe, which can mean a different number of rows.
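As a rough illustration of how the how parameter changes the number of rows, here is a small sketch. The dataframes left_df and right_df and their contents are hypothetical, not taken from the screen.

```python
import pandas as pd

# Hypothetical dataframes that share only the keys "B" and "C".
left_df = pd.DataFrame({"DBN": ["A", "B", "C"], "x": [1, 2, 3]})
right_df = pd.DataFrame({"DBN": ["B", "C", "D", "E"], "y": [10, 20, 30, 40]})

# Count the rows produced by each join type on the same pair of dataframes.
row_counts = {
    how: len(left_df.merge(right_df, on="DBN", how=how))
    for how in ["left", "right", "inner", "outer"]
}
print(row_counts)  # {'left': 3, 'right': 4, 'inner': 2, 'outer': 5}
```

With ‘left’ we keep the 3 keys of left_df, with ‘right’ the 4 keys of right_df, so swapping which dataframe is on the left changes the result.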
Well, in the task from that screen, we actually need the ‘inner’ merge for each pair of dataframes, so in this particular case, the order of merging really was only a matter of answer checking. However, for real-world tasks, if you have to implement left or right merges, the order matters.
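Here is a quick sketch comparing a chain of merges done in two different orders. The dataframes a, b, c and the helper merge_chain are hypothetical names invented for this example; the keys are assumed to be unique within each dataframe.

```python
from functools import reduce

import pandas as pd

# Hypothetical dataframes with partially overlapping DBN keys.
a = pd.DataFrame({"DBN": ["A", "B", "C"], "a_val": [1, 2, 3]})
b = pd.DataFrame({"DBN": ["B", "C", "D"], "b_val": [4, 5, 6]})
c = pd.DataFrame({"DBN": ["C", "D", "E"], "c_val": [7, 8, 9]})


def merge_chain(frames, how):
    """Merge the dataframes one after another on DBN, left to right."""
    return reduce(lambda acc, df: acc.merge(df, on="DBN", how=how), frames)


inner_abc = merge_chain([a, b, c], "inner")
inner_cba = merge_chain([c, b, a], "inner")
left_abc = merge_chain([a, b, c], "left")
left_cba = merge_chain([c, b, a], "left")

# Inner joins: same keys in either order, only the column order differs.
print(sorted(inner_abc["DBN"]), sorted(inner_cba["DBN"]))  # ['C'] ['C']
print(list(inner_abc.columns))  # ['DBN', 'a_val', 'b_val', 'c_val']
print(list(inner_cba.columns))  # ['DBN', 'c_val', 'b_val', 'a_val']

# Left joins: the first dataframe decides which keys survive.
print(sorted(left_abc["DBN"]))  # ['A', 'B', 'C']
print(sorted(left_cba["DBN"]))  # ['C', 'D', 'E']
```

So with inner joins, only the column order depends on the merge order, while with left joins the set of rows itself changes.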
Hi @Elena_Kosourova,
I was indeed referring to this particular case where only inner joins are used.
But thanks for the thorough explanation!