
Performing the Joins

Regarding the merges of the data sets, I am interested to know whether there is a logic to performing those joins in a specific order.

For example, in the 137-13 screen, we first perform inner joins on class_size, followed by demographics, survey and hs_directory.

On a very quick inspection of the shape of each of these, we get the following results:

data["class_size"].shape
(583, 8)

data["demographics"].shape
(1509, 38)

data["survey"].shape
(1702, 23)

data["hs_directory"].shape
(435, 67)

It seems to me that the number of columns or rows does not account for anything when performing the inner joins. The same goes for the left joins.

How should we prioritise the data sets when we do a merge?


Hi again Vallentin,

On that screen, you can put print(combined.shape) after each merge to observe how the shape of the combined dataset changes. Each time, the number of columns will grow (logically), while the number of rows will most probably decrease. Since we’re talking about inner joins here, the new number of rows in the combined dataset will not reflect the number of rows in either of the two datasets being merged. Instead, as long as each DBN appears only once per dataset, it will equal the number of keys that are present in both datasets. In our case, the keys are in the DBN column. Hence, with each merge, we’re finding the intersection of the keys (DBN) in the two datasets being merged. Because an intersection doesn’t depend on order, the order of the inner joins doesn’t change which rows end up in the final result.
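To make this concrete, here is a minimal sketch with toy data rather than the real datasets. The DBN values, the column names other than DBN, and the helper inner_join_all are all made up for illustration, and each DBN is assumed to appear only once per dataset. It prints the shape after each inner join and tries two different join orders.

```python
import pandas as pd

# Toy stand-ins for the real datasets (hypothetical values and columns).
# Each DBN is assumed to be unique within its DataFrame.
data = {
    "class_size": pd.DataFrame(
        {"DBN": ["A", "B", "C", "D"], "avg_class_size": [25, 30, 22, 28]}
    ),
    "demographics": pd.DataFrame(
        {"DBN": ["B", "C", "D", "E"], "total_enrollment": [400, 520, 310, 610]}
    ),
    "survey": pd.DataFrame(
        {"DBN": ["C", "D", "E", "F"], "saf_s_11": [7.1, 6.8, 7.5, 6.2]}
    ),
}


def inner_join_all(order):
    """Inner-join the named datasets on DBN, printing the shape at each step."""
    combined = data[order[0]]
    print(order[0], combined.shape)
    for name in order[1:]:
        combined = combined.merge(data[name], on="DBN", how="inner")
        print("after merging", name, combined.shape)
    return combined


first = inner_join_all(["class_size", "demographics", "survey"])
second = inner_join_all(["survey", "class_size", "demographics"])

# The DBN values that survive are the ones present in every dataset.
common = (
    set(data["class_size"]["DBN"])
    & set(data["demographics"]["DBN"])
    & set(data["survey"]["DBN"])
)
print(sorted(common), sorted(first["DBN"]), sorted(second["DBN"]))
```

Both orders end up with the same DBN values, only the column order differs. With how="left", by contrast, every row of the left-hand DataFrame is kept, so which dataset comes first does matter there.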
