Concatenating issue

Hi,

I have way to many lines when concatenating the two datasets together. Note that I’m using the original dataset, not the one provided by Dataquest.

Here is what I’m doing :

dete_resignations['institute'] = "DETE"
tafe_resignations['institute'] = "TAFE"

print("The DETE dataset has {rows} rows and {columns} columns.".format(rows = dete_resignations.shape[0], columns = dete_resignations.shape[1]))
print('\n')
print("The TAFE dataset has {rows} rows and {columns} columns.".format(rows = tafe_resignations.shape[0], columns = tafe_resignations.shape[1]))
print('\n')

###Combining :
combined = pd.concat([dete_resignations, tafe_resignations], axis = 1, ignore_index= True)

print("After combining the two datasets together, the new combined dataset has {rows} rows and {columns} columns.".format(rows = combined.shape[0], columns = combined.shape[1]))

Here is the output :

The DETE dataset has 311 rows and 38 columns.

The TAFE dataset has 340 rows and 24 columns.

After combining the two datasets together, the new combined dataset has 512 rows and 62 columns.

The number of columns is as expected, but I don’t get why the number of rows is almost doubled (and does not even match the total of rows).

Thanks for your lights !

1 Like

hi @nicolas_mtl

Are the indexes of both the data frames in the correct order, as in, row 1 = 0 index and row n = n-1 index?

The "ignore_index" here will ignore the indexes of the columns as the axis value has been set to 1. The concatenation will still consider the index values of the row axis.

I was able to replicate the issue like this:

# create data frames 
df1 = pd.DataFrame({"col11" : [10, 20, 30, 40], "col12" : ["a", "b", "c", "d"]})
df2 = pd.DataFrame({"col21" : [100, 200, 300, 400], "col22" : ["aa", "bb", "cc", "dd"]})

# change row index for df2 to re-produce mis-match of row axis
df2.index = [0, 1, 4, 6]

# concat the two dataframes
df = pd.concat([df1, df2], axis = 1, ignore_index = False) 

print("Shape after concatenation:", df.shape)

image

And if I don’t re-index, then the shape matches.

# create data frames 
df1 = pd.DataFrame({"col11" : [10, 20, 30, 40], "col12" : ["a", "b", "c", "d"]})
df2 = pd.DataFrame({"col21" : [100, 200, 300, 400], "col22" : ["aa", "bb", "cc", "dd"]})

# change row index for df2 to re-produce mis-match of row axis
# df2.index = [0, 1, 4, 6]

# concat the two dataframes
df = pd.concat([df1, df2], axis = 1, ignore_index = False) 

print("Shape after concatenation:", df.shape)

image

Did you reset the index after filtering out records with only resignations?

4 Likes

Hi @Rucha,

Thanks for your answer. I don’t see any differences between your two codes ?

However, yes, resetting the index solved the issue with using :

dete_resignations = dete_resignations.reset_index()
tafe_resignations = tafe_resignations.reset_index()

Thanks for the help !

1 Like

hi @nicolas_mtl
this code is commented in the second one so no change in index for df2.

2 Likes