Equals (=) vs shallow copy vs deep copy in Pandas DataFrames

In this article, I’m going to show you 3 ways of copying DataFrames in Pandas, explain how they differ and tell you when you should use each.

Equals (df1 = df)

When we assign a DataFrame to a new variable using =, we are not creating a copy of the DataFrame. We are merely giving the same object a second name. The example below illustrates this: df1 is df evaluates to True, and id() returns the same value for both names. So whether you modify the data through df or df1, you are changing the same object.

import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})

df1 = df

df1 is df
Out[4]: True

df1['A'] = [4, 5, 6]

df
Out[6]: 
   A  B
0  4  4
1  5  5
2  6  6

id(df1)
Out[7]: 140602502014768

id(df)
Out[8]: 140602502014768

When to use = to assign a new DataFrame?

  • If you think those colorful SettingWithCopyWarning messages will make your Jupyter notebook look beautiful. 😜
  • If you plan to modify the existing DataFrame, either keep working with the original variable or, if you want a separate object that still shares its data, use a shallow copy.
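
Passing a DataFrame to a function binds a new name to the same object, exactly as = does, and this is a common place for the behaviour to cause surprises. A minimal sketch, where add_total is a hypothetical helper written only for this illustration:

```python
import pandas as pd


def add_total(frame):
    # `frame` is just another name for the caller's DataFrame,
    # so adding a column here changes the caller's object too.
    frame["total"] = frame["A"] + frame["B"]
    return frame


df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
result = add_total(df)

print(result is df)        # both names refer to one object
print(list(df.columns))    # the caller's DataFrame gained a 'total' column
```

If the caller's DataFrame should stay untouched, the function needs to copy it before modifying it.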

Shallow Copy [DataFrame.copy(deep=False)]

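When we create a shallow copy with DataFrame.copy(deep=False), we get a new DataFrame object whose data and index are shared with the original instead of being duplicated. So there are two separate objects, but both point at the same underlying data. In the pandas version used for this article, writing to existing data through the shallow copy therefore also changes the original DataFrame, just as with =. This behaviour is version-dependent: with Copy-on-Write enabled (the default from pandas 3.0), a shallow copy still avoids an eager copy, but changes made through it are no longer reflected in the original.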

import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})

shallow_copy = df.copy(deep=False)

shallow_copy is df
Out[4]: False

shallow_copy['A'] = [4, 5, 6]

df
Out[6]: 
   A  B
0  4  4
1  5  5
2  6  6

id(shallow_copy)
Out[7]: 140361681612224

id(df)
Out[8]: 140361681847776
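
One way to see what "shared data" means is to ask NumPy whether the two DataFrames are backed by the same memory. A small sketch, assuming an all-integer DataFrame like the one above so that to_numpy() can hand back a view instead of a copy (shares_buffer is a name made up for this example):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
shallow_copy = df.copy(deep=False)
deep_copy = df.copy(deep=True)

# The shallow copy is a distinct DataFrame object...
print(shallow_copy is df)

# ...but its column 'A' is backed by the same memory as the original's.
shares_buffer = np.shares_memory(df["A"].to_numpy(), shallow_copy["A"].to_numpy())
print(shares_buffer)

# A deep copy, by contrast, owns its own memory.
deep_shares_buffer = np.shares_memory(df["A"].to_numpy(), deep_copy["A"].to_numpy())
print(deep_shares_buffer)
```

The id() values differ because they identify the DataFrame wrapper objects, not the arrays underneath them.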

When to use shallow copy?

Use a shallow copy when you want updates to be shared between DataFrames, entirely or partially. If you create a new column on the shallow copy, it will not appear in the original DataFrame. Updates are shared only for data that already exists in the original DataFrame. Here is an example.

import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})

shallow_copy = df.copy(deep=False)

shallow_copy['C'] = [9, 9, 9]

shallow_copy
Out[5]: 
   A  B  C
0  1  4  9
1  2  5  9
2  3  6  9

df
Out[6]: 
   A  B
0  1  4
1  2  5
2  3  6

shallow_copy['A'] = [0, 0, 0]

df
Out[8]: 
   A  B
0  0  4
1  0  5
2  0  6
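
Whether a write through the shallow copy reaches the original depends on the pandas version and on whether Copy-on-Write is enabled, so it is worth checking on your own installation instead of relying only on the output above. A sketch of such a check (the variable name propagated is made up for this example):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
shallow_copy = df.copy(deep=False)

# Overwrite one existing value through the shallow copy.
shallow_copy.loc[0, "A"] = 100

# True  -> the original saw the write (data still shared on write)
# False -> the original is untouched (for example, Copy-on-Write is active)
propagated = bool(df.loc[0, "A"] == 100)

print(pd.__version__, propagated)
```

If this prints False on your setup, a shallow copy cannot be used to share updates, and you should keep a single DataFrame object instead.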

Deep Copy [DataFrame.copy(deep=True)]

When we create a copy of the DataFrame using deep copy (DataFrame.copy(deep=True)), we are creating a copy that has its own data and index. So any modifications to the new DataFrame will not modify the original DataFrame.

import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})

deep_copy = df.copy(deep=True)

deep_copy is df
Out[4]: False

deep_copy['A'] = [4, 5, 6]

df
Out[6]: 
   A  B
0  1  4
1  2  5
2  3  6

id(deep_copy)
Out[7]: 140582033517440

id(df)
Out[8]: 140582033753904

When to use deep copy?

You should use deep copy when you don’t want your changes to affect the original DataFrame.
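
A typical case is a helper that derives new columns but must leave its input alone. The sketch below uses a hypothetical function name, with_total, and the same columns as the earlier examples:

```python
import pandas as pd


def with_total(frame):
    # Work on an independent copy so the caller's DataFrame is never modified.
    out = frame.copy(deep=True)
    out["total"] = out["A"] + out["B"]
    out.loc[0, "A"] = 100
    return out


df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
result = with_total(df)

print(list(df.columns))   # the original keeps only its own columns
print(df.loc[0, "A"])     # the original value is unchanged
print(result)
```

Since deep=True is the default, frame.copy() does the same thing. One caveat from the pandas documentation: a deep copy duplicates the data, but Python objects stored inside it (for example lists in an object column) are not copied recursively.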
