Use `.copy()` or not?

The lesson Working with Missing Data, 8. Filling Unknown Values with a Placeholder shows the following code:

# create a mask for each column
v_missing_mask = mvc['vehicle_1'].isnull() & mvc['cause_vehicle_1'].notnull()
c_missing_mask = mvc['cause_vehicle_1'].isnull() & mvc['vehicle_1'].notnull()

# replace the values matching the mask for each column
mvc['vehicle_1'] = mvc['vehicle_1'].mask(v_missing_mask, "Unspecified")
mvc['cause_vehicle_1'] = mvc['cause_vehicle_1'].mask(c_missing_mask, "Unspecified")

Would it be better to use .copy(), so that the code becomes the following?

# create a mask for each column
v_missing_mask = (mvc['vehicle_1'].isnull() & mvc['cause_vehicle_1'].notnull()).copy()
c_missing_mask = (mvc['cause_vehicle_1'].isnull() & mvc['vehicle_1'].notnull()).copy()

# replace the values matching the mask for each column
mvc['vehicle_1'] = (mvc['vehicle_1'].mask(v_missing_mask, "Unspecified")).copy()
mvc['cause_vehicle_1'] = (mvc['cause_vehicle_1'].mask(c_missing_mask, "Unspecified")).copy()

Thanks

You should use DataFrame.copy/Series.copy when you want to make sure you don’t modify the original object.

Here you actually want to modify the original object, so it’s not useful, but it causes no harm either.
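
To make that concrete, here is a minimal sketch on a made-up stand-in for mvc (the rows below are hypothetical, not the lesson's dataset). Series.mask already returns a new Series and leaves its input untouched, so the original only changes when the result is assigned back; an extra .copy() adds nothing.

```python
import pandas as pd

# Hypothetical stand-in for the lesson's mvc DataFrame
mvc = pd.DataFrame({
    "vehicle_1": ["Sedan", None, "Taxi", None],
    "cause_vehicle_1": ["Speeding", "Fatigue", None, None],
})

# True where the vehicle is missing but the cause is known
v_missing_mask = mvc["vehicle_1"].isnull() & mvc["cause_vehicle_1"].notnull()

# mask() returns a new Series; mvc itself is not modified by this call
filled = mvc["vehicle_1"].mask(v_missing_mask, "Unspecified")
missing_before_assign = mvc["vehicle_1"].isnull().sum()

# Assigning the result back is what actually updates mvc
mvc["vehicle_1"] = filled
missing_after_assign = mvc["vehicle_1"].isnull().sum()

print(missing_before_assign, missing_after_assign)
```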

Hi @Bruno,

I wanted to ask a follow-up question on this discussion. I’ve been reading the Dataquest article, SettingwithCopyWarning: How to Fix This Warning in Pandas, and am trying to better understand their example of Common issue #2: Hidden chaining.

Let’s say that I wanted to create a new DataFrame based on some transformation of an existing DataFrame, but I also wanted to preserve the old DataFrame without modifying it. Should I apply copy() before or after the transformation? For example, let’s say df1 is my original DataFrame.

import pandas as pd

linear = list(range(-3, 4))
cubic = [x ** 3 for x in linear]
quintic = [x ** 5 for x in linear]

df1 = pd.DataFrame({'lin': linear, 'cub': cubic, 'quin': quintic})
print(df1)
   lin  cub  quin
0   -3  -27  -243
1   -2   -8   -32
2   -1   -1    -1
3    0    0     0
4    1    1     1
5    2    8    32
6    3   27   243

Now, I’d like to create a new DataFrame from a portion of df1 so I can transform it later without running the risk of modifying df1. One option is to copy df1 before slicing it…

df2 = df1.copy().loc[df1.lin == df1.quin]
print(df2)
   lin  cub  quin
2   -1   -1    -1
3    0    0     0
4    1    1     1

…so that when I modify df2

df2.loc[3, 'cub'] = "Hello"
print(df2)
   lin    cub  quin
2   -1     -1    -1
3    0  Hello     0
4    1      1     1

df1 stays the same.

print(df1)
   lin  cub  quin
0   -3  -27  -243
1   -2   -8   -32
2   -1   -1    -1
3    0    0     0
4    1    1     1
5    2    8    32
6    3   27   243

Now, let’s say I create a different DataFrame called df3 from the same portion of df1, but this time I apply copy() after slicing it.

df3 = df1.loc[df1.lin == df1.quin].copy()
print(df3)
   lin  cub  quin
2   -1   -1    -1
3    0    0     0
4    1    1     1

The result looks the same as df2, so, now, I want to see if modifying df3 will affect df1.

df3.loc[3, 'cub'] = 'Greetings'
print(df3)
   lin        cub  quin
2   -1         -1    -1
3    0  Greetings     0
4    1          1     1
print(df1)
   lin  cub  quin
0   -3  -27  -243
1   -2   -8   -32
2   -1   -1    -1
3    0    0     0
4    1    1     1
5    2    8    32
6    3   27   243

It doesn’t seem to matter whether I apply copy() before or after the transformation, even though, logically, it seems you’d want to apply it first, as when creating df2. Is that right?

I’m going to try one last experiment in which I do not use copy() to see if it modifies the original DataFrame.

df4 = df1.loc[df1.lin == df1.quin]
print(df4)
   lin  cub  quin
2   -1   -1    -1
3    0    0     0
4    1    1     1
df4.loc[3, 'cub'] = "Whassup!"
print(df4)
   lin       cub  quin
2   -1        -1    -1
3    0  Whassup!     0
4    1         1     1
/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py:543: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s

I got the SettingWithCopyWarning, which I expected, but nothing seemed to happen to the original DataFrame (this is also expected since it’s ambiguous whether get operations return views or copies).

print(df1)
   lin  cub  quin
0   -3  -27  -243
1   -2   -8   -32
2   -1   -1    -1
3    0    0     0
4    1    1     1
5    2    8    32
6    3   27   243

So, it’s really unclear to me when it’s necessary to use copy() and whether you should use it before a transformation as in df2 or after the transformation as in df3…or am I missing something essential here?

Thanks.

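I think the only thing you’re missing is that this behavior isn’t predictable from the code alone: whether an indexing operation returns a view or a copy depends on pandas internals. Sometimes it may lead to issues, sometimes it won’t.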

You can be prophylactic and use pandas.DataFrame.copy.

Regarding the order, you should do it before modifying it, I would say. It’s possible that under the hood pandas protects us from such issues in that specific case, but there isn’t much to gain from not copying the dataframe.
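
A minimal sketch of the two orderings from the question, rebuilt on the same df1 (the names keep and original are only for illustration, and integers are assigned instead of strings so the column dtype stays the same). In both cases the copy exists before anything is modified, so df1 is left alone either way. The practical difference is that copying first duplicates every row before filtering, while copying afterwards duplicates only the rows that are kept.

```python
import pandas as pd

values = list(range(-3, 4))
df1 = pd.DataFrame({
    "lin": values,
    "cub": [x ** 3 for x in values],
    "quin": [x ** 5 for x in values],
})
original = df1.copy()  # snapshot used only to check df1 afterwards

keep = df1["lin"] == df1["quin"]  # rows -1, 0 and 1

df2 = df1.copy().loc[keep]  # copy the whole DataFrame, then filter
df3 = df1.loc[keep].copy()  # filter, then copy only the kept rows

# Modify both results
df2.loc[3, "cub"] = 100
df3.loc[3, "cub"] = 200

# df1 still matches the snapshot taken before the modifications
print(df1.equals(original))
```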

Is this approach, of using copy(), often used in the industry as well? With significantly large datasets I would assume that copying them over could be a problem depending on memory available, right?

Update: Thinking it over, it doesn’t seem like it would be too much of an issue in most cases, given the kind of datatypes that are stored (it’s not like storing image data in a pandas dataframe and loading it into memory). But I would still like to understand how it works in industry, if possible.

It would in principle, but there would probably be ways around it. For instance, you may want to only preserve part of the dataframe, not all of it.

And that’s if you even need to copy it all. In my experience, copying is very much an exploratory thing, not a production thing.
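
As a rough illustration of preserving only part of the dataframe, the sketch below copies just the columns that are going to be changed. The DataFrame, its size, and the column names are all made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical wide DataFrame; the shape is chosen only for illustration
rng = np.random.default_rng(0)
big = pd.DataFrame(
    rng.random((100_000, 20)),
    columns=[f"col_{i}" for i in range(20)],
)

# Copy only the two columns that will be modified, not the whole DataFrame
subset = big[["col_0", "col_1"]].copy()

# Bytes held by the column data (index excluded)
full_bytes = big.memory_usage(index=False).sum()
subset_bytes = subset.memory_usage(index=False).sum()

# Changing the copy leaves the original column values in place
subset["col_0"] = 0.0

print(full_bytes, subset_bytes)
```

Here the copy holds 2 of the 20 float columns, so it takes about a tenth of the memory a full copy() would need.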

I hope I am not sidetracking the original discussion now, but do we learn more about the production aspects here in the Dataquest Paths/Courses? If not, would you (or anyone from Dataquest) be able to suggest how to get comfortable with those on our own? If you think this is better as a separate question, let me know.

I’d ask it as a separate question. I won’t have an answer for you, though. I agree that this is something we would ideally tackle.

Asking the question separately will help bring visibility to this need, hence my recommendation above.

Thanks, @Bruno. It makes sense to use copy() if there’s any chance that one wants to preserve the original DataFrame, and that one should do so first. I would recommend SettingwithCopyWarning: How to Fix This Warning in Pandas for anyone who wants to learn more about it.

I don’t mind. I’m glad to see that I initiated some discussion 🙂