When to use inplace vs overwriting the original dataframe?

While cleaning a dataframe for this course I am instructed to reassign the dataframe to itself after dropping a column. In previous courses they have said that assigning a dataframe back to itself without using .copy() can cause issues, so I am wondering when is it best/appropriate to use each: overwriting vs using inplace.

Screen Link:


combined = combined.drop("REGION_x", axis=1)


combined.drop("REGION_x", axis=1, inplace=True)

Hi @anna.strahl,

I think those two are separate issues:

  1. .copy vs no copy
  2. copy and assign vs inplace

I’ll answer the second before the first.

It’s not exactly “overwriting vs using inplace” but rather “create a modified copy and assign it back to the original variable vs modify the data frame directly”. With both approaches combined ends up holding the modified data, but the non-inplace version builds a new data frame first and then rebinds the name to it, while the inplace version changes the existing object.
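To make the difference concrete, here is a minimal sketch. The tiny stand-in for combined and its REGION_y and score columns are made up for illustration; only REGION_x comes from the question.

```python
import pandas as pd

# Hypothetical stand-in for the course's `combined` data frame.
combined = pd.DataFrame(
    {"REGION_x": ["N", "S"], "REGION_y": ["N", "S"], "score": [1, 2]}
)

# Non-inplace: drop() builds a new data frame, then the name is rebound to it.
reassigned = combined.copy()
reassigned = reassigned.drop("REGION_x", axis=1)

# Inplace: the existing object is modified directly.
modified = combined.copy()
modified.drop("REGION_x", axis=1, inplace=True)

# Both routes end up with the same contents.
same_result = reassigned.equals(modified)
```

Either way the name ends up pointing at a data frame without REGION_x; what differs is whether a new object was built along the way.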

inplace is ideal when we only want a single version of a data frame. For example, if we want to modify just combined throughout the whole project, yeah, use inplace all the time when available.

Using just inplace can be useful for performance reasons as well, because we don’t need to create copies of combined all the time just to assign them back to combined again. If the combined data frame is big, the non-inplace functions/methods may be slower because they need to create a copy of the big data frame first and then modify that. inplace skips all that and directly modifies the data frame.

Related to your question: most of the time we want several versions of a data frame while keeping the original intact. In those situations, inplace is not ideal. We want to create a modified copy of the original and then assign it to a different variable.
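A small sketch of that second situation, again with a made-up stand-in for combined (the without_region name and the score column are my own, not from the course):

```python
import pandas as pd

# Hypothetical stand-in for `combined`.
combined = pd.DataFrame({"REGION_x": ["N", "S"], "score": [1, 2]})

# Assign the result to a new name so the original stays available.
without_region = combined.drop("REGION_x", axis=1)

original_columns = list(combined.columns)
new_columns = list(without_region.columns)
```

Here combined still has REGION_x, and without_region is a separate version without it.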

With that said, I recently realized that the discussion on the use of inplace is more nuanced, and what I mentioned is what some pandas users believe about inplace. Here’s one StackOverflow thread that touches on the more technical (and possibly harmful) aspect of inplace usage:

The issues they mention are more about preventing the accidental modification of a data frame. Depending on how you index a data frame, the result might be a copy or a reference.

A copy of a data frame is independent from the original data frame. Modifying it won’t modify the original data frame. The behavior is different for a reference in which modifying the reference will modify the original data frame.
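The simplest case to see this with is plain assignment versus .copy(). This is only a sketch: the people data below is invented, and plain assignment is used as the reference because it behaves the same across pandas versions, whereas whether indexing hands back a reference depends on the operation and the pandas version.

```python
import pandas as pd

# Made-up data for illustration.
people = pd.DataFrame({"Name": ["Jin", "Ana"], "Age": [30, 25]})

alias = people               # a second name for the same object (a reference)
independent = people.copy()  # a separate data frame with the same contents

alias.loc[0, "Age"] = 31         # shows up in `people`
independent.loc[1, "Age"] = 99   # does not show up in `people`

ages_after = people["Age"].tolist()
```

After this runs, people reflects the change made through alias but not the one made through independent.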

The most common way to end up with a copy is chaining, especially chained indexing.

Let’s use this Dataquest practice screen as an example. Below is a chained assignment:

people[people.Name == 'Jin']['Age'] = 10

The above will not modify the people data frame because the chained indexing returns a copy, so only the copy is modified.

To modify the people data frame, you’ll need to use only a single indexing operation which can be done with .loc:

people.loc[people.Name == 'Jin', 'Age'] = 10

All of the indexing has to happen inside that single .loc call if you want to modify the original data frame, so the following won’t work:

people.loc[people.Name == 'Jin']['Age'] = 10
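The three variants can be compared side by side. This is a sketch on invented data (the make_people helper and the ages are mine, not from the practice screen), and the warnings pandas may print for the chained forms are silenced so only the results are compared.

```python
import warnings

import pandas as pd


def make_people():
    # Hypothetical stand-in for the practice screen's `people` data frame.
    return pd.DataFrame({"Name": ["Jin", "Ana"], "Age": [30, 25]})


with warnings.catch_warnings():
    # pandas may warn about chained assignment; ignore that for this demo.
    warnings.simplefilter("ignore")

    chained = make_people()
    chained[chained.Name == "Jin"]["Age"] = 10  # writes to a temporary copy

    loc_then_chain = make_people()
    loc_then_chain.loc[loc_then_chain.Name == "Jin"]["Age"] = 10  # same problem

# One indexing operation: row mask and column label together in .loc.
single_loc = make_people()
single_loc.loc[single_loc.Name == "Jin", "Age"] = 10

chained_ages = chained["Age"].tolist()
loc_then_chain_ages = loc_then_chain["Age"].tolist()
single_loc_ages = single_loc["Age"].tolist()
```

Only single_loc ends up with Jin’s age changed; the other two data frames keep their original values.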

But chaining can also be hidden.

I’m going to use a different example which is the eBay project. Consider the following code:

autos = pd.read_csv("autos.csv", encoding="Latin-1")
privat_autos = autos[autos['seller'] == 'privat']
privat_autos['seller'] = 'public'

At first glance, it seems that no chaining has happened. But chaining can happen across two lines. In the above, privat_autos becomes a copy because of the hidden chaining. As a result, it will be modified but autos will not.
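Here is a small sketch of that two-line pattern. Since autos.csv isn’t available here, the data frame is built by hand with invented rows; any warning pandas prints for the second step is silenced so the outcome is easier to see.

```python
import warnings

import pandas as pd

# Invented rows standing in for the contents of autos.csv.
autos = pd.DataFrame(
    {"seller": ["privat", "gewerblich", "privat"], "price": [500, 9000, 1200]}
)

with warnings.catch_warnings():
    # Older pandas versions emit SettingWithCopyWarning on the assignment.
    warnings.simplefilter("ignore")
    privat_autos = autos[autos["seller"] == "privat"]  # step 1: filter
    privat_autos["seller"] = "public"                  # step 2: assign

autos_sellers = autos["seller"].tolist()
privat_sellers = privat_autos["seller"].tolist()
```

The assignment lands in privat_autos only; the seller column in autos keeps its original values.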

That might be what you want, but in other situations you might want privat_autos to be a reference, so that any change to privat_autos also affects autos. pandas does not know your intentions, which is why .copy is needed to make your intention explicit. When you use .copy, pandas knows for certain that a copy is being returned, so it does not need to worry that you are accidentally modifying a copy when you actually wanted to modify a reference.

Below, we’re making it clear to pandas that we want and know that any modification to privat_autos will not affect autos:

privat_autos = autos[autos['seller'] == 'privat'].copy()
privat_autos['seller'] = 'public'
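As a sketch on invented data (a hand-built autos instead of autos.csv), the explicit copy looks like this. The last step is my own addition: if the goal had been to change autos itself, one option is a single .loc assignment on autos rather than going through privat_autos.

```python
import pandas as pd

# Invented rows standing in for the contents of autos.csv.
autos = pd.DataFrame(
    {"seller": ["privat", "gewerblich", "privat"], "price": [500, 9000, 1200]}
)

# Explicit copy: changes to privat_autos are meant to stay in privat_autos.
privat_autos = autos[autos["seller"] == "privat"].copy()
privat_autos["seller"] = "public"
autos_after_copy_edit = autos["seller"].tolist()

# If the intent was to change autos instead, index autos directly.
autos.loc[autos["seller"] == "privat", "seller"] = "public"
autos_after_loc_edit = autos["seller"].tolist()
```

The first edit leaves autos untouched, while the .loc assignment changes the matching rows in autos itself.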

This is also why SettingWithCopyWarning is a warning and not an error. It’s warning pandas users to be clear about what they’re trying to do, but it won’t stop the code from running if the warning is ignored. Ignoring it might be fine in some projects which use a lot of copies; but if you want to use references, it’s best not to ignore the warning because you might be modifying copies instead of references.