Different assignment / access options in Pandas

Hello everyone,

I started doing the third Guided project but I wasn’t getting anywhere, which made me realize I seriously needed to review the pandas course.
I think I got a little confused by all the different libraries, object types, and syntaxes, and ended up mixing everything in my head.

I want to discuss a specific topic here: the different ways to assign / access a dataframe in pandas.
In the last challenge of mission 381, we have to create two series: `industry_usa` and `sector_china`.

I used the following code (taking only USA to shorten the example):

```
usa_hq = f500[f500['country'] == 'USA']
industry_usa = usa_hq['industry'].value_counts().head(2)
```

However, if I try to "replace" `usa_hq` inside `industry_usa` as if it were a variable, I get an error:

```
industry_usa = f500[f500['country'] == 'USA', 'industry'].value_counts().head(2)
```

```
TypeError: 'Series' objects are mutable, thus they cannot be hashed
```

Instead, one must use the `.loc` indexer for this code to work:

```
industry_usa = f500.loc[f500['country'] == 'USA', 'industry'].value_counts().head(2)
```

Finally the solution shows the following:

```
industry_usa = f500["industry"][f500["country"] == "USA"].value_counts().head(2)
```

which is a little weird, since this double-bracket syntax was never mentioned during the course. I thought it was only possible for lists in Python but not for dataframes (early in the course it actually shows that array selection is "easier" and compares it to the double-bracket syntax for lists).

All in all, it looks like there are (at least) 3 ways of getting to the right answer. My problem is that I don't understand how each way works and how they differ in the back-end. Any chance you can help me understand better what's happening behind all this? And also, why does 'replacing' my `usa_hq` inside `industry_usa` generate an error, but applying the method directly on the helper variable does not?

Thanks already!

Please use backticks ` to enclose code, or triple backticks ``` (each on its own line) to wrap code blocks, so single quotation marks do not get mangled when you paste them here.

DQ uses pandas version 0.22. You can see it with `print(pd.__version__)`.
Next, look at the traceback and trace (using function names or, more precisely, file line numbers) through the 0.22 pandas code on GitHub:

When you do `industry_usa = f500[f500['country'] == 'USA', 'industry']`, pandas treats everything inside `f500[ ... ]`, i.e. `f500['country'] == 'USA', 'industry'`, as the key (not exactly: it first evaluates the expression before the comma to a series, so the key is really a tuple of a series and a string). It then goes through a series of functions:

  1. `__getitem__` in frame.py
  2. `_getitem_column` in frame.py
  3. `_get_item_cache` in generic.py
  4. `__hash__` in generic.py
Before it reaches 4, it tries to get the key from the cache with `res = cache.get(item)` (you can see this in the traceback). However, the key is a tuple containing a non-hashable object (a series), so the `get` fails. You can replicate this with your own small experiment:

```
d = {}
d.get([1, 2])
```

```
TypeError: unhashable type: 'list'
```
The error message is slightly different because this comes from Python itself and has nothing to do with pandas, but the concept is similar.
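You can also see what the square brackets actually deliver with a tiny probe class (hypothetical, just to show the key that `__getitem__` receives):

```python
class Probe:
    def __getitem__(self, key):
        # whatever appears inside [...] arrives here as a single key;
        # a comma turns it into a tuple
        return key

p = Probe()
print(p['a'])          # 'a'
print(p[[1, 2], 'b'])  # ([1, 2], 'b') - a tuple whose first element is unhashable
```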

What I don't understand is why `__hash__` is hardcoded to unconditionally raise TypeError. That seems to say every key is unhashable, but how can that be? In the previous step of `res = cache.get(item)`, some of those items must be normal hashable items and should not cause a TypeError, so probably my understanding of when `__hash__` is called is weak.
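One way to probe when `__hash__` actually gets called (a toy class mimicking the unconditional raise, not real pandas code): `dict.get` only hashes the key object itself, so an ordinary string key goes through `str.__hash__` and never touches the hardcoded raise. It only fires when such an object, or a tuple containing one, is itself the key:

```python
class Mutable:
    def __hash__(self):
        # mimics pandas NDFrame.__hash__, which unconditionally raises
        raise TypeError("'Mutable' objects are mutable, thus they cannot be hashed")

cache = {'industry': 1}

# an ordinary string key uses str.__hash__, so this works fine
print(cache.get('industry'))  # 1

# hashing a tuple hashes each of its elements, so a tuple containing
# a Mutable invokes Mutable.__hash__ and raises
try:
    cache.get((Mutable(), 'industry'))
except TypeError as e:
    print(e)
```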

In the newer pandas 1.1.3, frame.py has changed and different methods are used to get the key. There, the series and the column indexer form a valid tuple as the key, but that key cannot be found, so you see an invalid-key error instead.

The solution of `f500['industry'][f500['country'] == 'USA']` is not double-bracket syntax. It gets a series first using `f500['industry']`, then boolean-filters this series with another series. In this scenario of simply filtering rows, I don't think there's a speed difference between this and your method of `f500.loc[f500['country']=='USA','industry']`. Both end up with a series for `value_counts` to be applied on eventually.
(As for why `f500.loc[f500['country']=='USA','industry']` works, you can trace through the functions and tell us what's going on.)
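On a toy frame (made-up data standing in for `f500`), all three spellings produce the same series:

```python
import pandas as pd

f500 = pd.DataFrame({'country': ['USA', 'China', 'USA'],
                     'industry': ['Tech', 'Energy', 'Retail']})

mask = f500['country'] == 'USA'

a = f500[mask]['industry']      # helper-variable style: filter rows, then take the column
b = f500.loc[mask, 'industry']  # .loc with a boolean row mask and a column label
c = f500['industry'][mask]      # take the column first, then boolean-filter the series

assert a.equals(b) and b.equals(c)
```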

However, when you are doing groupby aggregations (`groupby.apply`, `groupby.agg`), it is better to filter the column first.

```
import pandas as pd

df = pd.DataFrame({'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
                   'Max Speed': [380., 370., 24., 26.],
                   'Height': [0.5, 0.4, 0.3, 0.2]})  # 'Height' values made up for the example

df.groupby('Animal').agg('mean')['Height']      # Number 1
df.groupby('Animal')['Height'].agg('mean')      # Number 2
df['Height'].groupby(df['Animal']).agg('mean')  # Number 3
```

All 3 will return the same thing. Number 2 is better because the aggregation is only applied to the column of interest, instead of all columns unnecessarily as in Number 1.
You can see the intermediate groupby objects are different: `df.groupby('Animal')` is a DataFrameGroupBy, while `df.groupby('Animal')['Height']` is a SeriesGroupBy.
Number 3 goes overboard and makes the syntax more complicated, because if you start from `df['Height']`, which is a series, `('Animal')` is inaccessible, so `df['Animal']` has to be thrown into `groupby()` instead of just `('Animal')`. This does demonstrate that groupby does not have to take only a column name as input; in fact, it can take any collection of the same length, even one unrelated to the df you are calling it on.
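To illustrate that last point, here is a sketch (with made-up labels) where the grouping key is an external list unrelated to the frame:

```python
import pandas as pd

df = pd.DataFrame({'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
                   'Max Speed': [380., 370., 24., 26.]})

# group by an external list of the same length, unrelated to df's columns
labels = ['x', 'y', 'x', 'y']
out = df['Max Speed'].groupby(labels).agg('mean')
print(out['x'])  # 202.0, i.e. (380 + 24) / 2
```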

The time-expensive operation is the aggregation, so there is no need to split hairs over filtering, as long as only one column is passed just before the aggregation. Number 2 is the most readable and fast.


Hello hanqi! Thanks a lot for your answer - it’s pretty clear altogether and very helpful.

I'm still new to coding, and when I get a big traceback error in pandas / numpy I tend to get lost very fast, since it refers to back-end code inside the library :frowning: