Is there a work around with get_group() throwing error while trying to select data for a set of groups?

Screen Link: https://app.dataquest.io/m/294/guided-project%3A-exploring-ebay-car-sales-data/8/storing-aggregate-data-in-a-dataframe

I am not actually “stuck” but ran into a limitation with pandas get_group(). while trying to replace the not-so-optimal looping method with groupby method( after completing the mission on it ) for aggregating on brand in the Guided Project Exploring Ebay Car Sales Data.

The issue is we don’t want to aggregate by the entire set of brands in the dataset but use a selective list of brands. However, get_group doesn’t seem to allow sub-setting the data based on more than one group:

data_for_vw_only = autos.groupby('brand').get_group(‘volkswagen’)   => WORKS
data_for_vw_ford = autos.groupby('brand').get_group([‘volkswagen’, 'ford'])   => GIVES ERROR

After some looking around, I have this code, which works, but in the end, there is still a for loop. So, I am wondering if there’s another efficient Pandas way. I am yet to learn

My Code:

grouped = autos.groupby('brand')
subset_df = pd.concat([grouped.get_group(name) for name in brands_gt_100])
# Where brands_gt_100 is a list of brands  that have atleast 100 listings
brand_price_df = pd.DataFrame(subset_df.groupby('brand')['price'].mean()).sort_values('price',ascending=False)

What I expected to happen:

subset_of_brands_data = autos.groupby('brand').get_group(brands_gt_100)   # Should have worked

What actually happened:

TypeError: unhashable type: 'list'
# when converted to a tuple, gives similar  error
KeyError: ('volkswagen', 'bmw', 'opel'.........)
1 Like

Hi @DnaData,
the get_group works like a dictionary, with input being one key and the output one dataframe in this case.

That’s probably why the shortcut doesn’t work - get_group accepts only a single key value.

You could create a filtered autos subset, based on brands_gt_100, e.g.:

filter_condition = autos.loc[:, "brand"].isin(brands_gt_100)
autos_brands_gt_100 = autos[filter_condition]

Then use the autos_brands_gt_100 in place of subset_df in the above example.

So, I would go with isin() method to filter the dataset, then use groupby, mean, etc.

Hope this helps. :slight_smile:

1 Like

Thanks @kakoori for taking the time to respond to my question. Much appreciated.
I do agree, the alternative and most convenient thing to do here is to first slice the dataset and then apply the groupby on the subset. I was wishing Pandas may have introduced something new like get_groups :slight_smile:

BTW, thanks also for the . isin() construct…very nice thing to know

1 Like