Data Aggregation in Python

Apply the GroupBy.agg() method to happy_grouped.
Pass a list containing np.mean and np.max into the method.
Assign the result to happy_mean_max.

A custom function named dif calculates
the difference between the mean and maximum values.
dif is then passed into GroupBy.agg(), and the result is assigned to mean_max_dif.

import numpy as np
grouped = happiness2015.groupby('Region')
happy_grouped = grouped['Happiness Score']
def dif(group):
    return (group.max() - group.mean())
happy_mean_max = happy_grouped.agg([np.mean, np.max])
mean_max_dif = happy_grouped.agg(dif)
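
To see what the two agg() calls return, here is a minimal sketch on made-up data. The DataFrame below is a hypothetical stand-in for happiness2015 (two regions, five invented scores), and the string names "mean" and "max" are used in place of np.mean and np.max, which recent pandas versions prefer; both are assumptions for illustration, not part of the exercise.

```python
import pandas as pd

# Made-up scores for illustration; not the real happiness2015 data.
happiness2015 = pd.DataFrame({
    "Region": ["A", "A", "B", "B", "B"],
    "Happiness Score": [7.0, 5.0, 4.0, 6.0, 8.0],
})

happy_grouped = happiness2015.groupby("Region")["Happiness Score"]

def dif(group):
    # group is a pandas Series holding one region's scores
    return group.max() - group.mean()

# A list of functions gives one column per function.
happy_mean_max = happy_grouped.agg(["mean", "max"])

# A single function gives one value per region.
mean_max_dif = happy_grouped.agg(dif)

print(happy_mean_max)
print(mean_max_dif)
```

With these numbers, region A has mean 6 and max 7, so dif gives 1; region B has mean 6 and max 8, so dif gives 2.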

I do not understand why the mean is subtracted from the maximum here.
I have seen calculations like max - min to find the range, but I have never seen this calculation before. What kind of descriptive statistic is it, please?

I think it is feature scaling? Check out this link


Thanks. It mentioned that that method is more accurate, since much accuracy is lost if the mean is large. I am not sure how to judge at what point the mean counts as large, i.e. large relative to what?

I searched online and saw that if the spread of values in the data set is large, the mean is not as representative of the data as when it’s small, because a large spread indicates that there are probably large differences between individual scores. Nothing came up for a large mean, though I saw it mentioned in the link: "If the x_i are all close, even if their mean is large, this will be quite small." So I suppose it means the mean is large compared to the rest of the data points.
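
One way to make "large mean" concrete, assuming the quoted sentence is about computing variance by subtracting the mean before squaring (an assumption, since the linked page is not quoted in full here): the values below are invented, sit within about 12 of each other, and have a mean near 1e9. The sketch compares the one-pass shortcut (mean of squares minus square of the mean) with the two-pass form.

```python
# Made-up values: close together, but with a large mean (about 1e9).
xs = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]
n = len(xs)
mean = sum(xs) / n

# One-pass shortcut: mean of squares minus square of the mean.
# Both terms are around 1e18, where floats are spaced 128 apart,
# so the small difference between them cannot be represented.
naive_var = sum(x * x for x in xs) / n - mean * mean

# Two-pass form: subtract the mean first, then square.
# The deviations are -6, -3, 3, 6, which are small and exact.
two_pass_var = sum((x - mean) ** 2 for x in xs) / n

print(naive_var, two_pass_var)
```

The true population variance here is 22.5, which the two-pass form returns and the shortcut does not. In this reading, "large" means large relative to the spread of the values, not large in absolute terms.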

Would you happen to know how the dif function knows which is max and which is mean when it is passed the list? The order does not match the calculation: it is passed [a, b] but then it does b - a.

happy_mean_max = happy_grouped.agg([np.mean, np.max]) is passed into here:
return (group.max() - group.mean())

def dif(group):
    return (group.max() - group.mean())
happy_mean_max = happy_grouped.agg([np.mean, np.max])
mean_max_dif = happy_grouped.agg(dif)

Thanks
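
On the question of order: in the code as posted, the list [np.mean, np.max] is never handed to dif. The two agg() calls are independent, and agg(dif) calls dif with each region's scores as a Series, so dif computes max and mean itself. A small sketch with made-up data (the DataFrame and the received list are hypothetical, added only to show what dif is given):

```python
import pandas as pd

# Made-up data for illustration only.
scores = pd.DataFrame({
    "Region": ["A", "A", "B", "B", "B"],
    "Happiness Score": [7.0, 5.0, 4.0, 6.0, 8.0],
})
happy_grouped = scores.groupby("Region")["Happiness Score"]

received = []  # records the values dif is given on each call

def dif(group):
    received.append(group.tolist())
    return group.max() - group.mean()

mean_max_dif = happy_grouped.agg(dif)                 # does not use the list
happy_mean_max = happy_grouped.agg(["mean", "max"])   # does not use dif

print(received)  # raw scores per region, not [mean, max] pairs

# The same numbers can be derived from the two-column result by name.
from_columns = happy_mean_max["max"] - happy_mean_max["mean"]
print(from_columns.tolist() == mean_max_dif.tolist())
```

Because the columns are looked up by name, the order in which the functions were listed does not affect the subtraction.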
