Pandas df.corr - one variable across multiple cols error - World Happiness

I am trying to adapt the code from the above link (my attempt is below) to find the pairwise correlations between
one variable, ‘HAPPINESS SCORE’, and these columns:
‘ECONOMY GDP PER CAPITA’, ‘FAMILY’,
‘HEALTH LIFE EXPECTANCY’, ‘FREEDOM’, ‘TRUST GOVERNMENT CORRUPTION’,
‘GENEROSITY’
The data is taken from the mission
World Happiness — Working With Missing And Duplicate Data,
although what I am trying to do has nothing to do with the mission itself.

At the end I want to generate a heatmap, but first need to get the correlation table right.

I get errors when I run this code:
KeyError: 'HAPPINESS SCORE'

During handling of the above exception, another exception occurred:

KeyError Traceback (most recent call last)
in
4 get_ipython().run_line_magic('matplotlib', 'inline')
5
----> 6 df_corr = combined[combined.columns[5:11]].corr()['HAPPINESS SCORE']

This is the output of combined.columns:
Index(['COUNTRY', 'REGION', 'HAPPINESS RANK', 'HAPPINESS SCORE',
'STANDARD ERROR', 'ECONOMY GDP PER CAPITA', 'FAMILY',
'HEALTH LIFE EXPECTANCY', 'FREEDOM', 'TRUST GOVERNMENT CORRUPTION',
'GENEROSITY', 'DYSTOPIA RESIDUAL', 'YEAR', 'LOWER CONFIDENCE INTERVAL',
'UPPER CONFIDENCE INTERVAL', 'WHISKER HIGH', 'WHISKER LOW'],
dtype='object')

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

df_corr = combined[combined.columns[5:11]].corr()['HAPPINESS SCORE']

df_corr
#fig, ax = plt.subplots(figsize=(30,25))

#sns.heatmap(df_corr.to_frame(),annot=True, annot_kws={'size':12},cmap="GnBu")
#plt.show()

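Look at the data selection here: HAPPINESS SCORE is not within this range. combined.columns[5:11] covers positions 5 to 10, while HAPPINESS SCORE is at index 3, so the column is missing from the correlation matrix and the lookup raises a KeyError.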

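Code
# Pass a list (double brackets) to pick several labels from the resulting Series.
# On pandas 2.0+ you may need combined.corr(numeric_only=True) to skip text columns.
df_corr = combined.corr()['HAPPINESS SCORE'][['ECONOMY GDP PER CAPITA', 'FAMILY',
'HEALTH LIFE EXPECTANCY', 'FREEDOM', 'TRUST GOVERNMENT CORRUPTION',
'GENEROSITY']]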
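
To see why the original slice misses the column, here is a minimal sketch. It assumes a stand-in index (`columns`, a made-up name) holding the first twelve column names from the question, with no real data involved:

```python
import pandas as pd

# Hypothetical stand-in for combined.columns (first twelve names from the question).
columns = pd.Index(['COUNTRY', 'REGION', 'HAPPINESS RANK', 'HAPPINESS SCORE',
                    'STANDARD ERROR', 'ECONOMY GDP PER CAPITA', 'FAMILY',
                    'HEALTH LIFE EXPECTANCY', 'FREEDOM',
                    'TRUST GOVERNMENT CORRUPTION', 'GENEROSITY',
                    'DYSTOPIA RESIDUAL'])

# Positions 5..10 are selected; position 11 is excluded, as with any Python slice.
sliced = list(columns[5:11])

# Where the target column actually lives, and whether the slice contains it.
score_position = columns.get_loc('HAPPINESS SCORE')
score_in_slice = 'HAPPINESS SCORE' in sliced

print(sliced)
print(score_position, score_in_slice)
```

Since the target column is outside the slice, `.corr()` never sees it, and asking the result for `['HAPPINESS SCORE']` fails.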

Thanks,

would you know how to select ‘HAPPINESS SCORE’ as well?
My problem is that it is not part of columns 5:11,
and as you mentioned it needs to be included in the selection.
I tried this:
df_corr = combined[combined.columns[5:11,3]].corr()['HAPPINESS SCORE'][:-3]
but I get the error:
‘too many indices for array’

I guess I could write out all the columns I would like in the correlation matrix,
but that makes the code lengthier.

Actually, I tried that below and still got the KeyError. I suppose a KeyError means that a dictionary key (or, for an array, an index) cannot be found, so even though a dataframe is not a dictionary, I take it to mean that it cannot find [‘HAPPINESS SCORE’]?

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

#df_corr = combined[combined.columns[5:11,3]].corr()['HAPPINESS SCORE'][:-3]
df_corr = combined['HAPPINESS SCORE','ECONOMY GDP PER CAPITA', 'FAMILY',
       'HEALTH LIFE EXPECTANCY', 'FREEDOM', 'TRUST GOVERNMENT CORRUPTION',
       'GENEROSITY'].corr()['HAPPINESS SCORE'][:-3]
df_corr
#fig, ax = plt.subplots(figsize=(30,25))

#sns.heatmap(df_corr.to_frame(),annot=True, annot_kws={'size':12},cmap="GnBu")
#plt.show();

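I think you got the KeyError due to syntax: to select several columns you have to pass a list, which means double square brackets. Try this (note that the trailing [:-3] carried over from your code drops the last three entries of the result):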

df_corr = combined[['HAPPINESS SCORE','ECONOMY GDP PER CAPITA', 'FAMILY',
       'HEALTH LIFE EXPECTANCY', 'FREEDOM', 'TRUST GOVERNMENT CORRUPTION',
       'GENEROSITY']].corr()['HAPPINESS SCORE'][:-3]
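
As a rough illustration of the difference between the two bracket styles, here is a small sketch. The frame `demo` and its numbers are made up; only the column names mirror the question:

```python
import pandas as pd

# Made-up numbers; only the column names come from the thread.
demo = pd.DataFrame({
    'HAPPINESS SCORE': [7.5, 6.9, 5.2, 4.1],
    'FAMILY': [1.3, 1.2, 0.9, 0.6],
    'FREEDOM': [0.6, 0.4, 0.5, 0.3],
})

# Single brackets with commas build a tuple, which pandas looks up as ONE label.
try:
    demo['HAPPINESS SCORE', 'FAMILY']
    tuple_key_failed = False
except KeyError:
    tuple_key_failed = True

# Double brackets pass a list of labels and return a sub-DataFrame.
subset = demo[['HAPPINESS SCORE', 'FAMILY', 'FREEDOM']]
corr_with_score = subset.corr()['HAPPINESS SCORE']

print(tuple_key_failed)
print(corr_with_score)
```

So the KeyError in the earlier attempt refers to the whole tuple of names, not to a missing ‘HAPPINESS SCORE’ column.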

Try storing it in a variable, like this:

selected_columns = ['HAPPINESS SCORE','ECONOMY GDP PER CAPITA', 'FAMILY',
       'HEALTH LIFE EXPECTANCY', 'FREEDOM', 'TRUST GOVERNMENT CORRUPTION',
       'GENEROSITY']
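
One way the stored list might then be used, sketched on a made-up miniature of `combined` (the frame `demo`, its values and the shortened column list are assumptions for illustration):

```python
import pandas as pd

# Hypothetical miniature of `combined`; the values are made up.
demo = pd.DataFrame({
    'COUNTRY': ['A', 'B', 'C', 'D'],
    'HAPPINESS SCORE': [7.5, 6.9, 5.2, 4.1],
    'FAMILY': [1.3, 1.2, 0.9, 0.6],
    'GENEROSITY': [0.3, 0.5, 0.2, 0.4],
})

selected_columns = ['HAPPINESS SCORE', 'FAMILY', 'GENEROSITY']

# Indexing with the list variable is the same as writing double brackets inline.
corr_matrix = demo[selected_columns].corr()
df_corr = corr_matrix['HAPPINESS SCORE']

print(df_corr)
```

Keeping the names in one variable also means the text column (COUNTRY here) never reaches `.corr()`.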

Thanks mathmike,

Excellent. It worked.

This is a slightly different question, but I am a bit stuck on how to remove the correlation of ‘HAPPINESS SCORE’ with itself,
which is the coefficient of 1 at the top here:
(screenshot: happy score)

The above link suggests the following:

if you remove the column selection in the end you’ll get a correlation matrix of all other columns you are analysing. The last [:-1] is to remove correlation of ‘special_col’ with itself.

So I tried it like this, but I still have the Pearson correlation of 1 at the top.

I used [:1] because ‘HAPPINESS SCORE’ is the first column here:
['HAPPINESS SCORE'][:1]

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns  # needed for sns.heatmap below
%matplotlib inline

fig, ax = plt.subplots(figsize=(8,8))

#df_corr = combined[combined.columns[5:11,3]].corr()['HAPPINESS SCORE'][:-3]
#df_corr = combined['HAPPINESS SCORE','ECONOMY GDP PER CAPITA', 'FAMILY',
 #      'HEALTH LIFE EXPECTANCY', 'FREEDOM', 'TRUST GOVERNMENT CORRUPTION',
  #     'GENEROSITY'].corr()['HAPPINESS SCORE'][:-3]

df_corr = combined[['HAPPINESS SCORE','ECONOMY GDP PER CAPITA', 'FAMILY',
       'HEALTH LIFE EXPECTANCY', 'FREEDOM', 'TRUST GOVERNMENT CORRUPTION',
       'GENEROSITY']].corr()['HAPPINESS SCORE'][:1]

#df_corr


sns.heatmap(df_corr.to_frame(),annot=True, annot_kws={'size':12},cmap="GnBu")
plt.show();
> In [15]: data[data.columns[1:]].corr()['special_col'][:-1]

Ok…I think I can picture your setup, but let me confirm:

In the link you’re following, the column that results in r = 1.0 is the last one, so they suggest slicing with [:-1] to select everything except the last element. Your setup, on the other hand, has ‘HAPPINESS SCORE’ (i.e. the column generating r = 1.0) in the first position. Is this correct?

Assuming I’ve understood your setup and question correctly, you will want to slice with [1:], because you want to skip [0], which is HAPPINESS SCORE. The example you’re following used [:-1] to skip [-1] because theirs was at the end, while yours is at the beginning. Your [:1] does the opposite of what you want: it keeps only the first element.
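
The three slices can be compared side by side in a short sketch. The frame `demo` and its values are made up, and `.iloc` is used here to make the positional slicing explicit (it selects the same elements as the plain `[1:]` discussed above). Dropping by label is an alternative that does not depend on where the column sits:

```python
import pandas as pd

# Made-up data with HAPPINESS SCORE as the first column, as in the question.
demo = pd.DataFrame({
    'HAPPINESS SCORE': [7.5, 6.9, 5.2, 4.1],
    'FAMILY': [1.3, 1.2, 0.9, 0.6],
    'FREEDOM': [0.6, 0.4, 0.5, 0.3],
})
corr_with_score = demo.corr()['HAPPINESS SCORE']

# [:1] keeps ONLY position 0, i.e. the self-correlation of 1.0.
first_only = corr_with_score.iloc[:1]

# [1:] skips position 0 and keeps everything after it.
without_first = corr_with_score.iloc[1:]

# Alternative: drop the entry by label, wherever it happens to be.
by_label = corr_with_score.drop('HAPPINESS SCORE')

print(first_only)
print(without_first)
```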

Is this making sense? Am I getting hungry? Should I eat? :laughing:


Hi mathmike,

(screenshot: hscore2)

Yes, it makes sense; it was explained very clearly and logically.
Thanks, the result now looks the way I wanted.

I had seen slicing that starts from the second row with [1:] explained before, but it did not occur to me that it was the solution here. My mistake was selecting with [:1] instead.

Indeed should eat!
Regards
JB


Strangely enough, as I was reading this, it came to me why you (probably) did that: it’s the first kind of slicing we were taught…cut off those headers! :laughing:

So many of the first few times we used slice indexing, it was to remove the header row from lists of lists or ndarrays.

Once again, always a pleasure to help! :sunglasses:


There are two ways I can think of.
First, append:

selected_columns = combined.columns[5:11].append(combined.columns[[3]])

Second, using numpy's np.r_:

import numpy as np
selected_columns = combined.columns[np.r_[3,5:11]]
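
A small sketch of what the two approaches return, assuming a stand-in index (`columns`, a made-up name) with the first twelve column names from the question. The main difference is where ‘HAPPINESS SCORE’ ends up, which decides whether the self-correlation is later removed with [:-1] or [1:]:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for combined.columns (first twelve names from the question).
columns = pd.Index(['COUNTRY', 'REGION', 'HAPPINESS RANK', 'HAPPINESS SCORE',
                    'STANDARD ERROR', 'ECONOMY GDP PER CAPITA', 'FAMILY',
                    'HEALTH LIFE EXPECTANCY', 'FREEDOM',
                    'TRUST GOVERNMENT CORRUPTION', 'GENEROSITY',
                    'DYSTOPIA RESIDUAL'])

# append: the slice comes first, so HAPPINESS SCORE lands at the END.
appended = columns[5:11].append(columns[[3]])

# np.r_ builds the positions [3, 5, 6, ..., 10], so HAPPINESS SCORE comes FIRST.
positions = np.r_[3, 5:11]
via_r = columns[positions]

print(list(appended))
print(list(via_r))
```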

Many thanks DishinGoyani
