Loop through dataframe conditionally

I’m trying to loop through a dataframe conditionally. I know that there are more efficient ways to do this, but I’m doing it this way so that I can test its speed compared to other methods.

I need to loop through the dataframe ‘cps’ and, wherever the value in the ‘union’ column equals ‘Union’, add the value in the ‘wage’ column to a running total, then divide that total by the number of matching rows. Basically, I need the average wage of everyone who was in a union.

Dataframe columns are:
wage, educ, race, sex, hispanic, south, married, exper, union, age, sector

Here’s what I have so far:

def avg_union_wage_loop(x):
    count = 0
    wageSum = 0
    for row in x['union']:
        if row['union'] == 'Union':
             wageSum = wageSum + x.iloc[row['wage']]
             count = count + 1
    avg = wageSum/count
    return avg

This line throws an error:

 wageSum = wageSum + x.iloc[row['wage']]

and the error that I get is:

 string indices must be integers

I’m not entirely sure what to do next. I guess I’m stuck on how I reference the exact row and column I need to reference to get the wage value so that I can add it to a variable where I can sum it up.

Any help would be appreciated.


Hi @charlesd,

Can you please share the Jupyter notebook file and the required datasets? That will allow us to help you better.

Here I am assuming that x contains a dataframe due to the kind of error you are getting.

x['union'] returns a Series, so on each iteration row holds a single string from that column. When you then evaluate:

row['union']

Python raises the following error:

string indices must be integers

This is because a string (row) can only be indexed with an integer, but here it is being indexed with another string (‘union’).

However, since you report the error on this line:

wageSum = wageSum + x.iloc[row['wage']]

my assumption could be wrong.
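
If x is indeed a DataFrame, one way to keep the explicit loop is to iterate over the 'union' and 'wage' columns together instead of indexing into each string. This is only a minimal sketch: the sample data below is made up for illustration and is not the real cps dataset, and 'Not' is a placeholder label for non-union rows.

```python
import pandas as pd

def avg_union_wage_loop(x):
    """Average wage of rows whose 'union' value is 'Union', using an explicit loop."""
    count = 0
    wage_sum = 0.0
    # Walk the two columns in parallel so each step sees one row's values.
    for union, wage in zip(x['union'], x['wage']):
        if union == 'Union':
            wage_sum += wage
            count += 1
    # Avoid dividing by zero when no row matches.
    return wage_sum / count if count else float('nan')

# Hypothetical sample data standing in for the real 'cps' dataframe.
cps_sample = pd.DataFrame({
    'wage': [10.0, 20.0, 30.0, 40.0],
    'union': ['Union', 'Not', 'Union', 'Not'],
})
result = avg_union_wage_loop(cps_sample)
```

Pairing the columns with zip avoids positional lookups such as x.iloc[...] entirely, which is where the original loop got tangled.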

Best,
Sahil

I believe this can be easily done using pandas.DataFrame.groupby and pandas.core.groupby.GroupBy.mean like this:

cps.groupby('union')['wage'].mean()
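
Note that groupby returns one mean per distinct value in 'union', so the union figure still has to be picked out of the result. A small sketch of that, alongside a boolean-mask alternative, assuming the column holds the label 'Union'; the sample data and the 'Not' label are made up for illustration:

```python
import pandas as pd

# Hypothetical sample data standing in for the real 'cps' dataframe.
cps_sample = pd.DataFrame({
    'wage': [10.0, 20.0, 30.0, 40.0],
    'union': ['Union', 'Not', 'Union', 'Not'],
})

# One mean per group; the result is a Series indexed by the 'union' labels.
means_by_group = cps_sample.groupby('union')['wage'].mean()
union_mean = means_by_group['Union']

# Alternative: filter to union rows first, then average only those wages.
masked_mean = cps_sample.loc[cps_sample['union'] == 'Union', 'wage'].mean()
```

Both approaches give the same number for union members; the mask version skips computing means for groups you do not need.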

Best,
Sahil