How to apply a function in a column with string and integer data types

Hi,

In Part 2 (Bar plots) of the ‘Visualizing Frequency Distributions’ course, the exercise asks for a frequency table of the Exp_ordinal variable. Exp_ordinal holds labels based on ranges of the ['Experience'] column.

I am doing the exercises in a local environment; therefore I have to create those labels. I tried this code:

def experience_labels(row):
     if row['Experience'] == 'R':
        return 'Rookie'
     if ( 1 < row['Experience'] <= 3):
        return 'Litte experience'
     if ( 4 < row['Experience'] <= 5):
        return 'Experienced'
     if ( 5 < row['Experience'] <= 10):
        return 'Very experienced'
     if row['Experience'] >= 10:
        return 'Veteran'
    

wnba['Experience labels'] = wnba.apply(experience_labels, axis = 1)

I can’t apply that function because the ['Experience'] column has both string and integer data types.
I get the error:

'<' not supported between instances of 'int' and 'str'

How can I apply that function to a column with both string and integer data types?

Thank you in advance.

Hello,

You can first replace R with 0 and convert the column to int. Then use your function to map 0 to Rookie. I also suggest using Series.apply() instead of DataFrame.apply(). This is how I’d do it:

wnba['Experience labels'] = wnba['Experience labels'].str.replace('R', '0').astype(int)

def experience_labels(row):
     if row['Experience'] == 0:
        return 'Rookie'
     if ( 1 < row['Experience'] <= 3):
        return 'Litte experience'
     if ( 4 < row['Experience'] <= 5):
        return 'Experienced'
     if ( 5 < row['Experience'] <= 10):
        return 'Very experienced'
     if row['Experience'] >= 10:
        return 'Veteran'
    
wnba['Experience labels'] = wnba['Experience labels'].apply(experience_labels)

Hi,

I made some mods to the code and I still get an error:

TypeError: 'int' object is not subscriptable

The code that I wrote is:

wnba['Experience'] = wnba['Experience'].replace('R', '0').astype(int)

def experience_labels(row):
     if row['Experience'] == 0:
        return 'Rookie'
     if ( 1 < row['Experience'] <= 3):
        return 'Litte experience'
     if ( 4 < row['Experience'] <= 5):
        return 'Experienced'
     if ( 5 < row['Experience'] <= 10):
        return 'Very experienced'
     if row['Experience'] >= 10:
        return 'Veteran'

wnba['Experience labels'] = wnba['Experience'].apply(experience_labels)

Which line of the code is causing the error?

That’s probably because you are using row['Experience'] inside the function. Since the function is applied element-wise to a Series, it receives each value directly, so you do not have to specify the name of the column. I forgot to change this in the code. This should work:

wnba['Experience'] = wnba['Experience'].replace('R', '0').astype(int)

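def experience_labels(row):
    # With Series.apply, row is a single value, not a DataFrame row.
    if row == 0:
        return 'Rookie'
    if (1 <= row <= 3):    # 1 to 3 years, inclusive
        return 'Little experience'
    if (4 <= row <= 5):    # 4 to 5 years, inclusive
        return 'Experienced'
    if (5 < row <= 10):
        return 'Very experienced'
    if row > 10:
        return 'Veteran'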

wnba['Experience labels'] = wnba['Experience'].apply(experience_labels)
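
As an alternative to the element-wise function above, the same kind of labels can be produced by binning with pd.cut. This is only a sketch: the toy wnba frame and its values are made up for illustration, and the bin edges assume the ranges 0, 1 to 3, 4 to 5, 6 to 10, and above 10, which may not match the course exactly.

```python
import pandas as pd

# Toy stand-in for the wnba data set; the values are invented for illustration.
wnba = pd.DataFrame({'Experience': ['R', 1, 3, 4, 5, 7, 10, 12]})

# Make every value a string, turn the 'R' marker into '0', then convert to int.
years = wnba['Experience'].astype(str).replace('R', '0').astype(int)

# Assumed bin edges: each bin excludes its left edge and includes its right edge.
bins = [-1, 0, 3, 5, 10, float('inf')]
labels = ['Rookie', 'Little experience', 'Experienced',
          'Very experienced', 'Veteran']

wnba['Experience labels'] = pd.cut(years, bins=bins, labels=labels)
print(wnba)
```

pd.cut returns an ordered categorical column, so a frequency table built from it can be sorted in label order rather than alphabetically.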

Yes, the code works now.

I got the first version of the code (where the column name is used inside the function) from activity ‘2. Sorting Tables for Ordinal Variables’ of the ‘Frequency Distributions’ course.

It specifies the column name in each conditional inside the function, but in the end the function is applied to the whole data set, so it’s applied to a DataFrame and not a Series, right?

def make_pts_ordinal(row):
    if row['PTS'] <= 20:
        return 'very few points'
    if (20 < row['PTS'] <=  80):
        return 'few points'
    if (80 < row['PTS'] <=  150):
        return 'many, but below average'
    if (150 < row['PTS'] <= 300):
        return 'average number of points'
    if (300 < row['PTS'] <=  450):
        return 'more than average'
    else:
        return 'much more than average'
    
wnba['PTS_ordinal_scale'] = wnba.apply(make_pts_ordinal, axis = 1)

Yes, exactly. When you apply it to a DataFrame with axis = 1, the function receives an entire row, which is why you need to specify the column name.
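
To make the difference concrete, here is a small sketch. The toy DataFrame and the describe helper are hypothetical and only serve to show what each kind of apply passes to the function.

```python
import pandas as pd

# Toy frame; the column names mirror the thread, the values are made up.
df = pd.DataFrame({'Name': ['A', 'B'], 'PTS': [15, 120]})

def describe(arg):
    # Report whether apply handed over a whole row (a Series) or one value.
    return 'row' if isinstance(arg, pd.Series) else 'single value'

per_row = df.apply(describe, axis=1)    # DataFrame.apply: one call per row
per_value = df['PTS'].apply(describe)   # Series.apply: one call per element

print(per_row.tolist())    # ['row', 'row']
print(per_value.tolist())  # ['single value', 'single value']
```

Indexing with a column name such as arg['PTS'] only makes sense in the first case, which matches the 'int' object is not subscriptable error seen earlier in the thread.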
