Trying to sort years of service into categories for the Analyzing Employee Exit Surveys project

Screen Link:
https://app.dataquest.io/c/60/m/348/guided-project%3A-clean-and-analyze-employee-exit-surveys/7/identify-dissatisfied-employees

My Code:

dete_resignations.loc[dete_resignations["institute_service"] <= 1, "institute_service"] = 'Less than 1 year' 
dete_resignations.loc[dete_resignations["institute_service"] == float & dete_resignations["institute_service"] >= 1.0 & dete_resignations["institute_service"] <= 2.0, "institute_service"] = '1-2'
dete_resignations.loc[dete_resignations["institute_service"] >= 3 or <= 4, "institute_service"] = '3-4'
dete_resignations.loc[dete_resignations["institute_service"] >= 5 or <= 6, "institute_service"] = '5-6'
dete_resignations.loc[dete_resignations["institute_service"] = 7 & <= 10, "institute_service"] = '7-10'
dete_resignations.loc[dete_resignations["institute_service"] >= 11 & <= 20, "institute_service"] = '11-20'
dete_resignations.loc[dete_resignations["institute_service"] >= 20, "institute_service"] = 'More than 20 years'

What I expected to happen: I expected the number values from the DETE survey to be assigned to the respective categories

What actually happened: I get a TypeError message

TypeError: '<=' not supported between instances of 'str' and 'int'

Before this block of code, the values in dete_resignations['institute_service'] were floats. Any ideas on how to solve this problem would be much appreciated.

Hi @FCPen and welcome to the community!

I’m not sure whether you already know why you’re getting this error message, so I apologize for this next bit of text if you do!

The reason for the error is clear: Python does not know how to compare a string with an integer using <=. But as you said, prior to this code block the values were all floats, so where did these strings come from?! Well… they came from this block of code! The very first line writes a string value (namely 'Less than 1 year') into the institute_service column for every row that matches its filter. Each later line then compares that same column against a number, and by that point the column holds a mix of numbers and strings.
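Here is a minimal sketch of that sequence on made-up data, in case it helps to see it in isolation. The Series and its values are hypothetical stand-ins for the DETE column, and it is cast to object up front so the example behaves the same however a given pandas version handles writing a string into a float column.

```python
import pandas as pd

# Hypothetical stand-in for dete_resignations["institute_service"].
# Cast to object so the column is allowed to hold both numbers and strings.
years = pd.Series([0.0, 1.0, 3.0, 7.0, 25.0], name="institute_service").astype(object)

# Step 1 works: every value is still a number when the comparison runs.
years.loc[years <= 1] = "Less than 1 year"

# Step 2 fails: the comparison now meets the strings written in step 1.
try:
    years.loc[years >= 3] = "3-4"
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

The first comparison succeeds and the second one raises, which matches the pattern in the original post: the error comes from a later line, not from the line that introduced the strings.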

A quick fix would be to assign the string values to a different column than the institute_service column so that the integer values remain intact in that column. Try assigning your string values (e.g. '3-4', 'More than 20 years', '7-10', …) to a new column like service_label.

Thank you very much @mathmike314 ; I’ll give that a try.

Hi,

I tried creating a separate column and assigning the categories to that, and that’s somewhat helped, but it still isn’t coming out with what I would expect.

dete_resignations['service_label'] = dete_resignations['institute_service'].astype(str)

dete_resignations.loc[dete_resignations["service_label"] <= '1.0', "service_label"] = 'Less than 1 year' 
dete_resignations.loc[(dete_resignations["service_label"] >= '1.0') & (dete_resignations["service_label"] <= '2.0'), "service_label"] = '1-2'
dete_resignations.loc[(dete_resignations["service_label"] >= '3.0') & (dete_resignations["service_label"] <= '4.0'), "service_label"] = '3-4'
dete_resignations.loc[(dete_resignations["service_label"] >= '5.0') & (dete_resignations["service_label"] <= '6.0'), "service_label"] = '5-6'
dete_resignations.loc[(dete_resignations["service_label"] >= '7.0') & (dete_resignations["service_label"] <= '10.0'), "service_label"] = '7-10'
dete_resignations.loc[(dete_resignations["service_label"] >= '11.0') & (dete_resignations["service_label"] <= '20.0'), "service_label"] = '11-20'
dete_resignations.loc[dete_resignations["service_label"] > '20.0', "service_label"] = 'More than 20 years'
dete_resignations.loc[dete_resignations["service_label"] == np.nan, "service_label"] = np.nan

dete_resignations['service_label'].value_counts().sort_index(ascending=True)

What I expected to happen:

Less than 1 year: 20
1-2: 36
3-4: 36
5-6: 40
7-10: 41
11-20: 57
More than 20 years: 43
NaN: 38

What I got:

1-2 70
11-20 7
More than 20 years 234

I suspect it’s because with strings, ‘19.0’ is less than ‘2.0’. Any help would be much appreciated.

Unfortunately, the same problem has carried over to the new approach: you are still using one column both to compare values and to store the new labels.

Yes, that’s exactly what’s happening! That’s why we want to use the institute_service column for comparisons and our new column to store our string labels so that we don’t trip over our own feet.
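A small plain-Python sketch of why string comparisons give surprising counts here. The sample values are made up for illustration; the point is only that strings are compared character by character, not by numeric value.

```python
# Strings compare character by character, so "1" < "2" decides the result
# before the rest of "19.0" is even looked at.
string_result = "19.0" < "2.0"   # True
number_result = 19.0 < 2.0       # False

# Sorting strings therefore puts "10.0" and "19.0" ahead of "2.0".
sorted_strings = sorted(["2.0", "19.0", "10.0", "3.0"])

# A range such as '7.0' to '10.0' can never match anything,
# because "7.0" is already greater than "10.0" as a string.
candidates = ["7.0", "8.0", "9.0", "10.0"]
in_range = [s for s in candidates if "7.0" <= s <= "10.0"]

print(string_result, number_result, sorted_strings, in_range)
```

This is consistent with some categories not showing up at all in the counts above: their string ranges cannot be satisfied.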

To sum up, try coding it so that you are using the institute_service column to compare values (i.e. <= 1 etc…) and then store the labels (i.e. ‘1-2’ or ‘7-10’) in the new column service_label.
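As a rough sketch of that idea, assuming a small made-up DataFrame in place of dete_resignations: the comparisons only ever read institute_service, and the labels only ever go into service_label. Using < 1 for the first category is my own choice so that a value of exactly 1 lands in a single bucket; adjust the boundaries to taste.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for dete_resignations.
df = pd.DataFrame(
    {"institute_service": [0.0, 1.0, 2.0, 3.0, 5.0, 7.0, 10.0, 11.0, 20.0, 25.0, np.nan]}
)

# Numeric column used for every comparison; it is never written to.
years = df["institute_service"]

# Separate object column that receives the string labels.
df["service_label"] = pd.Series(np.nan, index=df.index, dtype="object")

df.loc[years < 1, "service_label"] = "Less than 1 year"
df.loc[(years >= 1) & (years <= 2), "service_label"] = "1-2"
df.loc[(years >= 3) & (years <= 4), "service_label"] = "3-4"
df.loc[(years >= 5) & (years <= 6), "service_label"] = "5-6"
df.loc[(years >= 7) & (years <= 10), "service_label"] = "7-10"
df.loc[(years >= 11) & (years <= 20), "service_label"] = "11-20"
df.loc[years > 20, "service_label"] = "More than 20 years"

# Rows where institute_service is NaN fail every comparison,
# so their label simply stays NaN without a dedicated line.
print(df["service_label"].value_counts(dropna=False))
```

Note that no line for missing values is needed: a check like == np.nan never matches anything, and the NaN rows are left untouched anyway.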

I’m not sure how exactly to do that; do you have any tips?

Thank you for your help

Hi again,

I’m not sure what’s wrong with this code; if anyone could point out the problem, that would be much appreciated.

dete_resignations['service_label'] = 0

dete_resignations.loc[dete_resignations["institute_service"] <= 1.0, "service_label"] = 'Less than 1 year' 
dete_resignations.loc[(dete_resignations["institute_service"] >= 1.0) & (dete_resignations["institute_service"] <= 2.0), "service_label"] = '1-2'
dete_resignations.loc[(dete_resignations["institute_service"] >= 3.0) & (dete_resignations["institute_service"] <= 4.0), "service_label"] = '3-4'
dete_resignations.loc[(dete_resignations["institute_service"] >= 5.0) & (dete_resignations["institute_service"] <= 6.0), "service_label"] = '5-6'
dete_resignations.loc[(dete_resignations["institute_service"] >= 7.0) & (dete_resignations["institute_service"] <= 10.0), "service_label"] = '7-10'
dete_resignations.loc[(dete_resignations["institute_service"] >= 11.0) & (dete_resignations["institute_service"] <= 20.0), "service_label"] = '11-20'
dete_resignations.loc[dete_resignations["institute_service"] > 20.0, "service_label"] = 'More than 20 years'
dete_resignations.loc[dete_resignations["institute_service"] == np.nan, "service_label"] = np.nan

dete_resignations['service_label'].value_counts().sort_index(ascending=True)

The error message that I got was TypeError: ‘<’ not supported between instances of ‘str’ and ‘int’.

I’d also be grateful if someone could suggest an elegant way of achieving what I’m after, i.e. putting the number of years worked into categories like the ones in the institute_service column of the TAFE survey.

If you’re still getting this error it’s probably because dete_resignations["institute_service"] still contains string values. Try running all cells above the one you’re working on to make sure this column is filled with numerical values only.
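Two quick checks that may help narrow it down, sketched on made-up data. The first looks for non-numeric entries in the comparison column. The second is an assumption on my part rather than a confirmed diagnosis: a label column that mixes an integer such as 0 with string labels can trigger the same kind of TypeError when value_counts() is followed by sort_index().

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Check 1: hypothetical column that was partly overwritten with a label.
service = pd.Series([1.0, "Less than 1 year", 7.0], dtype=object)
column_is_numeric = is_numeric_dtype(service)

# Entries that cannot be read as numbers (genuine NaN values are excluded).
as_numbers = pd.to_numeric(service, errors="coerce")
offenders = service[as_numbers.isna() & service.notna()]

# Check 2: hypothetical label column holding both the int 0 and strings.
labels = pd.Series([0, "1-2", "3-4", "1-2"], dtype=object)
counts = labels.value_counts()
try:
    counts.sort_index()
    sort_error = None
except TypeError as exc:
    sort_error = str(exc)

print(column_is_numeric, offenders.tolist(), sort_error)
```

If the second check is what is happening, starting the label column from missing values instead of 0 keeps the labels all of one type.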

The way I remember doing this task was to use a helper function and then apply it to the column. Here is what that looks like in the Solution Notebook for this project:

# Convert years of service to categories
def transform_service(val):
    if val >= 11:
        return "Veteran"
    elif 7 <= val < 11:
        return "Established"
    elif 3 <= val < 7:
        return "Experienced"
    elif pd.isnull(val):
        return np.nan
    else:
        return "New"
combined_updated['service_cat'] = combined_updated['institute_service_up'].apply(transform_service)
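Another option, if you want to keep range-style labels like the ones in the TAFE survey, is pd.cut. This is only a sketch and not what the Solution Notebook does: the sample values, bin edges and label names below are my own assumptions, and right=False makes each bin include its lower edge and exclude its upper edge.

```python
import numpy as np
import pandas as pd

# Hypothetical years-of-service values, including a missing one.
years = pd.Series([0.0, 1.0, 2.0, 3.0, 6.0, 7.0, 10.0, 11.0, 20.0, 25.0, np.nan])

# Bin edges chosen so that whole-year values fall into the intended ranges.
bins = [-np.inf, 1, 3, 5, 7, 11, 21, np.inf]
labels = [
    "Less than 1 year", "1-2", "3-4", "5-6",
    "7-10", "11-20", "More than 20 years",
]

# Each value gets the label of the bin it falls into; NaN stays NaN.
service_label = pd.cut(years, bins=bins, labels=labels, right=False)

# The result is an ordered categorical, so sorting follows the bin order.
print(service_label.value_counts().sort_index())
```

Because the comparison values and the labels live in separate objects, the mixed-type problem from earlier in the thread does not come up.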

Hi @mathmike314,

Thank you for your help and the code you’ve written there. As of now, the code I wrote above minus the

line seems to work.
