Duplicate Values for 'dissatisfied' (not the pandas bug!)

I was about to start the Initial Analysis step of this guided project.

However, when I checked the values of the dissatisfied column, I saw that there were duplicate True and False lines. I’ve searched for similar questions, but it seems like others are just having True instead of NaN which is not what’s happening for me.

combined_updated['dissatisfied'].value_counts(dropna=False)

What I expected to happen was these 3 values with counts:
False
True
NaN

What actually happened:

False    277
False    162
True     149
True      55
NaN        8
Name: dissatisfied, dtype: int64

It seems that it’s counting the TAFE and DETE dissatisfied values separately. I’m not sure why that would be, since I used the same function on both.

def update_vals(val):
    if pd.isnull(val):
        return np.nan
    elif val == '-':
        return 'False'
    else:
        return 'True'

Here’s the code for TAFE :

tafe_resignations['dissatisfied'] = tafe_resignations[['Contributing Factors. Dissatisfaction', 
                                                       'Contributing Factors. Job Dissatisfaction']].applymap(update_vals).any(axis=1, skipna=False)    

tafe_resignations_up = tafe_resignations.copy()
tafe_resignations_up['dissatisfied'].value_counts(dropna=False)

And the output:

False    277
True      55
NaN        8
Name: dissatisfied, dtype: int64

And here’s code for DETE:

dete_resignations['dissatisfied'] = dete_resignations[['job_dissatisfaction',
    'dissatisfaction_with_the_department',
    'physical_work_environment',
    'lack_of_recognition',
    'lack_of_job_security',
    'work_location',
    'employment_conditions',
    'work_life_balance',
    'workload']].any(axis=1, skipna=False)

dete_resignations_up = dete_resignations.copy()
dete_resignations_up['dissatisfied'].value_counts(dropna=False)

And output:

False    162
True     149
Name: dissatisfied, dtype: int64

I can’t figure out why they’re not being combined correctly. In case it’s relevant, here’s the code I used to combine the datasets:

combined = pd.concat([dete_resignations_up, tafe_resignations_up], ignore_index=True)

Any help would be very much appreciated!

Hi @amy.basque and welcome to the community!

I’m not 100% sure but I believe you’re getting these duplicates because your function (update_vals(val)) is returning string values rather than the booleans True and False. Try removing the quotes from your return lines inside your function definition.

Are you sure you’re using the function on both datasets? It looks like you only used it on TAFE whereas with DETE you just used the any() function. This would explain why you’re getting the doubling up of True/False lines since TAFE is returning string values (due to the quotation marks around True/False within your function) while DETE is giving boolean True/False values from the any() function. Although they look the same when printed using value_counts(), a boolean True and a string "True" are fundamentally different and therefore will be printed on separate lines.

Let me know if this is too confusing or doesn’t explain your results and we can try something else.

EDIT: I just noticed that your results for TAFE (ie False: 277, True: 55) corresponds only to the 'Contributing Factors. Dissatisfaction' column. Hopefully after you implement the above changes to your function you’ll get the desired result (ie False: 241, True: 91, NaN: 8). I suspect that will be the case because right now, you’re asking any() to evaluate on string values instead of on booleans.

1 Like

Thank you! Yes, those were the two issues I had. I’m now getting exactly the results I’d expected. I really appreciate the speedy and thorough response.