Weird value_counts() behavior in Guided Project: Clean And Analyze Employee Exit Surveys

I’m working on Guided Project: Clean And Analyze Employee Exit Surveys.

I’m getting weird behavior with the value_counts() function. After I added values to the (new) dissatisfied column by converting the data in other columns:

def update_col(x):
    if x == '-':
        return False
    elif pd.isnull(x):
        return np.nan
    else:
        return True
    
tafe_resignations['dissatisfied'] = tafe_resignations[['Contributing Factors. Dissatisfaction', 
                                                       'Contributing Factors. Job Dissatisfaction']].applymap(update_col).any(1, skipna=False)


tafe_resignations_up = tafe_resignations.copy()

dete_resignations['dissatisfied'] = dete_resignations[['job_dissatisfaction',
                                                       'dissatisfaction_with_the_department', 
                                                       'physical_work_environment',
                                                       'lack_of_job_security', 
                                                       'work_location',
                                                       'employment_conditions', 
                                                       'work_life_balance',
                                                       'workload']].any(1, skipna=False)
dete_resignations_up = dete_resignations.copy()

value_counts() then gives me a weird result.

tafe_resignations_up['dissatisfied'].value_counts(dropna=False)

returns

False    241
True      91
True       8
Name: dissatisfied, dtype: int64

But when I examine the actual data, the NaN values are still there. They just seem to be reported by value_counts() as a second kind of True?

I am running Jupyter Notebook locally with NumPy 1.18.1 (vs 1.14.2 in DataQuest) and pandas 1.0.1 (vs 0.22.0), but I’ve been unable to find any documentation pointing to why this might be happening.

Any help would be greatly appreciated!

Hello @ebuschang

Would you be able to share the screen link so we can better understand your problem? Also, please try to follow the technical guidelines when posting a question; that will help you get a faster response.

Thanks
Best
K!

Screen link meaning a screenshot? The technical guidelines specifically said not to include screenshots which was why I tried to use markdown for my code. Since I’m running Jupyter Notebook locally I can’t link to my actual notebook anywhere (please tell me the best way to do that if I can!).

My code

Screenshot of 8 values of NaN marked as True by value_counts()

Hi there! By screen link, I believe @prasadkalyan05 meant a URL to the mission you’re working on, so we don’t have to go searching for it.

You can download an .ipynb copy of your notebook and upload it along with your posts here so we can try to reproduce what you’re running into on our end.

From what I can see of the double True values, it may be that the dissatisfied column is not a boolean dtype. Try using the df.info() method to confirm the dtype of the column – if it’s an object dtype, it’s likely that there’s some trailing whitespace on some of the True values.
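A quick way to run that check, sketched on a made-up Series rather than the project data (the name `flags` is hypothetical): look at the dtype, then count the Python type of each value. Stray strings such as `'True '` would show up as `str`; note that a missing value alone is also enough to make the column object dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 'dissatisfied' column: bools plus one missing value.
flags = pd.Series([True, False, np.nan, True])

# A Series mixing bools and NaN cannot be dtype bool, so pandas stores it as object.
print(flags.dtype)

# Count the Python type of each element; stray strings would appear as 'str'.
type_counts = flags.map(lambda v: type(v).__name__).value_counts()
print(type_counts)
```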

It seems not to be an issue with the project I’m working on, but something I’m missing about how value_counts() works. I’ve played with it a bit more.

  1. Create a series that is of type bool
s1 = pd.Series([True, False, False])
s1

returns

0     True
1    False
2    False
dtype: bool
  2. I also created a series that has bools and a NaN
s2 = pd.Series([True, False, np.nan])
s2

this is of dtype object, not bool. There are no trailing spaces anywhere as I did not use strings.

0     True
1    False
2      NaN
dtype: object
  3. Perform series.value_counts(dropna=False) on s1
s1.value_counts(dropna=False)

returns the following as expected

False    2
True     1
dtype: int64
  4. Perform series.value_counts(dropna=False) on s2
s2.value_counts(dropna=False)

but s2 returns a second True value

True     1
False    1
True     1
dtype: int64

What am I missing? I would expect the last output to be

True     1
False    1
NaN      1
dtype: int64

I know NaN gets evaluated as True, which I assume is where this is stemming from, but why is it counted as a separate row of True? And why is it even evaluated as a bool for a series of dtype object?

numpy 1.18.1 pandas 1.0.1 python 3.7.6
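For reference, the truthiness claim above can be checked in plain Python with NumPy. This is only a sketch of how NaN behaves as a scalar; on its own it does not explain the value_counts() output.

```python
import numpy as np

# NaN is a non-zero float, so it is truthy when coerced to bool.
nan_is_truthy = bool(np.nan)

# NaN is not equal to True, and it is not even equal to itself.
nan_equals_true = (np.nan == True)
nan_equals_self = (np.nan == np.nan)

print(nan_is_truthy, nan_equals_true, nan_equals_self)  # True False False
```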

So I finally figured it out. It’s actually a bug in pandas 1.0.1. I upgraded to the latest version (1.0.5) and that resolved the issue. Now

s2.value_counts(dropna=False)

returns

True     1
False    1
NaN      1
dtype: int64

as expected!

Thanks for the help!

Hi! It seems that I’m facing the same issue. Could you please share the link where you found the solution?

The problem is the version of pandas you are using. If you upgrade to the current version of pandas (1.0.5) the issue should resolve.
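A minimal sketch for checking which pandas version is installed and, as a cross-check that does not rely on value_counts(), counting each case directly. The Series `s2` mirrors the one earlier in this thread; the `n_*` names are just for illustration.

```python
import numpy as np
import pandas as pd

# Show the installed pandas version (the thread reports the bug on 1.0.1).
print(pd.__version__)

s2 = pd.Series([True, False, np.nan])

# Count missing values, True values, and False values separately.
# NaN compares unequal to both True and False, so it lands in neither count.
n_missing = int(s2.isna().sum())
n_true = int((s2 == True).sum())
n_false = int((s2 == False).sum())

print(n_true, n_false, n_missing)  # 1 1 1
```

Upgrading is typically done with `pip install --upgrade pandas` or `conda update pandas`, depending on how pandas was installed.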

Thanks! It took me a while to update it (dunno why), but it works now.