Double `True` counts and Pivot Table for Guided Project: Clean and Analyze Employee Exit Surveys

Hi everyone!

I actually have two concerns regarding the Guided Project: Clean and Analyze Employee Exit Surveys

1st Concern

The first has to do with the code in this mission screen for Identifying Dissatisfied Employees. I went through @golden5mk’s post on the df.any() method. Contrary to what he mentioned about the df.applymap() affecting the behavior of the df.any(), the case isn’t the same for me. At least not entirely.

When I input the following code:

def update_vals(val):
    if val == '-':
        return False
    elif pd.isnull(val):
        return np.nan
    else:        
        return True
    
tafe_cols = ['Contributing Factors. Dissatisfaction', 'Contributing Factors. Job Dissatisfaction']
tafe_resignations['dissatisfied'] = tafe_resignations[tafe_cols].applymap(update_vals).any(axis=1, skipna=False)
tafe_resignations['dissatisfied'].value_counts(dropna=False)

The following output gets displayed:

False    241
True      91
True       8
Name: dissatisfied, dtype: int64

I get a double count for True where the second one should be NaN. So the df.applymap() does seem to have an effect but not as what @golden5mk mentioned. The corresponding values are the same as those in the solution notebook. I read somewhere that I should check my version of Python which happens to be the latest at 3.8.5.

I just want to know why the double count is happening and how I can address it.

2nd Concern

The second has to do with creating a pivot table for the initial analysis. There are a few posts raising similar concerns. In my case, I dropped all rows which had NaN under the dissatisfied column so that when I input the following code:

combined_updated['dissatisfied'].value_counts(dropna=False)

The output is:

False    372
True     226
Name: dissatisfied, dtype: int64

However, when I try creating a pivot table,

dissatisfied_service = combined_updated.pivot_table(values='dissatisfied',index='service_cat',margins=True)

I get a No numeric types to aggregate error. It isn’t until I use the df.fillna() method that my problem gets solved.

combined_updated['dissatisfied'] = combined_updated['dissatisfied'].fillna(False)

It’s very strange since, as can be seen above, there are no NaN values to be replaced. There are only 372 False values and 226 True values. Also, when I input the df.value_counts() method just like I did above after using the df.fillna() method, the output is exactly the same for the False and True values. It’s like nothing happened but I’m somehow able to create a pivot table after.

I apologize for the lengthy post but this has been bothering me for quite some time already.

Here’s a copy of my project so you guys can check out the issues I just mentioned. Employee Exit Survey.ipynb (173.2 KB)

Hoping to get a response and thanks in advance for the help!

Click here to view the jupyter notebook file in a new tab

Welcome to the DataQuest Community @AJBrillantes

Please don’t be. This is one of the most elaborate and detailed questions I have come across.

I have taken the liberty to directly work on your notebook. I will cite the code cells directly here. These are my observations:

  • 1st query - code cell 33 - your code is working as desired on my laptop. False = 241, True = 91, NaN = 8.

  • for 2nd query please check code cells from 45 to 47 - I added aggfunc(np.sum) and it aggregates the sum of True counts. maybe there’s is nothing to average or mean out for the dissatisfied column as it’s only True and False (1 & 0), so it gives an aggregation error.

  • I also got keyerror in code cell 43 but that’s unrelated to this issue at handEmployee Exit Survey.ipynb (134.9 KB)

I also added encoding while reading the file(s), in your code in code cell 1.

hey @AJBrillantes and @golden5mk

I checked the Github repo for Pandas. A likely reason for this behavior of dataframe.any() method may not be dataframe.applymap() method, but a bug related to bool_only parameter which takes the default value None in dataframe.any() method. The bug is that, this method is ignoring this parameter altogether.

You can check out the GitHub issue discussion here:

  • open issue which has been escalated or highlighted to the former issue here

I experimented with various combinations in the attached Any_Experiments.ipynb file.

Kindly extend the scenarios that you can think of. I am contemplating whether to add this as a comment to the open issue at GitHub.

Please note I am not saying that is the cause - I am just saying it could be.

@Bruno/ @hanqi any inputs you would like to add. (also about this :point_up:). Thanks. Any_Experiments.ipynb (16.5 KB)

Click here to view the jupyter notebook file in a new tab

Like Rucha, I’m unable to reproduce the first issue. Any chance you can do a screen cast of this behavior happening to you?

The second issue happens because prior to filling in the non-existing missing values, the type of that column is object. It happens to be the case that pandas.DataFrame.fillna modifies the type of the column, as per the downcast parameter:

image

I’m a little confused. Is this about a different topic (specifically this one)?

hi @Bruno

Yes. I am relating the odd behavior of df.any() method with @golden5mk’s issue

In simplest terms, the documentation for df.any() method cites, that if an entire row/ column is having np.nan as values then it would result in True as np.nan != 0. But then we are not getting that True. For example, the last row in the df2 dataframe in my attached notebook >> Any_Experiments.ipynb.

snap shot for it. code cell [5] image

I might be mistaken, and if I am then I want to know why and where am I going wrong here.

Hello @AJBrillantes,

1st concern, I also came across the double count for True while working on the project not until I upgraded my pandas to the latest version.

I think it is a bug in pandas 1.0.1. Upgrade to the latest version (1.0.5).

2nd concern, as @Rucha explained in his reply,

include this parameter aggfunc=lambda x: np.sum(x)/len(x) to Calculate the percentage of employees who resigned due to dissatisfaction in each category as instructed in the guided project

dissatisfied_service = combined_updated.pivot_table(values='dissatisfied',index='service_cat',margins=True, aggfunc= lambda x: np.sum(x)/len(x))

@AJBrillantes In view of this, can you confirm what your pandas version is?

You can do this by printing pandas.__version__ (or pd.__version__ if you use an alias).

I’m on version 1.0.3 and I can’t reproduce this behavior.

1 Like

I think either the documentation is poor or this is a bug.

What I’m seeing is that what is returned is such that if you change the type to bool, then it will give the desired result. It feels like they forgot something like result.astype(bool) before returning.

I suggest you open an issue here and see what they developers have to say.

1 Like

hey @Bruno

That is what I thought too. My second post in this topic, details the already created bug report/ issue at GitHub repo for Pandas.

I wanted your help in sort of confirming that this behavior can be classified as a bug/ issue. I guess I will add this to the Pandas bug list!

@AJBrillantes
My little brain’s best suggestion for you right now is:

  1. don’t use any() method, we don’t know how exactly it will behave even if we add bool_only = False parameter.

  2. take one row at a time from the dataframe (iterrows() or apply() or whichever works best for you)

  3. in Each of these rows:
    - if count of True >= 1, mark dissatisfied col as True,
    - (if you wish to) elif check np.nan >= 1, mark them as True/ False
    - else mark False. (employee left for any other reason but dissatisfaction)

In short, try to recreate the functionality of the df.any() method using a custom function.
And please share that function, I will copy-paste in my project too :stuck_out_tongue_winking_eye:!

1 Like

Hello everyone! @Rucha @Bruno @doyinsolamiolaoye

Thank you so much for your responses. It took me a while to get back to this. I’ve read through all your replies and here’s the situation on my part so far:

1st Concern

I tried doing @doyinsolamiolaoye’s suggestion to update pandas to the latest version and it seems that a new problem surfaced for me :sweat_smile: Running python --version and pip freeze through the Command Prompt in Windows 10, I get the following outputs, respectively (screenshots):

image
image

But when I run pandas.show_versions() in Jupyter Notebook, I get the following:

image

They’re showing different versions for python and pandas. This is probably why I’m still getting Double True Counts even if I’ve already updated to the latest pandas version (1.1.0)

Though I’ll try @rucha’s suggestion on creating a custom function as well. If ever the Double True Count behavior is a bug that has to do with the pandas version, then I guess I’ll have to address this new problem first. Should I create a new post for this?

2nd Concern

@doyinsolamiolaoye’s suggestion to use a lambda function works perfectly!

@Rucha as for this:

I just followed the instruction that can be found in this mission screen

:

So I don’t think an aggregation error should occur? Or maybe the instructions need to be changed to @doyinsolamiolaoye’s suggestion? Not really sure.

@Bruno also mentioned the downcast parameter.

But shouldn’t the parameter only carry out its behavior if I pass in an argument? In my project, I didn’t pass in any argument.

Thanks much and I’m happy with the engagement in this post :grin:

1 Like

Please wait @doyinsolamiolaoye I will reply to this.

1 Like

hi @AJBrillantes

I was re-working on this project already, but since you posted this issue I had to now double-check on almost everything. So Thanks to you too :+1:

I will mix both your concerns here. See the problem is we don’t know how df.any() is behaving and in your case we have it chained right now with the df.applymap() method (which is working fine).

In the any_experiments.ipynb file, I have not used the df.applymap() method at all and still, df.any() is giving unpredictable outputs. So we can red flag this method alone.

It’s a lame guess but it might be that @Bruno mentioned the downcast parameter, as by that time he hadn’t fully tested the issue with df.any() method. He did so in parallel or during my interaction with him and after that, he did confirm my doubts about the bug (our later banter in the post).

I have raised the df.any() method’s odd behavior as a bug with the python-dev/pandas team. Let’s see what they say.

Moving on to your project. I created a function to replicate the df.any() functionality. I need/ want you to perform these tests in your notebook and share your results here.
You may have to change the dataframe or column names according to your project.

Code for replicating df.any() behavior (after df.applymap() method assuming you have np.nan as np.nan):

# dissatisfied columns
tafe_diss_cols = ["Contributing Factors. Dissatisfaction", "Contributing Factors. Job Dissatisfaction"]

dete_diss_cols = ["job_dissatisfaction", "dissatisfaction_with_the_department", "physical_work_environment", 
                  "lack_of_recognition", "lack_of_job_security", "work_location", "employment_conditions", 
                  "work_life_balance", "workload" ]

# dataframe.any() method gives unpredictable outputs for empty rows and columns. 
# function classify_exit replicates the behavior of any() method. np.nan values have been classified as False
def classify_exit(row):
    elements = len(row)
    # atleast one True
    if [each for each in row].count(True) >= 1:
        return True
    # everything is False
    elif [each for each in row].count(False) == elements:
        return False
    # this you will have to change according to test scenario
    else:
        return np.nan

# to avoid warning
tafe_resignations = tafe_resignations.copy()
dete_resignations = dete_resignations.copy()

# apply classify_exit function to classify exits due to dissatisfaction from others
tafe_resignations["dissatisfied"] = tafe_resignations.loc[:, tafe_diss_cols].apply(classify_exit, axis = 1)
dete_resignations["dissatisfied"] = dete_resignations.loc[:, dete_diss_cols].apply(classify_exit, axis = 1)

# display value_counts
print("Resigned due to dissatisfaction TAFE:")
print(tafe_resignations["dissatisfied"].value_counts(dropna=False),"\n")

print("Resigned due to dissatisfaction DETE:")
print(dete_resignations["dissatisfied"].value_counts(dropna=False),"\n")

Final output I want to know:

dete_tafe_combined.pivot_table(values='dissatisfied',index='experience_level',margins=True)


Test1: keep the else: return np.nan as is, and try the final code, it should give you aggregation error.

Test2: keep the else: return np.nan as is, use a fillna method like so:
# dete_tafe_combined["dissatisfied"].fillna(False, inplace = True)

and then use the final code. this should give you decimal values.

Test3: change else: return False, then try the final code directly (no need for fillna()) This should also give you values in decimals.

In all the 3 cases, I have not used aggfunc or lambda.
So let’s see how your projects shows the results now.

Yes. It didn’t carry out this behavior, so it kept being an object (which isn’t numeric, hence the error).

My guess is that you have two different installations. One is the one from Anaconda, the other — I suspect — is from pip. How did you update pandas? I suggest running conda update --all in the command line and trying again.

2 Likes

ohkay. This is crazy. If version is the actual problem. mine looks like this:

image

Hello @AJBrillantes ,

As @Bruno advised,

Since you are running the codes in your Jupyter notebook, you have to update the pandas version in the Jupyter Notebook.

You should get the desired result afterwards.

Cheers.

Hello! @Rucha @Bruno @doyinsolamiolaoye

@Bruno I updated pandas through the Windows command prompt. I tried your suggestion below and updated in Jupyter Notebook as @doyinsolamiolaoye said:

Now the output is as expected! I get the following already:

False    241
True      91
NaN        8

So thank you very much!!!

@Rucha I carried out your tests and the results are as expected :slight_smile:

Test1: I got an aggregation error.
Test2 and Test3: I was able to aggregate properly

I’m hoping that the bug issue gets sorted out!

I want to mark multiple replies as Solutions but I can’t seem to :sweat_smile: In any case, I greatly appreciate all of your help! Thank you very much!

2 Likes

okay so the two True’s are pandas version issue. But the np.nan is still .any() issue.

2 Likes