Guided Project - Clean And Analyze Employee Exit Surveys

I have a question about some code in one of the guided projects. In the Data Scientist in Python path, Step 2, Course 4, Guided Project: Clean And Analyze Employee Exit Surveys, page 7 discusses the DataFrame.any() method. It says that the method evaluates the selected columns and returns True if any element in any of those columns is True, False if none of the elements in those columns is True, and NaN if the values are NaN, as shown in the diagram. The code the content recommended was the following:

df.any(axis=1, skipna=False)

But when I checked the pandas documentation for the DataFrame.any() method for clarity, what I read tells me that is not how the code should work. This is what the parameters section of that page says:

skipna : bool, default True

Exclude NA/null values. If the entire row/column is NA and skipna is True, then the result will be False, as for an empty row/column. If skipna is False, then NA are treated as True, because these are not equal to zero.

This would mean that with skipna=False, any NaN element in the selected columns is treated as True, so the result should be True, not NaN. Simply put, there should never be a NaN result from df.any(axis=1, skipna=False).

To check, I went ahead and made a sample dataframe to see whether this was the case (please run the code below in Jupyter to see the output):

import pandas as pd
import numpy as np

# Row 0 holds ordinary numbers; row 1 is all NaN
df = pd.DataFrame({"A": [1, np.nan], "B": [0, np.nan], "C": [0, np.nan]})
print(df)

# Reduce across the three columns without skipping NaN values
df['D'] = df[['A', 'B', 'C']].any(axis=1, skipna=False)
print()
print(df['D'])  # Row 1 comes out True, not NaN

Row 1 of the dataframe, which was my test case, indeed resulted in True instead of NaN, just as the documentation suggested. But when I checked the solution, the solution code does work as the Dataquest content suggests. The following is the solution code:

# Update the values in the contributing factors columns to be either True, False, or NaN
def update_vals(x):
    if x == '-':
        return False
    elif pd.isnull(x):
        return np.nan
    else:
        return True
tafe_resignations['dissatisfied'] = tafe_resignations[
    ['Contributing Factors. Dissatisfaction',
     'Contributing Factors. Job Dissatisfaction']
].applymap(update_vals).any(axis=1, skipna=False)
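
To make sure I understood what update_vals does cell by cell, I ran a quick sanity check (the 'Job Dissatisfaction' string below is just a made-up sample value; the function treats any non-'-', non-null value the same way):

import numpy as np
# using the update_vals function defined above
print(update_vals('-'))                    # False: '-' means the factor did not apply
print(update_vals(np.nan))                 # nan: a missing value stays missing
print(update_vals('Job Dissatisfaction'))  # True: any other value counts as a factor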

When I print tafe_resignations['dissatisfied'], it does show NaN results for the rows where the selected columns were NaN.
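
For reference, this is how I inspected it (value_counts with dropna=False includes the NaN rows in the tally; I am not reproducing the exact counts here since they depend on the dataset):

print(tafe_resignations['dissatisfied'].value_counts(dropna=False))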

The difference between the two seems to lie in the df.applymap(update_vals) call, so I did a final check on my sample dataframe above by applying applymap(update_vals) before any() on my columns, as the Dataquest solution does (please run in Jupyter):

import pandas as pd
import numpy as np

# Same sample frame as before: row 0 is numeric, row 1 is all NaN
df = pd.DataFrame({"A": [1, np.nan], "B": [0, np.nan], "C": [0, np.nan]})
print(df)
print()

def update_vals(x):
    if x == '-':
        return False
    elif pd.isnull(x):
        return np.nan
    else:
        return True

# Map each cell to True/False/NaN first, then reduce across the columns
df['D'] = df[['A', 'B', 'C']].applymap(update_vals).any(axis=1, skipna=False)
print(df['D'])  # Row 1 now comes out NaN

The result showed NaN on Row 1, meaning the solution code works as the content describes. The difference indeed lies in adding the applymap(update_vals) step.
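
One more observation that may or may not be relevant (I have not confirmed this is the cause, so take it as a data point rather than an explanation): applymap(update_vals) changes the dtypes of the columns. In my sample dataframe the original columns are float64, but after the mapping each column holds a mix of booleans and np.nan, which pandas stores as object:

# Continuing from the sample dataframe above
print(df[['A', 'B', 'C']].dtypes)                        # float64 for all three columns
print(df[['A', 'B', 'C']].applymap(update_vals).dtypes)  # object for all three columns

Could it be that any() handles NaN differently on object columns than on float64 columns?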

My question is: shouldn’t there be no difference? After all, all that applymap(update_vals) does to the NaN values in the selected columns is confirm that each value is null and return it as NaN (np.nan). Why does the applymap call change the behavior of the df.any() method?
