Guided project: employee exit survey question #2

Sorry, I have already asked about this project and still have questions, as I am in the middle of completing it.
Question #1:
Is there something wrong with the solution? Why don’t we apply the ‘update_vals’ function with applymap() to the second dataset, dete_survey?
Here is the solution:

Here are my code and output for dete_survey:

I understand that there were not many data points with ‘-’ in dete_survey, but there may be some ‘NaN’ values. Also, my output is different from the solution’s. Am I wrong, or is the solution wrong?

Question #2:
In step 5/11 of this project, it suggests: “You can also plot the values of any numeric columns with [a boxplot] to identify any values that look wrong.”

I did try to plot a boxplot to find outliers in the distributions of those two columns, but it didn’t show any plot.

import matplotlib.pyplot as plt
import pandas as pd
dete_resignations[['dete_start_date', 'cease_date']].boxplot()

the output is:
<matplotlib.axes._subplots.AxesSubplot at 0x7f79bfba8d30>

Do I still need to set up fig, ax = plt.subplots(), …, plt.show() before any plot will show up? I checked Stack Overflow, and it seems you can boxplot any columns of a dataframe directly.

Sorry I have so many questions. I would really appreciate your help and any input.

I haven’t done this mission, but I am wondering how NaN appeared in the output of value_counts for tafe_resignations. That means any(1, skipna=False) produced NaN/None/NaT. Any idea where those 8 NaN came from? What do those 8 rows in tafe_resignations (the input data to any(), after applymap(update_vals)) look like?
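If it helps, here is a hedged sketch of how you could pull those rows out yourself; the column list and the names tafe_resignations and update_vals are assumed from the project and may not match your notebook exactly:

import numpy as np
import pandas as pd

# Hypothetical column list standing in for whichever dissatisfaction factors you cleaned
factor_cols = ['Contributing Factors. Dissatisfaction',
               'Contributing Factors. Job Dissatisfaction']

flags = tafe_resignations[factor_cols].applymap(update_vals)   # assumes update_vals from the project
dissatisfied = flags.any(axis=1, skipna=False)

# The NaN bucket in the counts, and the raw rows behind it
print(dissatisfied.value_counts(dropna=False))
print(flags[dissatisfied.isnull()])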

I just found what may be a bug: print(pd.Series([None]).any(skipna=False)) prints None.
The same goes for print(pd.DataFrame([None]).any(skipna=False)).
My understanding of DataFrame/Series any() is that it only returns True or False; this is the first time I have seen otherwise. If it returns None, value_counts will then convert it to NaN as its index label in the output counts.
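A minimal sketch of what I mean (the exact return value of any(skipna=False) on an object-dtype Series may depend on your pandas version):

import pandas as pd

s = pd.Series([None])                # object dtype
print(repr(s.any(skipna=False)))     # observed to print None instead of a bool

# When a row-wise any() yields None, value_counts(dropna=False) counts it under a missing-value label
print(pd.Series([True, None, False]).value_counts(dropna=False))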

However, update_vals is defined to return one of {False, np.nan, True}, so it seems there is no way to produce the row of Nones required to make any() output None and value_counts output NaN in the index. Could you investigate my question about the 8 NaN?
Note: np.NaN and None are different objects, but both are identified by isnull(). Passing data through pandas functions can convert one into the other: astype(float) and value_counts make NaN from None; I don’t know whether the reverse conversion is possible.
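A quick sketch to illustrate that:

import numpy as np
import pandas as pd

s = pd.Series([None, np.nan], dtype=object)
print(s.isnull())         # True for both None and NaN
print(s.astype(float))    # None becomes NaN, because a float column cannot hold None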

I am also wondering why you have 311 Trues in value_counts. Putting aside the code mismatch between yours and the answer’s, I see you used applymap(update_vals), and you mention the data having ‘-’; that means update_vals must return False somewhere, and False should appear in the dete_resignations value_counts.

As to why the second dataset did not apply update_vals, my guess is that the solution wants to avoid going through the elif pd.isnull(x) conversion. Maybe you can reason it through based on the information above and tell me what you find? (I’d prefer you try some more before sending me the whole notebook and data.)
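For reference, this is roughly what update_vals looks like from the description above (returning one of {False, np.nan, True}, with the elif pd.isnull(x) branch); the exact body in the solution notebook may differ:

import numpy as np
import pandas as pd

def update_vals(x):
    # '-' marks an explicit "no" answer
    if x == '-':
        return False
    # missing answers stay missing
    elif pd.isnull(x):
        return np.nan
    # any other non-missing value counts as a "yes"
    else:
        return True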

  1. If you are doing the project in the Dataquest platform’s Jupyter, then there may be problems out of your control. If you are doing it locally, did you try running %matplotlib inline? Actually I’m not sure of the interactions, but I know I need to insert plt.show() between multiple dataframe.plot calls in one cell for them all to show (see the sketch below). You don’t need any of the fig, ax = plt.subplots() setup; DataFrame.plot sets up everything for you, including drawing the x/y labels from column/index names.
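Something like this in one notebook cell should be enough (a sketch, assuming the dete_resignations dataframe and the two date columns from your earlier post):

%matplotlib inline
import matplotlib.pyplot as plt

dete_resignations[['dete_start_date', 'cease_date']].boxplot()
plt.show()   # not strictly required with %matplotlib inline, but it hides the AxesSubplot text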

I usually work with

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

so all variables in a cell are printed without wrapping them in print(var). The disadvantage is that I must add _ = to swallow all the useless matplotlib output objects when plotting.
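For example (a sketch of that _ = trick):

# Assigning to _ swallows the AxesSubplot object that the 'all' interactivity setting would otherwise echo
_ = dete_resignations[['dete_start_date', 'cease_date']].boxplot()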


@hanqi Thank you for your very detailed answers.

On the 2nd question, you suggested that adding %matplotlib inline would fix the problem. I later checked Stack Overflow, and it seems putting this line of code at the beginning of a Jupyter notebook is a standard configuration nowadays. Thank you again for your advice. :smiley:

As to the first question, I continued with my project and finished it to the end. I did get quite different stats and numbers, but the final pivot-table plot showed the same trend, which is that the established and veteran groups are more likely to resign. The biggest difference is that Dataquest has more counts of “False”, while I have the opposite.

Here is the Dataquest solution:
[two screenshots]

Here is my solution:
[two screenshots]

Please show the code for your bar plots. Thank you.

@ChamionThomas Here is my code and output.

Thank you for your help!