Correlation and np.ones_like 370-5

Screen Link:

Working With Missing Data | Dataquest

Code is as provided by DQ; my queries follow each code snippet.
I need a bit of a nudge to think in the right direction:
1)
cols_with_missing_vals = df.columns[df.isnull().sum() > 0]
missing_corr = df[cols_with_missing_vals].isnull().corr()

Why is .isnull() required here, when .isnull() was already used to create the boolean in the previous line of code? I am tying it back to the logic of creating a DataFrame using a boolean. For example, df[df['sales'] > 500], where sales is one of the columns. We do not code it as df[df['sales'] > 500][df['sales'] > 500], i.e. we do not repeat df['sales'] > 500 twice.
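
To make the comparison in my analogy concrete, here is a tiny sketch with a made-up 'sales' DataFrame (the column names and values are hypothetical) showing the two selection patterns I am comparing:

import numpy as np
import pandas as pd

# Hypothetical data, only to make the two selection patterns concrete
df = pd.DataFrame({
    "sales":  [300, 700, np.nan, 900],
    "region": ["N", "S", "E", "W"],
})

# Boolean ROW filter: the condition appears once and keeps rows where it is True
print(df[df["sales"] > 500])

# Selection by column NAMES: keeps whole columns and leaves every row (and NaN) in place
print(df[["sales"]])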

2)
missing_corr = missing_corr.iloc[1:, :-1]
mask = np.triu(np.ones_like(missing_corr), k=1)

I understand from the documentation that np.ones_like creates an array with the same shape as missing_corr, populated with 1s. However, how does that help in creating the mask? Or, what kind of information is the mask hiding?

Print out df[cols_with_missing_vals] and df[cols_with_missing_vals].isnull() to see what the difference is and try to figure out why the latter might be required. Check out DQ’s correlation related content you covered if need be.

Print out mask and see the pattern in the array to understand what it will hide in relation to missing_corr.

Let me know if you have questions.

For this code, I think what it means is that:
a) the first line is selecting the columns where there is at least 1 null value in the column,

b) and the second line is creating a DataFrame from the columns selected in a). The syntax for correlation is df.corr(). I think the second .isnull() is assisting in creating the DataFrame.
My doubt still remains - we created a boolean by selecting the columns with at least 1 null value and then mapped it onto the DataFrame. That should do the job, so why the second .isnull()? I tried running the code on the DQ console, but it is not generating output for me to compare.

cols_with_missing_vals = df.columns[df.isnull().sum() > 0]
missing_corr = df[cols_with_missing_vals].isnull().corr()
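
Since the DQ console wasn't showing me output, here is a minimal local reproduction with a hypothetical DataFrame (column names and values made up) that prints each step; it is a sketch of the printouts suggested above, not DQ's actual dataset:

import numpy as np
import pandas as pd

# Hypothetical data with some missing values
df = pd.DataFrame({
    "age":    [25, np.nan, 31, np.nan, 40],
    "income": [50, 60, np.nan, np.nan, 80],
    "city":   ["A", "B", "C", "D", "E"],   # no missing values, so its name is excluded
})

cols_with_missing_vals = df.columns[df.isnull().sum() > 0]
print(df[cols_with_missing_vals])            # actual values, NaNs included
print(df[cols_with_missing_vals].isnull())   # True/False frame marking where values are missing

# Correlating the boolean frame measures how often two columns are missing together
missing_corr = df[cols_with_missing_vals].isnull().corr()
print(missing_corr)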

2) I did quite a bit of research on the mask parameter of heatmap. I understand it, but it's not intuitive yet. I think DQ should explain mask, np.triu, and np.ones_like in detail, or call it out in a guided project.

For the benefit of others stumbling onto this screen, I list down the basic workings:
a) Since the correlation heatmap is identical across the main diagonal, we hide the topmost part of the heatmap, which is triangular in shape. To hide it, we use the built-in parameter "mask". The mask hides values wherever the mask array is set to "True". It also hides values where the mask value is 1, which is interpreted as "True".

b) To create a mask, we will use:
i) np.ones_like (to create an array of 1s with a shape "like" the DataFrame used for the heatmap). The 1s will be treated as "True" and the corresponding values will be hidden when the mask is "overlaid" on the actual DataFrame. Some of these 1s will be converted to "0" and will be treated as "False" by seaborn, so those values will be displayed. To convert 1s into 0s, we use np.triu.

ii) np.triu, which refers to the upper triangle of the heatmap/array (np.tril refers to the lower triangle). np.triu takes 2 parameters: an array with a shape identical to the data being used (np.ones_like is passed here) and k, which indicates the position of the diagonal below which all values will be converted to "0". Play around with k=0/1/2 to understand which values are hidden: the DataFrame values that are masked by "0"s remain visible (because "0" is interpreted as False by mask). A sketch tying these steps together is below.
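
Here is the small sketch that ties a), i) and ii) together. The missing_corr frame is a stand-in built from hypothetical data, and seaborn/matplotlib are assumed to be available:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Stand-in correlation frame built from hypothetical missingness booleans
df = pd.DataFrame({
    "a": [1, np.nan, 3, np.nan],
    "b": [np.nan, 2, np.nan, 4],
    "c": [1, 2, np.nan, 4],
})
missing_corr = df.isnull().corr()

# Step i): an array of 1s with the same shape as missing_corr
ones = np.ones_like(missing_corr)
# Step ii): keep only the 1s strictly above the diagonal (k=1), zero out the rest
mask = np.triu(ones, k=1)
print(mask)
# [[0. 1. 1.]
#  [0. 0. 1.]
#  [0. 0. 0.]]

# Step a): cells where mask is 1 (True) are hidden, so only the lower triangle
# and the diagonal of the heatmap are drawn
sns.heatmap(missing_corr, mask=mask, annot=True)
plt.show()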

Notes to myself:

My query:

My doubt still remains - we created a boolean by selecting the columns with at least 1 null value and then mapped it onto the DataFrame. That should do the job, so why the second .isnull()? I tried running the code on the DQ console, but it was not generating output for me to compare.

cols_with_missing_vals = df.columns[df.isnull().sum() > 0]
missing_corr = df[cols_with_missing_vals].isnull().corr()

Solution - I was assuming that dataframe['names of columns'] would give a new DataFrame where the columns had already been filtered for null values.

The columns have NOT been filtered for null values; only the "names" of the columns have been filtered. That is why .isnull() is required after dataframe['names of columns'] to create the boolean DataFrame, i.e. dataframe['names of columns'] will provide all the values of those columns, including nulls (but we do know the names of the columns that have null values because of the previous line of code). With that logic, the second line of code makes sense: dataframe['names of columns'].isnull(). Once the new boolean DataFrame is created, the correlation can be found by using .corr().
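
A quick sketch (hypothetical data again) that confirms this for me: cols_with_missing_vals is only an Index of column names, and df[cols_with_missing_vals] still contains the raw values, NaNs and all, so the extra .isnull() is what produces the boolean frame that .corr() needs:

import numpy as np
import pandas as pd

# Hypothetical DataFrame with missing values
df = pd.DataFrame({
    "height": [150, np.nan, 170],
    "weight": [np.nan, 60, 70],
    "id":     [1, 2, 3],          # complete column, so its NAME is filtered out
})

cols_with_missing_vals = df.columns[df.isnull().sum() > 0]
print(cols_with_missing_vals)      # an Index of column NAMES only: ['height', 'weight']
print(df[cols_with_missing_vals])  # the raw values for those columns; the NaNs are still there,
                                   # which is why .isnull() is still needed before .corr()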