I don't understand why this answer works

Hi,

I don’t understand why the provided answer for this section works. Here is the correct answer:

v_null = (mvc[v_col].isnull() & mvc[c_col].notnull()).sum()
c_null = (mvc[c_col].isnull() & mvc[v_col].notnull()).sum()

We were previously instructed that we needed to use parentheses around EACH element of complex Boolean comparison expressions, but this answer doesn’t do that. The components on each side of the ‘&’ are not individually wrapped in their own set of parentheses.

My original code, which did not work, looked like this:

v_null = mvc[(mvc[v_col].isnull()) & (mvc[c_col].notnull())].sum()
c_null = mvc[(mvc[v_col].notnull()) & (mvc[c_col].isnull())].sum()

Why did the correct answer work, even though it didn’t adhere to the formatting that has always been required in the past?

Why did my code not work?

Thank you!

This can be tricky to understand.

The issue arises because operators (like >, ==, or &) have different precedence when Python evaluates a line of code. It is similar to the order in which you perform arithmetic operations in a mathematical equation, based on BODMAS or PEMDAS.

In Python, & has a higher precedence than the comparison operators (>, <, ==, and so on).

So, when you have something like the following -

combined = f500["revenues"] > 100000 & f500["profits"] < 0

The above code will actually try to evaluate 100000 & f500["profits"] first, and the line will throw an error because that is not the comparison you meant to make.

But, if you had something like the following instead -

condition1 = f500["revenues"] > 100000
condition2 = f500["profits"] < 0
combined =  condition1 & condition2

Now you won’t get any error, because there is no ambiguity about which Boolean comparison executes first. Wrapping each comparison in its own parentheses on a single line has the same effect.
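To make the precedence point concrete, here is a minimal sketch. The small f500 dataframe below is a made-up stand-in, not the real dataset, and the exact exception type may depend on the column dtypes and the pandas version, so the sketch catches both likely ones.

```python
import pandas as pd

# Hypothetical stand-in for the f500 dataframe; the values are made up.
f500 = pd.DataFrame({
    "revenues": [250000.0, 50000.0, 120000.0],
    "profits": [-10.0, 5.0, 30.0],
})

# Without parentheses, & binds tighter than > and <, so Python evaluates
# 100000 & f500["profits"] first and the line fails.
try:
    combined = f500["revenues"] > 100000 & f500["profits"] < 0
    failed = False
except (TypeError, ValueError):
    failed = True

# With each comparison wrapped, the comparisons run first and & combines them.
combined = (f500["revenues"] > 100000) & (f500["profits"] < 0)

print(failed)
print(combined.tolist())
```

Only the first row has revenues above 100000 and negative profits, so only the first entry of the combined mask is True.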

The above is similar to the scenario with -

v_null = (mvc[v_col].isnull() & mvc[c_col].notnull()).sum()
c_null = (mvc[c_col].isnull() & mvc[v_col].notnull()).sum()

isnull() is not really an operation that “competes” with & in terms of which will execute first. A method call such as mvc[v_col].isnull() binds more tightly than &, so it always runs first on the Series it’s being applied to. So, there is no error, and you can use that code with or without those additional parentheses.
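As a quick check of that claim, the sketch below builds the mask both ways and compares them. The tiny mvc dataframe is a hypothetical miniature with a single vehicle/cause pair, not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of mvc with one vehicle/cause column pair.
mvc = pd.DataFrame({
    "vehicle_1": ["Sedan", np.nan, np.nan, "Taxi"],
    "cause_vehicle_1": ["Speeding", "Fatigue", np.nan, np.nan],
})
v_col, c_col = "vehicle_1", "cause_vehicle_1"

# Method calls bind tighter than &, so both spellings build the same mask.
without_parens = mvc[v_col].isnull() & mvc[c_col].notnull()
with_parens = (mvc[v_col].isnull()) & (mvc[c_col].notnull())

same_mask = without_parens.equals(with_parens)

# True counts as 1 when summed, so this counts the matching rows.
v_null = without_parens.sum()

print(same_mask, v_null)
```

The extra parentheses are harmless, so keeping them as a habit is fine; they are only required around comparisons that use operators such as > or ==.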

This is not really explained in the Classroom, though, I think. But hopefully it helps clear up the confusion a bit.

Thank you! This makes sense.

As for why my original code didn’t work, it turns out I was almost correct. Calling .sum() on the filtered dataframe adds up the values in each column instead of counting the rows. It works if you get the length of each resulting dataframe instead:

v_null = len(mvc[(mvc[v_col].isnull()) & (mvc[c_col].notnull())].index)
c_null = len(mvc[(mvc[v_col].notnull()) & (mvc[c_col].isnull())].index)

Thanks for the explanation!
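The difference between the two approaches can be seen in a small sketch. The dataframe and its injured column are made up for illustration, and numeric_only=True is used here only to keep the column sums to the numeric column.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of mvc; the numeric "injured" column is made up.
mvc = pd.DataFrame({
    "vehicle_1": ["Sedan", np.nan, np.nan, "Taxi"],
    "cause_vehicle_1": ["Speeding", "Fatigue", "Fatigue", np.nan],
    "injured": [0, 2, 3, 1],
})
mask = mvc["vehicle_1"].isnull() & mvc["cause_vehicle_1"].notnull()

# Summing the Boolean mask counts the True values, i.e. the matching rows.
count_from_mask = mask.sum()

# Filtering first and taking the length of the index counts the same rows.
count_from_len = len(mvc[mask].index)

# Summing the filtered dataframe adds up the values in each column instead.
column_sums = mvc[mask].sum(numeric_only=True)

print(count_from_mask, count_from_len)
print(column_sums)
```

Both counting approaches give the same number, while the sum of the filtered dataframe is a Series with one total per column.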

Great answer!

My approach was less visually intimidating.

col_labels = ['v_number', 'vehicle_missing', 'cause_missing']

vc_null_data = []

for v in range(1,6):
    v_col = 'vehicle_{}'.format(v)
    c_col = 'cause_vehicle_{}'.format(v)
    
    vv = mvc[v_col].isnull()
    cc = mvc[c_col].isnull()
    
    aa = vv & ~cc
    
    v_null = mvc[aa].shape[0]
    
    bb = cc & ~vv
    
    c_null = mvc[bb].shape[0]
    
    vc_null_data.append([v, v_null, c_null])

vc_null_df = pd.DataFrame(vc_null_data, columns=col_labels)

Hi, I don’t understand the v_col = 'vehicle_{}'.format(v) part of the code below. What are we doing when we are creating v_col and c_col?

col_labels = ['v_number', 'vehicle_missing', 'cause_missing']

vc_null_data = []

for v in range(1,6):
    v_col = 'vehicle_{}'.format(v)
    c_col = 'cause_vehicle_{}'.format(v)
    
    v_null = (mvc[v_col].isnull() & mvc[c_col].notnull()).sum()
    c_null = (mvc[c_col].isnull() & mvc[v_col].notnull()).sum()
    
    vc_null_data.append([v, v_null, c_null])

vc_null_df = pd.DataFrame(vc_null_data, columns=col_labels)

The mvc dataframe has columns called vehicle_1, vehicle_2, etc. and cause_vehicle_1, cause_vehicle_2, etc.

The for loop is just cycling through the numbers 1 through 5 and inserting them into the braces in 'vehicle_{}'.format(v). This way you can perform the sum calculations in the next step for each pair of vehicle number and cause number columns, without having to manually type out all the different numbers and combinations.

In execution, it looks like this for the first loop:

The first loop assigns 1 as the value of v. Then:
v_col = 'vehicle_{}'.format(v) ## this inserts v (in this case 1) into the braces. So:
v_col = 'vehicle_1'

Repeat the same process for c_col.

Now you can do your calculations, which, for this iteration of the loop, look like this:

v_null = (mvc['vehicle_1'].isnull() & mvc['cause_vehicle_1'].notnull()).sum()
etc.

Does this make sense?

You are creating something like this:

vehicle_1 to vehicle_5 and cause_vehicle_1 to cause_vehicle_5

These are column names in the dataframe.
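To see the generated names without touching the dataframe, here is a short sketch that only builds the strings. The pairs list is a hypothetical helper for illustration and is not part of the exercise solution.

```python
# Collect the column-name pairs that the loop generates.
pairs = []
for v in range(1, 6):  # v takes the values 1, 2, 3, 4, 5
    v_col = 'vehicle_{}'.format(v)        # the braces are replaced by v
    c_col = 'cause_vehicle_{}'.format(v)
    pairs.append((v_col, c_col))

for v_col, c_col in pairs:
    print(v_col, c_col)
```

Each pair can then be used to index mvc, for example mvc[v_col] and mvc[c_col], exactly as the loop in the solution does.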