Indexing and assigning values

Hi there,

Came across the following code in the Data Analysis in Business course and have the following question. The left side of the equation has 596 rows due to the indexing while the right side has more than 700 rows since it wasn’t indexed. Given the difference in number of rows, why is it that the assignment works in this case?

Many thanks in advance!

affordable_apps.loc[cheap, "price_criterion"] = affordable_apps["Price"].apply(lambda price: 1.0 if price < cheap_mean else 0.0)


I’ve never done this mission. I’ll assume cheap is a boolean series derived from a column in affordable_apps, so cheap has the same index and number of rows as affordable_apps. I’ll also assume price_criterion is a new column created by this line.

Do you get NaN under the price_criterion column for rows that aren’t selected by cheap on the left-hand side?

I have trouble with this syntax too. Examples of assigning to .loc[row, col] are already rare, and having more rows on the RHS is rarer still. Usually the RHS is a single constant to be broadcast.

Hey, Ryan.

That’s explained in this screen of a previous course. Pandas uses the index to find how to align the rows.
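
To make that concrete, here is a minimal sketch of the alignment. The dataframe apps and its prices are made up for illustration (they are not the course data), and cheap and cheap_mean are built the way I assume the exercise builds them.

```python
import pandas as pd

# Hypothetical stand-in for affordable_apps (made-up prices)
apps = pd.DataFrame({"Price": [0.99, 1.99, 3.49, 4.99, 2.49]})

# Boolean mask with the same index as apps; selects rows 0, 1 and 4
cheap = apps["Price"] < 3.0
cheap_mean = apps.loc[cheap, "Price"].mean()

# The right-hand side covers all 5 rows, the left-hand side only 3
rhs = apps["Price"].apply(lambda price: 1.0 if price < cheap_mean else 0.0)
apps.loc[cheap, "price_criterion"] = rhs

# Pandas matches rows by index label, so only the values labelled
# 0, 1 and 4 are used; rows 2 and 3 are left as NaN
print(apps)
```

The extra rows on the right-hand side are simply ignored, because there is no selected row on the left with a matching label.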

I hope this helps.


Hi Bruno, I see that Pandas uses the index to align the rows. However, I don’t understand why we have to specify both the row and the column. In the screen you linked to, it seemed like the new column was simply added to the main dataframe with code like this:
df["new_column"] = list_of_values

But here, when I tried to add price_criterion, it wouldn’t allow me to do so without using .loc[row, column].
What is different about this case?

Hey, Chris.

When you say it wouldn’t allow you to do so, you mean that it wouldn’t pass the screen, right? If I understood you correctly, the code itself should run, but it shouldn’t pass the screen. Let’s see why.

I’ll begin by loading a small dataset, just five rows.

>>> import seaborn as sns
>>> df = sns.load_dataset("tips").head().copy()
>>> df["tip_pct"] = df.tip/df.total_bill
>>> df
   total_bill   tip     sex smoker  day    time  size   tip_pct
0       16.99  1.01  Female     No  Sun  Dinner     2  0.059447
1       10.34  1.66    Male     No  Sun  Dinner     3  0.160542
2       21.01  3.50    Male     No  Sun  Dinner     3  0.166587
3       23.68  3.31    Male     No  Sun  Dinner     2  0.139780
4       24.59  3.61  Female     No  Sun  Dinner     4  0.146808

We’ll create a new column called tip_size according to the rule:

  • If tip_pct is larger than 15%, then it should take the value Big tip
  • Otherwise it should take the value Small tip

I’ll do it in a similar way to what was done in the exercise. It will look odd here, but that’s because the exercise has an additional layer of filters that makes the technique appropriate there.

We’ll create two auxiliary functions:

>>> def classify_big_tip(tip):
...     if tip > 0.15:
...         return "Big tip"
... 
>>> def classify_small_tip(tip):
...     if tip <= 0.15:
...         return "Small tip"
... 

Answer this question for yourself before reading the rest: What does classify_big_tip(0.1) return?

The technique you suggest, if I understood you correctly, is to run

df["tip_size"] = df.tip_pct.apply(classify_big_tip)
df["tip_size"] = df.tip_pct.apply(classify_small_tip)

Let’s see what changes this makes to the dataframe:

>>> df["tip_size"] = df.tip_pct.apply(classify_big_tip)
>>> df
   total_bill   tip     sex smoker  day    time  size   tip_pct tip_size
0       16.99  1.01  Female     No  Sun  Dinner     2  0.059447     None
1       10.34  1.66    Male     No  Sun  Dinner     3  0.160542  Big tip
2       21.01  3.50    Male     No  Sun  Dinner     3  0.166587  Big tip
3       23.68  3.31    Male     No  Sun  Dinner     2  0.139780     None
4       24.59  3.61  Female     No  Sun  Dinner     4  0.146808     None

It assigned None to the small tips. Let’s run the second line of code.

>>> df["tip_size"] = df.tip_pct.apply(classify_small_tip)
>>> df
   total_bill   tip     sex smoker  day    time  size   tip_pct   tip_size
0       16.99  1.01  Female     No  Sun  Dinner     2  0.059447  Small tip
1       10.34  1.66    Male     No  Sun  Dinner     3  0.160542       None
2       21.01  3.50    Male     No  Sun  Dinner     3  0.166587       None
3       23.68  3.31    Male     No  Sun  Dinner     2  0.139780  Small tip
4       24.59  3.61  Female     No  Sun  Dinner     4  0.146808  Small tip

It assigned None to the big tips, overwriting the result of the first line, so we didn’t get the result we wanted. This is why it’s necessary to assign only to specific rows by filtering, so that the other rows are left unchanged. That is what DataFrame.loc is being used for.
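
For comparison, here is a rough sketch of the filtered version, using the same two helper functions and the same five tip_pct values typed in by hand (so it doesn’t need seaborn). The mask name big is just something I picked for the example.

```python
import pandas as pd

def classify_big_tip(tip):
    if tip > 0.15:
        return "Big tip"

def classify_small_tip(tip):
    if tip <= 0.15:
        return "Small tip"

# The tip_pct values from the five rows shown above
df = pd.DataFrame({"tip_pct": [0.059447, 0.160542, 0.166587, 0.139780, 0.146808]})

# Boolean mask marking the rows that count as big tips
big = df["tip_pct"] > 0.15

# Each assignment only touches the rows selected by its mask,
# so the second line no longer overwrites the first
df.loc[big, "tip_size"] = df["tip_pct"].apply(classify_big_tip)
df.loc[~big, "tip_size"] = df["tip_pct"].apply(classify_small_tip)
print(df)
```

Both right-hand sides still have all five rows, but only the values whose index labels match the selected rows are written.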


Hi Bruno,

Thank you for your thorough answer. It helped clarify everything for me. However, I was running into a more rudimentary issue. My code wouldn’t even run, turning up different errors as I tried to change the code.

Ultimately, I think my issue was that I was trying to filter the dataframe in the wrong places. Or, rather, I was misunderstanding where filters were being applied and where I needed to apply them.

Blending your example with the screen’s, I was trying to apply something along the lines of:

def classify_small_tip(df):
    if (df["tip_pct"] <= 0.15) and (df["tip"] > 2) 
        return df["tip_size"] = 1

and then my apply statement would have looked like df.apply(classify_small_tip).

And everything just turned into a mess. :sweat_smile:
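
For anyone following along, a version of that idea that does run might look like the sketch below. It is only an illustration under my own assumptions: the function name is made up, the data is the five rows from the example typed in by hand, and the function returns a value for each row (with axis=1) instead of trying to assign inside the function.

```python
import pandas as pd

# The tip and total_bill values from the five example rows
df = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
})
df["tip_pct"] = df["tip"] / df["total_bill"]

def small_pct_but_large_tip(row):
    # With axis=1, row is a single row of the dataframe (a Series),
    # so both comparisons are on scalars and "and" works as expected
    if (row["tip_pct"] <= 0.15) and (row["tip"] > 2):
        return 1
    return 0

# axis=1 calls the function once per row; the default (axis=0)
# would pass whole columns instead
df["tip_size"] = df.apply(small_pct_but_large_tip, axis=1)
print(df)
```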

Anyway, even though my initial problem wasn’t what you answered, I think I know where I was going wrong, and I would have run headlong into the problem that you explained. So thank you for your response and explanation!
