Why so much reliance on DataFrame.apply() in this mission?

Screen Link:
https://app.dataquest.io/m/467/communicating-results/7/price-vs-category-and-genres

I am a bit surprised that the suggested answers in this mission and the previous one rely so heavily on what looks like unnecessary use of DataFrame.apply().
In earlier courses, we were introduced to vectorized functions and joins, and told about the speed advantage these techniques have over looping through the data.
I noticed that all questions in this mission can be answered just fine without using DataFrame.apply(). So am I missing something here?

Here’s perhaps the most over-complicated example:

DQ’s answer:

categories_mean = affordable_apps.groupby(
    ["affordability", "Category"]
).mean()[["Price"]]

def label_categories(row):
    """For each segment in `categories_mean`,
    labels the apps that cost less than its segment's mean with `1`
    and the others with `0`."""

    aff = row["affordability"]
    cat = row["Category"]
    price = row["Price"]

    if price < categories_mean.loc[(aff, cat)][0]:
        return 1
    else:
        return 0

affordable_apps["category_criterion"] = affordable_apps.apply(
    label_categories, axis="columns"
)
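
For anyone puzzling over the lookup inside `label_categories`, here is a minimal sketch on made-up data (the frame `apps` and its values are invented for illustration, not taken from the mission). It shows what `categories_mean.loc[(aff, cat)]` returns and what taking its first element does. It uses `.iloc[0]` rather than `[0]` because, as far as I know, recent pandas versions warn about integer keys used positionally on a labelled Series.

```python
import pandas as pd

# Made-up data, for illustration only (not the mission's dataset).
apps = pd.DataFrame({
    "affordability": ["cheap", "cheap", "cheap", "reasonable"],
    "Category": ["GAME", "GAME", "TOOLS", "GAME"],
    "Price": [1.0, 3.0, 2.0, 10.0],
})

# One-column DataFrame of mean prices, indexed by (affordability, Category).
categories_mean = apps.groupby(["affordability", "Category"])[["Price"]].mean()

# .loc with a tuple picks one row of the MultiIndex; the result is a
# Series with a single entry labelled "Price".
segment = categories_mean.loc[("cheap", "GAME")]

# Taking the first (and only) element by position yields the scalar mean.
mean_by_position = segment.iloc[0]

# The same value, looked up by row and column label in one step.
mean_by_label = categories_mean.loc[("cheap", "GAME"), "Price"]

print(mean_by_position, mean_by_label)
```

So the trailing index only unwraps the single mean price from a one-element Series so it can be compared with `price`.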

My answer:

# Select "Price" before aggregating so non-numeric columns are never averaged
categories_mean = affordable_apps.groupby(["affordability", "Category"])["Price"].mean()
# This join adds a temporary column "Price_y" holding the mean price of each row's segment:
affordable_apps = affordable_apps.join(categories_mean, how="inner", on=["affordability", "Category"], rsuffix="_y")
# if-then-else logic in two lines:
affordable_apps["category_criterion"] = 0  # the else case
affordable_apps.loc[affordable_apps["Price"] < affordable_apps["Price_y"], "category_criterion"] = 1  # the if case
del affordable_apps["Price_y"]
affordable_apps.sort_index(inplace=True)  # only added to pass the answer checker, which seems to care about row order
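
Another possible variant, sketched here on made-up data rather than the mission's dataset, is `groupby(...).transform("mean")`. It returns the segment mean aligned to the original rows, so no temporary column or re-sorting is needed. The name `segment_mean` is introduced only for this illustration.

```python
import pandas as pd

# Made-up data, for illustration only.
affordable_apps = pd.DataFrame({
    "affordability": ["cheap", "cheap", "cheap", "reasonable", "reasonable"],
    "Category": ["GAME", "GAME", "TOOLS", "GAME", "GAME"],
    "Price": [1.0, 3.0, 2.0, 10.0, 20.0],
})

# transform("mean") broadcasts each segment's mean back onto its rows,
# keeping the original index and row order.
segment_mean = (
    affordable_apps
    .groupby(["affordability", "Category"])["Price"]
    .transform("mean")
)

# Boolean comparison cast to int: 1 if cheaper than the segment mean, else 0.
affordable_apps["category_criterion"] = (
    affordable_apps["Price"] < segment_mean
).astype(int)

print(affordable_apps)
```

I have not checked this against the answer checker, so treat it as a sketch of the idea and not a verified solution.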

I would appreciate your thoughts on which approach would actually be preferred here.

EDIT:
Here’s another, simpler example from the next screen:

DQ’s answer:

def new_price(row):
    if row["affordability"] == "cheap":
        return round(max(row["Price"], cheap_mean), 2)
    else:
        return round(max(row["Price"], reasonable_mean), 2)
    
affordable_apps["New Price"] = affordable_apps.apply(new_price, axis="columns")

Vectorized answer:

affordable_apps["New Price"] = affordable_apps["affordability"].map({"cheap":cheap_mean, "reasonable":reasonable_mean})
affordable_apps["New Price"] = affordable_apps[["Price","New Price"]].max(axis=1).round(2)
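
To check that the two versions agree, here is a small sketch on a made-up frame. The prices and the values of `cheap_mean` and `reasonable_mean` are hypothetical stand-ins, not the mission's actual numbers.

```python
import pandas as pd

# Made-up data and hypothetical means, for illustration only.
affordable_apps = pd.DataFrame({
    "affordability": ["cheap", "cheap", "reasonable", "reasonable"],
    "Price": [0.99, 4.99, 5.49, 12.0],
})
cheap_mean = 2.5
reasonable_mean = 8.0

def new_price(row):
    # Row-by-row version: raise the price to the segment mean if it is lower.
    if row["affordability"] == "cheap":
        return round(max(row["Price"], cheap_mean), 2)
    else:
        return round(max(row["Price"], reasonable_mean), 2)

with_apply = affordable_apps.apply(new_price, axis="columns")

# Vectorized version: map each label to its mean, then take the row-wise max.
affordable_apps["New Price"] = affordable_apps["affordability"].map(
    {"cheap": cheap_mean, "reasonable": reasonable_mean}
)
affordable_apps["New Price"] = (
    affordable_apps[["Price", "New Price"]].max(axis=1).round(2)
)

# True when both approaches produce the same value on every row.
agree = (with_apply == affordable_apps["New Price"]).all()
print(agree)
```

One difference worth keeping in mind: `map` yields NaN for any label missing from the dictionary, whereas the `else` branch of `new_price` treats every non-"cheap" row as reasonable. With only the two labels used here, the results match.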

Great way of using vectorization. The pd.DataFrame.join() approach was nice and easier to understand than the custom function.
Great job!


Hey! These are some good points, thank you for sharing them.

No, you’re not missing anything. One of your techniques creates an unnecessary column (which you end up deleting) and so duplicates data, but that’s not a real issue here, nor is it the reason pandas.DataFrame.apply was used so extensively.

I might incorporate some of these :)


Hi,
Sorry if this has already been covered, but how does the function below work?
Specifically, the [0] index at the end of the line
if price < categories_mean.loc[(aff, cat)][0]:
Many thanks,
David

def label_categories(row):
    """For each segment in `categories_mean`,
    labels the apps that cost less than its segment's mean with `1`
    and the others with `0`."""

    aff = row["affordability"]
    cat = row["Category"]
    price = row["Price"]

    if price < categories_mean.loc[(aff, cat)][0]:
        return 1
    else:
        return 0

Hey, David.

Please ask this in a new topic.

Really awesome answer. I was struggling with the Dataquest answer because I have avoided using DataFrame.apply. I came looking for an explanation and found your code, which I had no problem understanding. Thank you!
