Why create a DataFrame for `genres_mean` and `category_mean`?

Screen Link:
https://app.dataquest.io/m/467/communicating-results/7/price-vs-category-and-genres

My Code:

cols = ["affordability", "Category"]

categories_mean = affordable_apps.groupby(cols).mean()['Price']

affordable_apps['category_criterion'] = affordable_apps.apply(
    lambda r: 1 if r['Price'] < categories_mean[tuple(r[cols])] else 0, axis=1)

Hi,

I noticed that in the sample code below (and in the answer provided by DQ), double brackets are used for genres_mean. This creates a DataFrame, which in turn requires specifying index 0 in the function’s if/else statement:

if price < genres_mean.loc[(aff, gc)][0]:
    return 1

Sample code for genres_mean:

genres_mean = affordable_apps.groupby(
    ["affordability", "genre_count"]
).mean()[["Price"]]

I don’t really get why it is necessary to turn it into a DataFrame; leaving it as a Series seems more convenient to me. Am I missing anything here? :thinking:

As you can see, I simplified the sample code quite a bit in this mission. I definitely sacrificed some readability, but I think it’s still readable. :stuck_out_tongue_winking_eye:

All thoughts are appreciated. :smiley:

It’s not necessary at all. It depends entirely on how you want to use the result afterwards.

Keeping it as a DataFrame lets them use .loc to index it by the two grouping columns (a Series with the same MultiIndex supports .loc as well). Their solution is designed around that, hence the double brackets.
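
To make the Series vs. DataFrame difference concrete, here is a minimal sketch. The toy data and the names `keys`, `mean_series`, and `mean_frame` are made up for illustration and are not the mission’s actual data or solution. It selects Price before aggregating so that only that column is averaged.

```python
import pandas as pd

# Hypothetical toy data standing in for the mission's affordable_apps
affordable_apps = pd.DataFrame({
    "affordability": ["cheap", "cheap", "reasonable", "reasonable"],
    "genre_count": [1, 1, 2, 2],
    "Price": [1.0, 3.0, 10.0, 20.0],
})
keys = ["affordability", "genre_count"]

# Single brackets give a Series, double brackets give a one-column DataFrame
mean_series = affordable_apps.groupby(keys)["Price"].mean()
mean_frame = affordable_apps.groupby(keys)[["Price"]].mean()

# Series: the (affordability, genre_count) tuple returns the scalar directly
series_value = mean_series.loc[("cheap", 1)]

# DataFrame: the same tuple selects a row, so a column must be picked as well
frame_value = mean_frame.loc[("cheap", 1), "Price"]

print(series_value, frame_value)
```

With the DataFrame, the extra `[0]` in the sample solution does the job of picking the single column out of the selected row. As far as I know, recent pandas versions warn about integer keys being treated as positions on a label-indexed Series, so naming the column (as above) or using `.iloc[0]` is probably the safer spelling.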

However, if you time both, your approach is roughly 2-3 times slower than theirs, at least on that specific pandas version. On much larger datasets, that difference could matter.
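
If you want to check the timing yourself, something like the sketch below should work. It uses made-up toy data, and `with_series` / `with_frame` are hypothetical stand-ins for the two approaches rather than the exact mission code, so the numbers you get may differ a lot depending on the data and the pandas version.

```python
import timeit

import pandas as pd

# Hypothetical toy data; the real affordable_apps comes from the mission
affordable_apps = pd.DataFrame({
    "affordability": ["cheap", "reasonable"] * 500,
    "Category": ["Games", "Games", "Tools", "Tools"] * 250,
    "Price": [float(i % 7) for i in range(1000)],
})
cols = ["affordability", "Category"]

mean_series = affordable_apps.groupby(cols)["Price"].mean()  # Series
mean_frame = mean_series.to_frame()                          # one-column DataFrame


def with_series():
    # Row-wise lookup in the Series, keyed by a tuple built from the row
    return affordable_apps.apply(
        lambda r: 1 if r["Price"] < mean_series[tuple(r[cols])] else 0,
        axis=1,
    )


def with_frame():
    # Row-wise lookup in the DataFrame via .loc with an explicit column
    return affordable_apps.apply(
        lambda r: 1
        if r["Price"] < mean_frame.loc[(r["affordability"], r["Category"]), "Price"]
        else 0,
        axis=1,
    )


# Time a few repetitions of each; only the relative difference is interesting
series_time = timeit.timeit(with_series, number=3)
frame_time = timeit.timeit(with_frame, number=3)
print(f"Series lookup: {series_time:.3f}s, DataFrame lookup: {frame_time:.3f}s")
```

Both functions should produce the same 0/1 column, so whatever difference you measure is only about lookup speed.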

Thanks a lot for the reply! I didn’t think about processing time but it’s definitely good and essential to know.

Hi there,
Could somebody help me understand the syntax below?

if price < genres_mean.loc[(aff, gc)][0]:
    return 1

I’m not sure why we need to use [0] there.

Please create a separate post to ask your question since it’s not related to this one. I would also recommend checking out existing questions corresponding to this Mission Step’s tag. There are a couple of such existing questions which should help you out as well.

Hi,

Following up on @veratsien’s question, my understanding is that the code snippet:

genres_mean = affordable_apps.groupby(
    ["affordability", "genre_count"]
).mean()[["Price"]]

…takes Price from the df.groupby() result and returns it as a one-column DataFrame. But while experimenting with the code, I found that genres_mean = affordable_apps.groupby(["affordability", "genre_count"])[["Price"]].mean() gives exactly the same result. Do you know why that is?

Wouldn’t it be more readable to have the code laid out like this? Would it affect processing time?

Thank you in advance for your feedback. Take care!

Focus on the order of operations in both of the codes and you will start to understand why this happens. If you have further questions, please create a new, separate post.
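
As a rough illustration of the order of operations, here is a small sketch on made-up data (the `Reviews` column and the variable names are hypothetical). One version averages every numeric column and then keeps Price; the other keeps Price first and averages only that column.

```python
import pandas as pd

# Hypothetical toy data with one extra numeric column
affordable_apps = pd.DataFrame({
    "affordability": ["cheap", "cheap", "reasonable", "reasonable"],
    "genre_count": [1, 1, 2, 2],
    "Price": [1.0, 3.0, 10.0, 20.0],
    "Reviews": [10, 30, 5, 15],
})
keys = ["affordability", "genre_count"]

# Aggregate first: the mean of every numeric column is computed, then Price is selected
aggregate_then_select = affordable_apps.groupby(keys).mean(numeric_only=True)[["Price"]]

# Select first: only Price is handed to mean()
select_then_aggregate = affordable_apps.groupby(keys)[["Price"]].mean()

same = aggregate_then_select.equals(select_then_aggregate)
print(same)
```

The Price means come out the same either way because each column is averaged independently within a group. Selecting first simply avoids computing means that get thrown away, and it sidesteps problems with non-numeric columns that newer pandas versions refuse to average.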

Thank you kindly for your answer!