Need help to understand the function

Hello world, I am still confused by this function, especially aff and gc. What exactly are aff and gc?

Screen Link: Learn data science with Python and R projects

Correct Code:
def label_genres(row):
    aff = row["affordability"]
    gc = row["genre_count"]
    price = row["Price"]
    if price < genres_mean.loc[(aff, gc)][0]:
        return 1
    else:
        return 0

I thought aff was affordable_apps["affordability"] and gc was affordable_apps["genre_count"], but I got an error when I tried genres_mean.loc[(aff, gc)][0]. Can anyone explain this to me? Thank you.

Hi @dhu4. Since the link you provided shows the code nicely formatted, it’s easy to read but you may want to check out this post for tips on how to format your code when posting to the community. Without proper indentation, it can be difficult to know what’s happening.

I’m not entirely sure, but I think the confusion you’re experiencing is due to the genres_mean object. If you use the variable inspector you’ll see that this variable contains the following data:


                               Price
affordability  genre_count
cheap          1            2.507448
               2            3.155672
reasonable     1           12.574627
               2            6.823333

This shows that genres_mean is a dataframe object with a multi-index. Therefore, when we access genres_mean.loc[(aff,gc)][0] we are accessing one of the values in the Price column, depending on the values for aff and gc.
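To make the multi-index lookup concrete, here is a small sketch that rebuilds genres_mean by hand from the four values shown above. Building it with pd.MultiIndex.from_tuples is only an assumption for illustration; in the project it comes from a groupby on affordable_apps. The sketch uses .loc[(aff, gc), "Price"] to pull out the number directly, which should give the same value as the [0] form.

```python
import pandas as pd

# Hand-built stand-in for genres_mean, using the values printed above.
index = pd.MultiIndex.from_tuples(
    [("cheap", 1), ("cheap", 2), ("reasonable", 1), ("reasonable", 2)],
    names=["affordability", "genre_count"],
)
genres_mean = pd.DataFrame(
    {"Price": [2.507448, 3.155672, 12.574627, 6.823333]}, index=index
)

# A (affordability, genre_count) tuple selects one row of the DataFrame.
# The row comes back as a Series whose only entry is "Price".
row = genres_mean.loc[("cheap", 1)]

# Adding the column label selects the single number in that row.
cheap_one = genres_mean.loc[("cheap", 1), "Price"]
reasonable_two = genres_mean.loc[("reasonable", 2), "Price"]

print(type(row).__name__)
print(cheap_one, reasonable_two)
```

So aff and gc together act like coordinates: they pick out which segment's mean price to compare against.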

Since aff and gc are defined within the scope of the label_genres function, we cannot call on them outside of the function definition. If we do, we will get an error (as you did) which probably said something like:

NameError: name 'aff' is not defined

Try adding a couple print statements inside the function definition and see if this helps demystify what these variables are doing for us:

def label_genres(row):
    """For each segment in `genres_mean`,
    labels the apps that cost less than its segment's mean with `1`
    and the others with `0`."""

    aff = row["affordability"]
    print('This is aff: ' + str(aff))
    gc = row["genre_count"]
    print('This is gc: ' + str(gc))
    price = row["Price"]

    if price < genres_mean.loc[(aff, gc)][0]:
        return 1
    else:
        return 0

affordable_apps["genre_criterion"] = affordable_apps.apply(
    label_genres, axis="columns"
)

Based on the output you observe from the above code, you should notice that aff and gc each point to a single data point in the pandas Series affordable_apps['affordability'] and affordable_apps['genre_count'] respectively.
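If it helps, the same idea can be tried on a tiny made-up table. The sketch below uses a hypothetical two-row affordable_apps (the values are invented, not taken from the project) and a hypothetical helper inspect_row that records what it receives. With axis="columns", apply hands the function one row at a time, so row["affordability"] is a single value, not the whole column.

```python
import pandas as pd

# Hypothetical miniature version of affordable_apps (invented values).
affordable_apps = pd.DataFrame(
    {
        "affordability": ["cheap", "reasonable"],
        "genre_count": [1, 2],
        "Price": [0.99, 7.99],
    }
)

seen = []  # collects what the function receives on each call


def inspect_row(row):
    """Record the type of `row` and the single values it holds."""
    aff = row["affordability"]
    gc = row["genre_count"]
    seen.append((type(row).__name__, aff, gc))
    return 0


# axis="columns" means: call inspect_row once for each row.
affordable_apps.apply(inspect_row, axis="columns")

for entry in seen:
    print(entry)
```

Each recorded entry shows a Series for the row, plus one affordability value and one genre_count value, which is exactly what the (aff, gc) lookup needs.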

At the end of the day, aff and gc can each take on one of two values: cheap or reasonable for aff and 1 or 2 for gc. Therefore, if you want to use genres_mean.loc[(aff,gc)][0] outside of the function definition, you can simply replace aff and gc with corresponding valid values such as:

genres_mean.loc[('cheap', 1)][0]
genres_mean.loc[('cheap', 2)][0]
genres_mean.loc[('reasonable', 1)][0]
genres_mean.loc[('reasonable', 2)][0]

Each of these lines will return the corresponding mean Price based on affordability and genre_count.

I hope this helps answer your question and if not, please let me know and we can try something else.

Happy coding!


Thank you so much for your explanation, Mike! It is very helpful.


QUICK QUESTION

  1. Does the code below make genres_mean a Series or a DataFrame object? I think it is a Series, but you refer to it as a dataframe object in your comment.

genres_mean = affordable_apps.groupby(
    ["affordability", "genre_count"]
).mean()["Price"]

  2. In the code below, genres_mean.loc[(aff, gc)][0] and genres_mean.loc[(aff, gc)] seem to produce the same answer, namely the price associated with the particular affordability and genre_count. Why, then, is the [0] needed?

One way to find out is to run this bit of code:

print(type(genres_mean))

So what did you find out? Is it a DataFrame or a Series?

This also can be clarified with an appropriately placed print() statement:

    print(type(genres_mean.loc[(aff, gc)][0]))
    print(type(genres_mean.loc[(aff, gc)]))

This bit of code shows that the first line is a float while the second line produces a series. Since we need to compare the mean price against the price of each app in order to label it correctly, we need to use the [0] in order to obtain a float that we can use in our if statement for comparison purposes.
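Here is a rough sketch of why the if statement needs a single number. It assumes genres_mean is a DataFrame (rebuilt by hand below with two of the values shown earlier) and uses a made-up price of 0.99. Comparing a price against the one-row Series gives back another Series, and pandas refuses to treat a Series as True or False.

```python
import pandas as pd

# Hand-built stand-in for a DataFrame version of genres_mean.
index = pd.MultiIndex.from_tuples(
    [("cheap", 1), ("cheap", 2)], names=["affordability", "genre_count"]
)
genres_mean = pd.DataFrame({"Price": [2.507448, 3.155672]}, index=index)

price = 0.99  # made-up app price for illustration

segment = genres_mean.loc[("cheap", 1)]  # a Series with one entry, "Price"
segment_mean = segment.iloc[0]           # the number inside that Series

# Comparing against the Series produces a Series of booleans...
comparison = price < segment

# ...and a Series cannot be used directly as an `if` condition.
try:
    if comparison:
        pass
    series_usable_in_if = True
except ValueError:
    series_usable_in_if = False

# Comparing against the extracted number works as expected.
label = 1 if price < segment_mean else 0

print(type(segment).__name__, type(segment_mean).__name__)
print(series_usable_in_if, label)
```

The sketch uses .iloc[0] where the lesson uses [0]. As far as I know, recent pandas versions warn about plain [0] on a Series whose index holds labels, so .iloc[0] or .loc[(aff, gc), "Price"] may be the safer spelling, but both are meant to fetch the same value here.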

It’s been a while since I looked at this code so I hope this makes sense!


Thanks.

Concerning the Series or DataFrame question, I found the post below, which helped clear up my confusion:

We want to use methods of `DataFrame`. Hence, we use `[["Price"]]`.

**Using the extra `[ ]` results in a `DataFrame`**

genres_mean = affordable_apps.groupby(
    ["affordability", "genre_count"]
).mean()[["Price"]]


Now, `genres_mean` is a `DataFrame`.

**Without the extra `[ ]`, the result is a `Series`**

genres_mean = affordable_apps.groupby(
    ["affordability", "genre_count"]
).mean()["Price"]


Now, `genres_mean` is a `Series`.

**To check for type of object,**

Based on this, I understood why genres_mean.loc[(aff, gc)][0] is used instead of genres_mean.loc[(aff, gc)].
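For anyone who wants to check the two cases side by side, here is a minimal sketch on a made-up four-row affordable_apps (the values are invented). One difference from the lesson code, which is my own choice: the Price column is selected before calling mean() instead of after, so that only the numeric column gets averaged. The names as_frame and as_series are hypothetical.

```python
import pandas as pd

# Hypothetical miniature version of affordable_apps (invented values).
affordable_apps = pd.DataFrame(
    {
        "affordability": ["cheap", "cheap", "reasonable", "reasonable"],
        "genre_count": [1, 1, 2, 2],
        "Price": [1.0, 3.0, 6.0, 8.0],
    }
)

grouped = affordable_apps.groupby(["affordability", "genre_count"])

# Double brackets keep the result as a DataFrame with one column.
as_frame = grouped[["Price"]].mean()

# Single brackets give a Series instead.
as_series = grouped["Price"].mean()

print(type(as_frame).__name__)   # DataFrame
print(type(as_series).__name__)  # Series

# With the DataFrame, the tuple picks a row, so a column is still needed.
frame_value = as_frame.loc[("cheap", 1), "Price"]

# With the Series, the tuple alone already gives the number.
series_value = as_series.loc[("cheap", 1)]

print(frame_value, series_value)
```

In the Series case there is nothing left to index after .loc[(aff, gc)], which is why the extra step only makes sense when genres_mean is a DataFrame.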
