Boolean Indexing

Hello all,

I am practising boolean indexing in the pandas intro section "Exploring Data with pandas: Fundamentals", and I’m at the end of the task where the instructions are to "Create a series, industry_usa, containing counts of the two most common values in the industry column for companies headquartered in the USA".

The correct code is apparently:

industry_usa = f500['industry'][f500['country'] == 'USA'].value_counts().head(2)
sector_china = f500["sector"][f500["country"] == "China"].value_counts().head(3)

There has been no example where we use indexing as described above.

Why can we suddenly use series one after the other, i.e. f500['industry'][f500['country'] == 'USA']? The back-to-back series calling is something I didn’t know we could do. Could someone please explain why this works?

https://app.dataquest.io/m/381/exploring-data-with-pandas%3A-fundamentals/12/challenge-top-performers-by-country

Thank you in advance

P.S. I’m very new to this.

8 Likes

This is how I like to visualise it:
industry_usa =f500[f500['country'] == 'USA']['industry'].value_counts().head(2)

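So recall that f500["industry"] is one way to select a series from the dataframe, and f500['country'] == "USA" creates a boolean indexer (one True or False per row) that filters whatever it is applied to inside square brackets. In the line above, the indexer filters the dataframe's rows first and ['industry'] then selects the column; in the original solution the order is reversed, so the indexer filters the already-selected series. Either way, the value_counts() method is called on the filtered series.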
Hope this will be helpful.
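
To make the two orderings concrete, here is a minimal sketch. The small f500 dataframe is a made-up stand-in (not the real Fortune 500 data), and the names usa_mask, column_then_rows, and rows_then_column are only for illustration:

```python
import pandas as pd

# Hypothetical stand-in for the f500 dataframe (made-up rows, not the real data).
f500 = pd.DataFrame({
    "country": ["USA", "China", "USA", "USA", "China"],
    "industry": ["Retail", "Power", "Retail", "Petroleum", "Banking"],
})

# Boolean series: one True/False value per row of f500.
usa_mask = f500["country"] == "USA"

# Original solution: select the column first, then filter the series.
column_then_rows = f500["industry"][usa_mask]

# Visualised version: filter the rows first, then select the column.
rows_then_column = f500[usa_mask]["industry"]

# Both orderings keep the same rows of the same column.
print(column_then_rows.equals(rows_then_column))
print(column_then_rows.value_counts().head(2))
```

Both orderings should give the same series here, so value_counts() sees the same values either way.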

1 Like

Thanks for the response, Austin. I think I understand. But why is there no comma separating [f500['country'] == 'USA'] and ['industry'], as in ([f500['country'] == 'USA'], ['industry'])? Also, why are we not using loc for this? I appreciate the help.

3 Likes

There is more than one way to do things; this is a matter of syntactic flavour. dataframe[column_name] is the common way of selecting a series. Putting a boolean condition inside the square brackets instead specifies that you want the rows that match the given criteria. The "," is used in .loc and .iloc to specify the row part and the column part. So, in
f500.loc[f500['country'] == 'USA', 'industry'], f500['country'] == 'USA' is the rows you want, followed by a comma, then 'industry', the column you want.
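
As a rough sketch of the .loc form with a row part and a column part, compared against the back-to-back brackets. The tiny f500 dataframe and the names usa_mask, with_loc, and chained are assumptions made up for this example:

```python
import pandas as pd

# Hypothetical stand-in for the f500 dataframe (made-up rows, not the real data).
f500 = pd.DataFrame({
    "country": ["USA", "China", "USA", "USA", "China"],
    "industry": ["Retail", "Power", "Retail", "Petroleum", "Banking"],
})

usa_mask = f500["country"] == "USA"

# .loc takes "rows, columns" separated by a comma inside one pair of brackets.
with_loc = f500.loc[usa_mask, "industry"]

# Plain brackets take one thing at a time, so they are applied back to back.
chained = f500["industry"][usa_mask]

print(with_loc.equals(chained))
print(with_loc.value_counts().head(2))
```

For reading values like this, both spellings select the same data; the comma only has a row/column meaning inside .loc and .iloc.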

1 Like

I really appreciate your reply. If I could ask one more question. So if I were to do this the second way, it would be

industry_usa =f500[[f500['country'] == 'USA'], 'industry'].value_counts().head(2)

Would this be the correct way to write the alternative method, if I chose not to use back-to-back square brackets?

Thank you again.

I also can’t follow the logic for the solution provided. I hope somebody from DQ can give us more explanation behind this technique.

industry_usa = f500["industry"][f500["country"] == "USA"].value_counts().head(2)

If I do the following, which achieves the same result, it works.

industry_usa = f500.loc[f500["country"] == "USA", "industry"].value_counts().head(2)

3 Likes

To the best of my knowledge, I would say no, that is not the right way. The comma is for the .loc or .iloc methods, to differentiate rows and columns.
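
A quick way to check this is to try it on a toy dataframe. This is only a sketch: the f500 data below is made up, comma_worked is a hypothetical flag, and the exact exception type may differ between pandas versions, so the example catches Exception broadly:

```python
import pandas as pd

# Hypothetical stand-in for the f500 dataframe (made-up rows, not the real data).
f500 = pd.DataFrame({
    "country": ["USA", "China", "USA", "USA", "China"],
    "industry": ["Retail", "Power", "Retail", "Petroleum", "Banking"],
})

usa_mask = f500["country"] == "USA"

# Plain brackets with a comma: pandas receives a single tuple key
# ([usa_mask], "industry") and cannot interpret it as rows and columns.
try:
    f500[[usa_mask], "industry"]
    comma_worked = True
except Exception as err:  # exact exception class may vary by pandas version
    comma_worked = False
    print("plain brackets with a comma failed:", type(err).__name__)

# The same idea written with .loc, where the comma does mean "rows, columns".
industry_usa = f500.loc[usa_mask, "industry"].value_counts().head(2)
print(industry_usa)
```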

I also happened to be at this exercise recently, and was also struggling a bit. This is what worked for me:

industry_usa = f500.loc[f500["country"]=="USA","industry"].value_counts().head(2)

which is the same solution as @ryreisback gave.

I would also be curious to learn whether that is a solid solution as well, and whether it is more or less preferred than the other solutions.

3 Likes

This helped me visualize the logic.

Original

industry_usa = f500['industry'][f500["country"]=="USA"].value_counts().head(2)

Broken Down

industry = f500['industry']
a_bool = f500["country"]=="USA"
industry_usa2 = industry[a_bool].value_counts().head(2)

It looks like f500['industry'] is just a series you can apply a boolean vector to. Hope this helps someone.

11 Likes

I also had difficulty understanding the f500['industry'][f500['country'] == 'USA'] logic.

Would the two below then give the same result?

f500['industry']['country']
f500[['industry', 'country']]
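
As far as I can tell they are not the same. A minimal sketch on a made-up f500 dataframe (the default integer index used here is an assumption, as are the names two_columns and lookup_worked):

```python
import pandas as pd

# Hypothetical stand-in for the f500 dataframe (made-up rows, default integer index).
f500 = pd.DataFrame({
    "country": ["USA", "China", "USA"],
    "industry": ["Retail", "Power", "Petroleum"],
})

# A list of column names inside the brackets returns a dataframe with both columns.
two_columns = f500[["industry", "country"]]
print(type(two_columns).__name__, list(two_columns.columns))

# Back-to-back brackets first select the industry series, then look up the
# label "country" in that series' index, which does not exist in this toy frame.
try:
    f500["industry"]["country"]
    lookup_worked = True
except KeyError:
    lookup_worked = False
    print("'country' is not a label in the industry series' index")
```

The second pair of brackets in f500['industry'][f500['country'] == 'USA'] works because it holds a boolean series, not a column name.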

Thanks to all of you that explained other solutions early in the discussion. I tried the .loc method, however I received an error.

This was my code:
usa = f500['country'] == 'USA'
industry_usa = f500.loc[usa, 'industry'].value_counts().head(2)

china = f500['country'] =='China'
sector_china ==f500.loc[china,'sector'].value_counts().head(3)

My usa series worked, however, the china one did not. Is there a reason for this?
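
One possible cause, going only by the code as posted (the error message isn't shown, so this is a guess): the china line uses == (comparison) where the usa line uses = (assignment). A small sketch on a made-up f500 dataframe:

```python
import pandas as pd

# Hypothetical stand-in for the f500 dataframe (made-up rows, not the real data).
f500 = pd.DataFrame({
    "country": ["China", "USA", "China", "China"],
    "sector": ["Energy", "Retailing", "Financials", "Energy"],
})

china = f500["country"] == "China"

# "==" compares two values; it does not create sector_china. Because the name
# has not been assigned yet, Python raises NameError when it tries to read it.
try:
    sector_china == f500.loc[china, "sector"].value_counts().head(3)
except NameError as err:
    print("comparison failed:", err)

# A single "=" assigns the result to the name.
sector_china = f500.loc[china, "sector"].value_counts().head(3)
print(sector_china)
```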

This is honestly one of the more annoying challenges so far. If you’re going to throw us a new way of displaying data, please make sure it was reviewed at some point beforehand.

Best I can tell, there’s nothing in any of the prior lessons indicating this pattern is the correct one to use, which is pretty frustrating to a new learner. Save the trickiness for a later part in the course.

3 Likes

I agree! We go through all of that selection-replacing schema, chaining, etc, then the solution here is in a format we did several chapters ago.

2 Likes

Thank you for the feedback! I will let the content team know about it.

Best,
Sahil

1 Like