Introduction To Pandas | Dataquest
One of the correct solutions is this :
industry_usa = f500[f500['country']=='USA']['industry'].value_counts().head(2)
The above makes sense intuitively, i.e. a df is filtered with a boolean mask and then an operation is performed on one of its columns, industry.
My question is about the syntax: my own attempt, based on my interpretation of the lessons, throws an error.
The way I tried it, based on previous lessons, is this:
industry_usa = f500['country']=='USA' f500['industry'].value_counts().head(2)
The first part is the boolean, the second selects the column. I expected the code to run, but it threw a syntax error instead.
The differences between the two versions are as follows. I would appreciate some guidance, or links on the syntax, explaining why the second version did not work:
Based on the lessons, my code indexes the industry column as f500['industry'], whereas the correct code doesn't mention f500 at that point, only ['industry'].
My code doesn't mention the df at the start, but the correct code does. My assumption was that mentioning the df at the start was not required, as it is implicit in the boolean and the column selection:
f500['country']=='USA' & f500['industry'].
I am asking because there are too few questions on pandas available on DQ, and this would help new students.
To help clarify things for you, try printing out each to see what these objects actually are. For example, try:
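The snippet that followed here is not preserved in this thread, so below is only a minimal sketch of the idea. The tiny f500 is made-up stand-in data (the real dataset is much larger); only the column names country and industry come from the question.

```python
import pandas as pd

# Hypothetical stand-in for the real f500 DataFrame (made-up rows).
f500 = pd.DataFrame({
    "company": ["A", "B", "C", "D"],
    "country": ["USA", "China", "USA", "Japan"],
    "industry": ["Banking", "Energy", "Retail", "Motor Vehicles"],
})

mask = f500["country"] == "USA"   # boolean Series: one True/False per row
column = f500["industry"]         # Series holding the industry of every row

print(type(mask), mask.dtype)
print(mask)
print(type(column), len(column))
print(column)
```

Neither object is a filtered df. The mask only says which rows match, and f500['industry'] still contains every row.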
In order for the boolean mask to be useful to us, we need to apply it to the df itself. Python will not assume this for us; we need to tell it directly what we want that mask to do. In other words, we need f500 “at the start of the code” in order for it to actually filter our df according to our criteria (i.e. give us the rows where country == USA).
Once we have this “new” df (one where we only have rows where country == USA), we then want to select just the industry column. We can do this by simply using ['industry'] after our newly filtered df (i.e. f500[f500['country']=='USA']). Using a combination of the solution code and your code, another way to accomplish this task could look like this:
mask_usa = f500['country']=='USA'
f500_usa = f500[mask_usa]
industry_usa = f500_usa['industry'].value_counts().head(2)
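To check that the step-by-step version and the one-line solution really do the same thing, here is a small sketch. The f500 rows are invented for illustration, and the names industry_usa_steps and industry_usa_chain are just labels for this comparison.

```python
import pandas as pd

# Hypothetical stand-in for the real f500 DataFrame (made-up rows).
f500 = pd.DataFrame({
    "country": ["USA", "USA", "USA", "China", "USA", "Japan"],
    "industry": ["Banking", "Retail", "Banking", "Energy", "Banking", "Motor Vehicles"],
})

# Step by step: build the mask, filter the rows, then work on one column.
mask_usa = f500["country"] == "USA"
f500_usa = f500[mask_usa]
industry_usa_steps = f500_usa["industry"].value_counts().head(2)

# The same thing written as a single chained expression.
industry_usa_chain = f500[f500["country"] == "USA"]["industry"].value_counts().head(2)

print(industry_usa_steps)
print(industry_usa_steps.equals(industry_usa_chain))
```

The intermediate names make each stage easy to print and inspect, which helps while the chained form still feels dense.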
The reason the solution code doesn’t mention f500 in this part of the code is that f500[f500['country']=='USA'] is a df in and of itself! It’s a “sub df” of f500. If we use f500['industry'], it will give us all the rows of f500, but we only want the rows where the country is USA.
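A quick way to see the difference is to compare row counts. The sketch below uses made-up rows again; the .loc line is an alternative spelling that the lesson may not have covered yet, shown only because it does the row filter and the column selection in one step.

```python
import pandas as pd

# Hypothetical stand-in for the real f500 DataFrame (made-up rows).
f500 = pd.DataFrame({
    "country": ["USA", "USA", "USA", "China", "USA", "Japan"],
    "industry": ["Banking", "Retail", "Banking", "Energy", "Banking", "Motor Vehicles"],
})

all_rows = f500["industry"]                             # every row, no filter applied
usa_rows = f500[f500["country"] == "USA"]["industry"]   # filter the rows, then pick the column
usa_rows_loc = f500.loc[f500["country"] == "USA", "industry"]  # rows and column in one step

print(len(all_rows), len(usa_rows))
print(usa_rows.equals(usa_rows_loc))
```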
I hope this helps clarify things a bit for you and if it doesn’t, please feel free to ask more questions and we can figure it out together!
Super! Thank you. I figured out the logic later once I moved ahead, but the way you explained it made an extra ‘click’. Thanks!
Very nice, congrats!
You’re welcome, it was my pleasure. I’m glad I was able to provide some additional help. Extra ‘clicks’ are good