Selecting rows/columns from DF. 347-4

Screen Link:

Working With Missing Data In Pandas | Dataquest

Basic query that has me stumped. Why does code 1) not work, while 2) works, as per DQ, to select rows/columns?

  1. regions_2017 = combined[combined['YEAR'] == 2017], combined['REGION']

  2. regions_2017 = combined[combined['YEAR'] == 2017]['REGION']

You are performing two separate operations here separated by a comma -

  • combined[combined['YEAR'] == 2017]

    • The above will return a DataFrame containing all rows and columns for the year 2017.
  • combined['REGION']

    • The above will return a Series containing every value in the REGION column, for all years, not just 2017.

When you use that comma between them -

var = DataFrame, Series

In Python, the above will result in var being a tuple where index 0 would be that DataFrame and index 1 would be that Series.

That’s not what you want at all.

You want those operations “chained” so that you first get all the rows (with every column) where YEAR == 2017, and then from that output you extract just the REGION column.
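Here is a minimal sketch of the difference between the two attempts. The small `combined` DataFrame is a made-up stand-in for the lesson's data (the values are illustrative only), and `as_tuple` is just a name chosen for this example.

```python
import pandas as pd

# Toy stand-in for the lesson's `combined` DataFrame (made-up values).
combined = pd.DataFrame({
    "YEAR": [2015, 2016, 2017, 2017],
    "REGION": ["Western Europe", "North America",
               "Southern Asia", "Western Europe"],
})

# Attempt 1: the comma packs two independent results into a tuple.
as_tuple = combined[combined["YEAR"] == 2017], combined["REGION"]
print(type(as_tuple).__name__)     # tuple
print(type(as_tuple[0]).__name__)  # DataFrame: 2017 rows, all columns
print(type(as_tuple[1]).__name__)  # Series: REGION for every year

# Attempt 2: chaining filters the rows first, then picks the column.
regions_2017 = combined[combined["YEAR"] == 2017]["REGION"]
print(regions_2017.tolist())       # ['Southern Asia', 'Western Europe']
```

Note that attempt 1 does not raise an error, which is why it can be confusing: it quietly assigns a tuple instead of the filtered column.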

That makes perfect sense, thank you. Could you help me understand how this differs from the NumPy 2D array boolean indexing logic shown below?

I was trying to replicate the logic below, where the boolean (YEAR == 2017) takes the place of the rows and REGION takes the place of the columns, i.e. top_tips = taxi[tip_bool, 5:14].

Boolean Indexing With NumPy | Dataquest

tip_amount = taxi[:,12]
tip_bool = tip_amount > 50
top_tips = taxi[tip_bool, 5:14]

I would highly recommend paying closer attention to how you use/apply brackets in your code first.

Secondly, the equivalent in Pandas worth considering could have been either -

combined[combined['YEAR'] == 2017, "REGION"]


combined[combined['YEAR'] == 2017, 1]

Both try to follow a similar seemingly logical pattern -

dataframe[boolean_indexed_rows, column_name]


dataframe[boolean_indexed_rows, column_number]

But neither of these is valid in Pandas. If you want to combine boolean indexing on the rows with selecting a specific column in a single step, then you would have to use loc -

combined.loc[combined['YEAR'] == 2017, "REGION"]

The above does the same thing as the DQ solution combined[combined['YEAR'] == 2017]['REGION'].
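As a rough side-by-side sketch, assuming a small made-up `combined` DataFrame and a toy NumPy array `arr` (neither is the real lesson data), the comparison looks like this. The exact exception raised by the plain-bracket attempt depends on the Pandas version, so the example catches it broadly. The positional variant with iloc is one of the possible alternatives; it is shown here with the mask converted to a NumPy array, because iloc does not accept a boolean Series as a mask.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values, for illustration only).
combined = pd.DataFrame({
    "YEAR": [2015, 2016, 2017, 2017],
    "REGION": ["Western Europe", "North America",
               "Southern Asia", "Western Europe"],
})
mask = combined["YEAR"] == 2017

# NumPy accepts [row_mask, column] inside plain brackets.
arr = np.array([[2015, 10], [2017, 20], [2017, 30]])
print(arr[arr[:, 0] == 2017, 1])   # [20 30]

# Plain brackets on a DataFrame reject the same pattern.
try:
    combined[mask, "REGION"]
    plain_brackets_failed = False
except Exception as exc:           # exact type varies by Pandas version
    plain_brackets_failed = True
    print("plain brackets raised:", type(exc).__name__)

# .loc takes a row mask and a column label in one step.
via_loc = combined.loc[mask, "REGION"]
via_chain = combined[mask]["REGION"]
print(via_loc.equals(via_chain))   # True

# Positional variant: .iloc needs the mask as a NumPy array.
via_iloc = combined.iloc[mask.to_numpy(), 1]
print(via_iloc.tolist())           # ['Southern Asia', 'Western Europe']
```

For reading values the chained form and loc give the same result; loc is generally the safer habit once you start assigning to a selection, since chained indexing may operate on a copy.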

There are alternatives to the above as well, I believe. You can go through the documentation to learn and experiment if you want to - Indexing and selecting data — pandas 1.3.2 documentation

Awesome. Thanks. ".loc" is the key. I am just trying to link various missions and trying to remember different ways DQ is solving the same type of problems, before moving on to read other codes in future.

