
Selecting rows/columns from DF. 347-4

Screen Link:

Working With Missing Data In Pandas | Dataquest

Basic query that has me stumped. Why does code 1) not work, while 2) works as per DQ to select rows/columns?

  1. regions_2017 = combined[combined['YEAR'] == 2017], combined['REGION']

  2. regions_2017 = combined[combined['YEAR'] == 2017]['REGION']

You are performing two separate operations here, separated by a comma -

  • combined[combined['YEAR'] == 2017]

    • The above will return a DataFrame containing all rows for the year 2017, with all of their columns.
  • combined['REGION']

    • The above will return a Series containing every row of the REGION column, not just the 2017 rows.

When you use that comma between them -

var = DataFrame, Series

In Python, the above will result in var being a tuple where index 0 would be that DataFrame and index 1 would be that Series.
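To see the tuple for yourself, here is a minimal sketch. The small DataFrame is made up and only stands in for the real combined data, and the SCORE column is a hypothetical extra column for illustration.

```python
import pandas as pd

# Made-up stand-in for the `combined` DataFrame (values are illustrative only)
combined = pd.DataFrame({
    "YEAR": [2015, 2016, 2017, 2017],
    "REGION": ["Western Europe", "North America", "Western Europe", "Southern Asia"],
    "SCORE": [7.5, 7.1, 7.4, 4.3],
})

# The comma builds a tuple out of two independent expressions
var = combined[combined["YEAR"] == 2017], combined["REGION"]

print(type(var))     # the tuple itself
print(type(var[0]))  # the filtered DataFrame (2017 rows, all columns)
print(type(var[1]))  # the unfiltered REGION Series (all years)
print(len(var[0]), len(var[1]))
```

Note that the Series at index 1 still has all four rows, because it was never filtered by year.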

That’s not what you want at all.

You want those operations “chained”, so that you first get the rows (with all columns) where YEAR == 2017, and then from that output you extract just the REGION column.
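As a rough sketch of what the chaining does, it can be split into two steps. The DataFrame below is made up and only stands in for combined.

```python
import pandas as pd

# Made-up stand-in for the `combined` DataFrame (values are illustrative only)
combined = pd.DataFrame({
    "YEAR": [2015, 2016, 2017, 2017],
    "REGION": ["Western Europe", "North America", "Western Europe", "Southern Asia"],
})

# Step 1: keep only the rows where YEAR is 2017 (result is a DataFrame)
rows_2017 = combined[combined["YEAR"] == 2017]

# Step 2: from that filtered DataFrame, pull out the REGION column (result is a Series)
regions_2017 = rows_2017["REGION"]

# The one-line chained version gives the same Series
chained = combined[combined["YEAR"] == 2017]["REGION"]
print(regions_2017.equals(chained))
```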


That makes perfect sense, thank you. Could you help me understand how this is different from the NumPy 2D array boolean indexing logic mentioned ahead?

I was trying to replicate that logic, where the boolean (YEAR == 2017) takes the place of the row selection and REGION takes the place of the column selection, i.e. top_tips = taxi[tip_bool, 5:14].

Boolean Indexing With NumPy | Dataquest

tip_amount = taxi[:,12]
tip_bool = tip_amount > 50
top_tips = taxi[tip_bool, 5:14]
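For reference, here is a small runnable sketch of that NumPy pattern. The 4x15 array is made up purely so that the column positions from the lesson code exist, and it is not the real taxi data.

```python
import numpy as np

# Made-up 4x15 array standing in for the lesson's `taxi` data
taxi = np.arange(60).reshape(4, 15)

tip_amount = taxi[:, 12]         # column 12 of every row
tip_bool = tip_amount > 50       # boolean mask, one value per row
top_tips = taxi[tip_bool, 5:14]  # rows where the mask is True, columns 5 through 13

print(tip_bool)
print(top_tips.shape)
```

In NumPy the comma inside the brackets separates the row selection from the column selection, which is why a boolean mask and a column slice can sit side by side.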

I would highly recommend paying closer attention to how you use/apply brackets in your code first.

Secondly, the equivalent in Pandas worth considering could have been either -

combined[combined['YEAR'] == 2017, "REGION"]


combined[combined['YEAR'] == 2017, 1]

Both try to follow a similar seemingly logical pattern -

dataframe[boolean_indexed_rows, column_name]


dataframe[boolean_indexed_rows, column_number]

But neither of these is valid in Pandas. If you want to combine boolean indexing to get rows with accessing a specific column in a single step, then you would have to use loc -

combined.loc[combined['YEAR'] == 2017, "REGION"]

The above returns the same result as the DQ solution combined[combined['YEAR'] == 2017]['REGION'].
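A quick way to check this yourself is the sketch below, which uses a made-up DataFrame standing in for combined. It also tries the plain-bracket version with a comma to show that it fails. The exact exception type can differ between Pandas versions, so the sketch catches a generic Exception.

```python
import pandas as pd

# Made-up stand-in for the `combined` DataFrame (values are illustrative only)
combined = pd.DataFrame({
    "YEAR": [2015, 2016, 2017, 2017],
    "REGION": ["Western Europe", "North America", "Western Europe", "Southern Asia"],
})

mask = combined["YEAR"] == 2017

chained = combined[mask]["REGION"]       # filter rows, then pick the column
with_loc = combined.loc[mask, "REGION"]  # rows and column in a single .loc call

print(chained.equals(with_loc))

# Plain brackets do not accept "rows, column" the way a NumPy array does
try:
    combined[mask, "REGION"]
    plain_brackets_failed = False
except Exception as err:  # exact exception type may vary by Pandas version
    plain_brackets_failed = True
    print("Plain brackets raised:", type(err).__name__)
```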

There are alternatives to the above as well, I believe. You can go through the documentation to learn and experiment if you want to - Indexing and selecting data — pandas 1.3.2 documentation


Awesome. Thanks. ".loc" is the key. I am just trying to link the various missions and remember the different ways DQ solves the same type of problem, before moving on to read other code in future.
