217-x Selecting/Filtering data while using a df.method()


I am in the Guided Project for Data Cleaning Walkthrough doing the Analysis of NYC High School Data.

I am following the project solution from the GitHub repository provided by DQ. The next step in the mission is to plot the correlations for the survey_fields list against the sat_score column in the combined dataframe. The solution to this step is as follows:

combined.corr()["sat_score"][survey_fields].plot.bar()


My question is, why does the above syntax work? Shouldn’t it look like this:

combined.corr(["sat_score", survey_fields]).plot.bar()


The syntax looks like the solution above, and I can’t figure out why this makes any sense, syntactically speaking.

Thanks in advance!

While trying to figure out my answer, I stumbled across something else. I know how to filter using the .loc attribute, which looks something like this: df.loc[df[column] == X, [column]]

However, I’m finding instances where .loc is never used and is instead replaced with syntax that looks like this: df[df["column"] == X]["column"]

Where did the .loc go? Is the original post just indexing without using .loc?

Here’s another case in the same project, just in case I wasn’t clear above:

gender_fields = ["male_per", "female_per"]

Let’s disregard the .plot.bar() part, since it’s not relevant to your questions.

Question 2

The short answer is “No, it shouldn’t look like you say”. Let’s look at the documentation for the DataFrame.corr method:

DataFrame.corr(method='pearson', min_periods=1)

So combined.corr(["sat_score", survey_fields]) doesn’t fit this in any way: the method accepts a correlation method and a minimum number of observations, not column names. This, I think, is easy to accept. I hope that answering your first question will bring greater clarity into what’s going on.
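To make the signature concrete, here is a minimal sketch. The toy `combined` dataframe and its values are made up for illustration and are not the project's data. It only shows that the arguments control how the correlation is computed, never which columns are used:

```python
import pandas as pd

# Toy stand-in for `combined`; column names and values are invented.
combined = pd.DataFrame({
    "sat_score": [1200.0, 1350.0, 1100.0, 1500.0],
    "saf_s_11": [6.0, 7.5, 5.5, 8.0],
    "com_s_11": [5.0, 6.5, 6.0, 7.0],
})

# corr() is told *how* to correlate, not *what* to correlate.
default_call = combined.corr()
explicit_call = combined.corr(method="pearson", min_periods=1)

# Both calls return the same full matrix over all numeric columns.
print(default_call.equals(explicit_call))
print(list(default_call.columns))
```

Picking out sat_score or the survey fields therefore has to happen before or after the call, with ordinary indexing.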

Question 3

Please read the last part of screen 291.5:

And connect this with screen 381.9:
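As a rough sketch of the idea behind this question (the toy dataframe and its column names are hypothetical, not taken from the project), filtering rows and picking a column gives the same result with or without .loc:

```python
import pandas as pd

# Hypothetical toy data, not the project's dataset.
df = pd.DataFrame({
    "boro": ["Bronx", "Queens", "Bronx", "Brooklyn"],
    "sat_score": [1100, 1300, 1150, 1250],
})

mask = df["boro"] == "Bronx"

# One step: rows by boolean mask, column by label.
with_loc = df.loc[mask, "sat_score"]

# Two steps: df[mask] is a dataframe, then ["sat_score"] picks a column.
without_loc = df[mask]["sat_score"]

print(with_loc.equals(without_loc))
```

For reading data the two spellings are interchangeable. For assigning values, the single-step .loc form is the safer choice, because the chained form may operate on a temporary copy.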

Question 1

Whenever you have chained methods and you can’t figure out what’s going on, it’s a good idea to unchain them and make intermediate assignments in order to inspect the process.
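Here is one way the unchaining could look. It is a sketch on invented toy data: the `combined` values, the two entries in `survey_fields`, and the `step1`/`step2`/`step3` names are all assumptions made for the example.

```python
import pandas as pd

# Toy stand-ins for the project's objects; values are invented.
combined = pd.DataFrame({
    "sat_score": [1200.0, 1350.0, 1100.0, 1500.0],
    "saf_s_11": [6.0, 7.5, 5.5, 8.0],
    "com_s_11": [5.0, 6.5, 6.0, 7.0],
})
survey_fields = ["saf_s_11", "com_s_11"]

step1 = combined.corr()        # dataframe: the full correlation matrix
step2 = step1["sat_score"]     # series: one column of that matrix
step3 = step2[survey_fields]   # series: only the requested labels

# Inspect each intermediate result instead of guessing.
for name, obj in [("step1", step1), ("step2", step2), ("step3", step3)]:
    print(name, type(obj).__name__, obj.shape)
```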

combined.corr() looks something like this:

It’s a dataframe. By virtue of what the DataFrame.corr method does, combined.corr():

  • Has as many rows as it does columns. In other words, it’s a square matrix.
  • Has a row index that consists of the column names of combined.
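Both properties are easy to check directly. The snippet below is a sketch on a made-up `combined` dataframe with numeric columns only, not the project's data:

```python
import pandas as pd

# Toy stand-in for `combined`; values are invented, all columns numeric.
combined = pd.DataFrame({
    "sat_score": [1200.0, 1350.0, 1100.0, 1500.0],
    "saf_s_11": [6.0, 7.5, 5.5, 8.0],
    "com_s_11": [5.0, 6.5, 6.0, 7.0],
})

corr_matrix = combined.corr()

n_rows, n_cols = corr_matrix.shape
is_square = n_rows == n_cols

# Row labels and column labels are both the column names of `combined`.
same_labels = (
    list(corr_matrix.index) == list(combined.columns)
    and list(corr_matrix.columns) == list(combined.columns)
)

print(is_square, same_labels)
```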

From what we saw in question 3, it follows that combined.corr()["sat_score"] is a column of the dataframe combined.corr(). And by a “column” I actually mean a series object. It looks something like this:

This series inherited the index labels from combined.corr().

We can then select some of these rows by passing in a list with the labels we wish to keep. This is what combined.corr()["sat_score"][survey_fields] accomplishes. It is equivalent to combined.corr()["sat_score"].loc[survey_fields], and, because a correlation matrix is symmetric (the "sat_score" row holds the same values as the "sat_score" column), also to combined.corr().loc["sat_score"].loc[survey_fields].
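The three spellings can be compared side by side. This is a sketch with invented toy data and a two-item `survey_fields` list assumed for illustration. The last comparison uses a small tolerance rather than exact equality, just to be safe with floating-point values:

```python
import pandas as pd

# Toy stand-ins for the project's objects; values are invented.
combined = pd.DataFrame({
    "sat_score": [1200.0, 1350.0, 1100.0, 1500.0],
    "saf_s_11": [6.0, 7.5, 5.5, 8.0],
    "com_s_11": [5.0, 6.5, 6.0, 7.0],
})
survey_fields = ["saf_s_11", "com_s_11"]

# Column "sat_score", then labels picked with plain brackets.
a = combined.corr()["sat_score"][survey_fields]
# Column "sat_score", then labels picked with .loc.
b = combined.corr()["sat_score"].loc[survey_fields]
# Row "sat_score" (same values by symmetry), then labels picked with .loc.
c = combined.corr().loc["sat_score"].loc[survey_fields]

print(a.equals(b))
print((a - c).abs().max() < 1e-12)
```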

Hope this helps.


This was a fantastic reply. This makes much more sense. I appreciate the time you took to answer.


You can use type to check for the data type.

>>> import pandas as pd
>>> type(combined.corr())
<class 'pandas.core.frame.DataFrame'>

Whatever follows combined.corr() must then be a valid operation on a pd.DataFrame object, such as indexing, an attribute, or a method.