Where can I find further quality info on calling pd.corr()?

My Code:

combined.corr()["sat_score"][survey_fields] # why does this work?

What I expected to happen:
I had trouble calling the corr method correctly when comparing the sat_score series against the columns listed in survey_fields. The documentation (pandas.DataFrame.corr, pandas 1.3.2 documentation) gave no useful information on why the code above works. So my question is: where can I find useful examples whenever I am in doubt about how to call a pandas method correctly?

Thank you in advance for the feedback.

What actually happened:

# The solution code worked but I would have liked to figure it out by myself :-)

To demystify this line of code and figure out what it’s doing, try breaking it up. Start with

combined.corr()

followed by

combined.corr()["sat_score"]

and then ultimately the entire line of code. This should give you a better picture of how it all fits together and what each component is doing.
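
A minimal sketch of that step-by-step approach, for anyone who wants to run it. The small `combined` dataframe, its values, and the two entries in `survey_fields` are made up for illustration; the real project data is much larger.

```python
import pandas as pd

# Toy stand-in for `combined`; values and survey columns are invented.
combined = pd.DataFrame({
    "sat_score": [1100.0, 1250.0, 1400.0, 1180.0],
    "saf_s_11": [6.0, 7.1, 8.3, 6.5],
    "rr_s": [80.0, 92.0, 75.0, 88.0],
})
survey_fields = ["saf_s_11", "rr_s"]

# Step 1: DataFrame.corr() correlates every column with every other column.
step1 = combined.corr()
# Step 2: picking one column of that matrix gives a Series
# whose index is the original column names.
step2 = step1["sat_score"]
# Step 3: a list of labels selects only those entries of the Series.
step3 = step2[survey_fields]

print(type(step1).__name__, step1.shape)
print(type(step2).__name__, list(step2.index))
print(step3)
```

Printing the type and index after each step shows what the next `[]` has to match.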

Unfortunately for these types of situations, there really isn’t a “one-stop-solution” where you can get an explanation or be shown examples that will make things clear for you…EXCEPT…there is and you went there: the DQ community! :sunglasses:

Hi Mike,

Thanks for the quick reply and the community response.

My problem with the above code is probably in the way indexing is done in Python Pandas. Up until this point, I had probably gotten used to the fact that indexing was done on the left side of the method call or in some cases inside the method call (e.g. x=‘some column’, y=‘some other column’).

I now understand that in some cases I can do e.g.:


which will yield similar results. However, the following is not allowed:
combined["sat_score"][survey_fields].corr()
as this leads to an index failure.

So this leaves me a bit confused about the way indexing (or subindexing) works in Pandas. Is there a general rule of thumb or is it a case-by-case method-dependent approach?

Kind regards,

My apologies @mln for not replying sooner; your response was buried in my feed. I wasn’t notified of your post because I wasn’t directly tagged and it wasn’t a reply to my message, just another reply to the original post. Sorry about that, and I hope this is still of some value to you!

I think the confusion might be coming from chaining methods and indexing together and drawing some false conclusions between similar-looking syntax. For example:

combined.mean()['sat_score']
combined['sat_score'].mean()

While this bit of code seems to just change the order of the method and the indexing, we are in fact using two entirely different methods here (dataframe.mean() vs series.mean()); the first one (combined.mean()) acts on a pandas dataframe while the second one (combined['sat_score'].mean()) acts on a pandas series. They appear to have the same syntax, but they are completely different methods. Check out the documentation for the dataframe version of mean and the series version of mean and you will see they have slightly different options for their arguments, as well as returning different pandas objects as results. So let’s break down these two lines of code to see how they are actually quite different from one another.

combined.mean()['sat_score']: this bit of code starts with a dataframe, finds the mean of all (numeric) columns and returns a pandas series that has an index that comes from the column names of the original dataframe. Placing ['sat_score'] after this returned series will use indexing to return a single scalar value for the mean of sat_score. So this strategy does a calculation on the entire dataframe and then returns one of those calculated values.

combined['sat_score'].mean(): this bit of code starts with the same dataframe but is immediately indexed with ['sat_score'] to return a pandas series that contains only the SAT scores. This series will have the same index as the original dataframe (i.e. 0, 1, 2, 3, …, 360, 361, 362). We then calculate the mean of just this one column to produce the same single scalar value: 1223.4388059701494. So this strategy starts by reducing the size of our dataset and then does a calculation on this reduced dataset to produce the desired result. For this reason, I would use this strategy rather than the first one, because it makes sense to me to reduce the data down to what I’m after before performing any calculations.
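
Here is a small sketch comparing the two orderings on a made-up dataframe (the values are invented, so the mean differs from the real one quoted above). Passing `numeric_only=True` is my addition: it should keep the dataframe version working on newer pandas releases when text columns are present, though it is not needed for this all-numeric toy data.

```python
import pandas as pd

# Toy stand-in for `combined`; values are invented.
combined = pd.DataFrame({
    "sat_score": [1100.0, 1250.0, 1400.0],
    "saf_s_11": [6.0, 7.0, 8.0],
})

# DataFrame.mean(): one mean per column, returned as a Series
# indexed by the column names.
all_means = combined.mean(numeric_only=True)
from_frame = all_means["sat_score"]

# Series.mean(): select the column first (indexed by row labels 0, 1, 2),
# then reduce it to a single scalar.
sat = combined["sat_score"]
from_series = sat.mean()

print(list(all_means.index))  # column names
print(list(sat.index))        # row labels
print(from_frame, from_series)
```

Both routes give the same number, but the intermediate objects carry different indexes, which is what decides which `[]` is valid next.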

This is why I originally suggested breaking up these lines of code to see what is produced at each step. Knowing what each code snippet produces will help you understand why some syntax works while similar-looking syntax may not. Let’s break down that last example of yours to see why it produces an error.

combined["sat_score"][survey_fields].corr(): here we start with the dataframe and index it to return a pandas series of the SAT scores. Keep in mind this series will have the same index as combined…integer values from 0 to 362. So when we follow up this series with [survey_fields] we will get an indexing error because combined["sat_score"] does not have an index that’s compatible with [survey_fields].
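
A quick way to see this failure on toy data (the dataframe and the `survey_fields` entries below are made up). Note also that Series.corr() needs a second series to compare against, so the trailing `.corr()` would not work on its own even if the indexing succeeded. The last lines show one possible "select first" alternative using DataFrame.corrwith; that is my suggestion, not the course solution.

```python
import pandas as pd

# Toy stand-in for `combined`; values and survey columns are invented.
combined = pd.DataFrame({
    "sat_score": [1100.0, 1250.0, 1400.0, 1180.0],
    "saf_s_11": [6.0, 7.1, 8.3, 6.5],
    "rr_s": [80.0, 92.0, 75.0, 88.0],
})
survey_fields = ["saf_s_11", "rr_s"]

# The raw column is indexed by row labels 0..3, so column names
# are not valid labels for it and the lookup raises KeyError.
try:
    combined["sat_score"][survey_fields]
    raised = False
except KeyError as err:
    raised = True
    print("KeyError:", err)

# Alternative ordering: pick the survey columns first, then correlate
# each of them with the sat_score column.
alt = combined[survey_fields].corrwith(combined["sat_score"])
expected = combined.corr()["sat_score"][survey_fields]
print(alt)
```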

Lastly, let’s look at the original code that caused the confusion: combined.corr()["sat_score"][survey_fields]. We start with a dataframe, then calculate the correlation coefficient of every column against every other column, which returns a new dataframe whose index is made up of the column names from our original dataframe. We then use indexing to return one of those columns (["sat_score"]) as a pandas series, and finally we take a subsection of that series by using the index labels in survey_fields.

I think it would really help to read each of these lines of code from left to right, stopping at every . and every set of [] to evaluate what that Python object is: a dataframe, a series, a dataframe method, a series method, or a valid index for the object to its left. At each “crossing” of . and [], the left and right sides must be compatible, or you will get strange results if not an outright error.
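
One way to test that compatibility before chaining is to ask whether the labels you plan to use exist in the index of the object on the left. This is only a sketch on made-up data; `labels_fit` is a hypothetical helper name, not a pandas function.

```python
import pandas as pd

# Toy stand-in for `combined`; values and survey columns are invented.
combined = pd.DataFrame({
    "sat_score": [1100.0, 1250.0, 1400.0, 1180.0],
    "saf_s_11": [6.0, 7.1, 8.3, 6.5],
    "rr_s": [80.0, 92.0, 75.0, 88.0],
})
survey_fields = ["saf_s_11", "rr_s"]


def labels_fit(series, labels):
    """Hypothetical helper: True if every label is present in series.index."""
    return bool(pd.Index(labels).isin(series.index).all())


corr_col = combined.corr()["sat_score"]  # indexed by column names
raw_col = combined["sat_score"]          # indexed by row labels 0..3

print(labels_fit(corr_col, survey_fields))  # True
print(labels_fit(raw_col, survey_fields))   # False
```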