My apologies @mln for not replying sooner; your response was buried in my feed, and I wasn’t notified because I wasn’t tagged and your post wasn’t a direct reply to my message. It was just another reply to the original post. Sorry about that, and I hope this is still of some value to you!
I think the confusion might be coming from chaining methods and indexing together and drawing false conclusions from similar-looking syntax. For example, compare `combined.mean()['sat_score']` with `combined['sat_score'].mean()`.
While this bit of code seems to just change the order of the method and the indexing, we are in fact using two entirely different methods here (`dataframe.mean()` vs `series.mean()`): the first one (`combined.mean()`) acts on a pandas dataframe while the second one (`combined['sat_score'].mean()`) acts on a pandas series. They appear to have the same syntax, but they are completely different methods. Check out the documentation for the dataframe version of mean and the series version of mean and you will see that they take slightly different arguments and return different pandas objects. So let’s break down these two lines of code to see how they actually differ.
`combined.mean()['sat_score']`: this bit of code starts with a dataframe, finds the mean of all (numeric) columns, and returns a pandas series whose index comes from the column names of the original dataframe. Placing `['sat_score']` after this returned series uses indexing to return a single scalar value: the mean of `sat_score`. So this strategy does a calculation on the entire dataframe and then returns one of those calculated values.
`combined['sat_score'].mean()`: this bit of code starts with the same dataframe but immediately indexes it with `['sat_score']` to return a pandas series that contains only the SAT scores. This series has the same index as the original dataframe (i.e. 0, 1, 2, 3, …, 360, 361, 362). We then calculate the mean of just this one column to produce the same single scalar value: `1223.4388059701494`. So this strategy starts by reducing the size of our dataset and then does a calculation on the reduced dataset to produce the desired result. For this reason, I would use this strategy rather than the first one: it makes logical sense to me to reduce the data down to what I’m after before performing any calculations.
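Here is a small sketch of the two strategies side by side. The `combined` below is a made-up four-row stand-in (the column names `school` and `survey_a` are hypothetical), not the real dataset, so the value it prints is not the one quoted above. I also pass `numeric_only=True` because newer pandas versions raise an error if `mean()` meets a text column.

```python
import pandas as pd

# Hypothetical stand-in for `combined`; the real dataframe has 363 rows and many more columns.
combined = pd.DataFrame({
    "school": ["A", "B", "C", "D"],
    "sat_score": [1100.0, 1200.0, 1300.0, 1400.0],
    "survey_a": [6.0, 7.0, 8.0, 9.0],
})

# Strategy 1: average every numeric column first, then pick one label out of the result.
column_means = combined.mean(numeric_only=True)  # Series indexed by column name
mean_1 = column_means["sat_score"]

# Strategy 2: pick the column first, then average only that column.
sat_scores = combined["sat_score"]               # Series indexed by row label (0, 1, 2, 3)
mean_2 = sat_scores.mean()

print(list(column_means.index))  # column names
print(list(sat_scores.index))    # row labels
print(mean_1, mean_2)            # same number either way
```

Both routes end at the same number, but the intermediate series are indexed very differently, which is what matters once you start chaining more indexing onto them.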
This is why I originally suggested breaking up these lines of code to see what is produced at each step. Knowing what each code snippet produces will help you understand why some syntax works while similar-looking syntax may not. Let’s break down that last example of yours to see why it produces an error.
`combined["sat_score"][survey_fields].corr()`: here we start with the dataframe and index it to return a pandas series of the SAT scores. Keep in mind this series will have the same index as `combined`: integer values from 0 to 362. So when we follow up this series with `[survey_fields]` we will get an indexing error, because `combined["sat_score"]` does not have an index that’s compatible with `[survey_fields]`.
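To see that failure in isolation, here is a minimal sketch with toy data. The column names in `survey_fields` are hypothetical placeholders, and I’m assuming a reasonably recent pandas, where looking up labels that are not in a series’ index raises a `KeyError`.

```python
import pandas as pd

# Toy stand-in for `combined`; "survey_a" and "survey_b" are hypothetical column names.
combined = pd.DataFrame({
    "sat_score": [1100.0, 1200.0, 1300.0, 1400.0],
    "survey_a": [6.0, 7.0, 8.0, 9.0],
    "survey_b": [5.0, 7.0, 6.0, 8.0],
})
survey_fields = ["survey_a", "survey_b"]

sat_scores = combined["sat_score"]
print(list(sat_scores.index))  # row labels, not column names

try:
    sat_scores[survey_fields]  # asks the series for rows labelled "survey_a", "survey_b"
    lookup_failed = False
except KeyError as err:
    lookup_failed = True
    print("KeyError:", err)
```

The series only knows about row labels, so column names mean nothing to it.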
Lastly, let’s look at the original code that caused the confusion: `combined.corr()["sat_score"][survey_fields]`. We start with a dataframe, then calculate the correlation coefficient of every column against every other column, which returns a new dataframe whose index is made up of the column names from our original dataframe. We then use indexing to return one of those columns (`["sat_score"]`) as a pandas series, and finally we take a subsection of that series by using the index labels in `survey_fields`.
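The same chain, split into one step per line, might look like the sketch below. Again the data and the names in `survey_fields` are made up for illustration; every column is numeric here so that `corr()` runs without extra arguments.

```python
import pandas as pd

# Toy stand-in for `combined`; "survey_a" and "survey_b" are hypothetical column names.
combined = pd.DataFrame({
    "sat_score": [1100.0, 1200.0, 1300.0, 1400.0],
    "survey_a": [6.0, 7.0, 8.0, 9.0],
    "survey_b": [5.0, 7.0, 6.0, 8.0],
})
survey_fields = ["survey_a", "survey_b"]

# Step 1: dataframe -> dataframe whose rows AND columns are the original column names.
correlations = combined.corr()

# Step 2: pick one column of that table -> series indexed by column name.
sat_correlations = correlations["sat_score"]

# Step 3: the labels in survey_fields now exist in the index, so the lookup works.
survey_correlations = sat_correlations[survey_fields]
print(survey_correlations)
```

The lookup in step 3 succeeds only because `corr()` moved the column names into the index.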
I think it would really help to read each of these lines of code from left to right and stop at every `.` and every set of `[]` to evaluate what that Python object is: is it a dataframe, a series, a dataframe method, a series method, or a valid index for the object to its left? At each “crossing” of `.` and `[]` the left and right sides must be compatible, or you will get strange results if not an outright error.
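If it helps, that habit can be turned into a tiny helper. This is only a sketch: `describe` is a name I made up, not a pandas function, and the dataframe is toy data with hypothetical column names.

```python
import pandas as pd

def describe(label, obj):
    """Print what kind of object a step produced and return its type name."""
    if isinstance(obj, (pd.DataFrame, pd.Series)):
        detail = f"index starts with {list(obj.index[:3])}"
    else:
        detail = "no index (plain value)"
    print(f"{label}: {type(obj).__name__}, {detail}")
    return type(obj).__name__

# Toy stand-in for `combined`; "survey_a" is a hypothetical column name.
combined = pd.DataFrame({
    "sat_score": [1100.0, 1200.0, 1300.0, 1400.0],
    "survey_a": [6.0, 7.0, 8.0, 9.0],
})

# Stop at each "." and "[]" of combined.mean()['sat_score'] and look at the result.
kinds = [
    describe("combined", combined),
    describe("combined.mean()", combined.mean()),
    describe("combined.mean()['sat_score']", combined.mean()["sat_score"]),
]
```

Reading the printed index at each stop tells you which labels the next `[]` is allowed to use.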