Calculating average values from a df column based on values from other column

Screen Link: Learn data science with Python and R projects

I have been trying to figure out how the average for the casual and registered were calculated in relation to the workingday column. The following screenshot is from the mission.

image

My Code:

I tried to write it as a boolean array to which I would apply the Series/DataFrame.mean() method. Then I tried doing Series.sum() method but it did not work.

This is for my pure curiosity

bike_sharing["workingday"].value_counts() 

Here I know that there are 231 non-working days in the data set.
Then I tried

bike_sharing["casual"].sum() / 231

But the answer is not even close.

What am I missing here in order to calculate the average values like they are in the table?

Hi @vallentin.ciocirlan I was just about to shutdown the computer for the day and saw your question which piqued my curiousity as well!

I haven’t quite figured out how to code it yet but I believe the necessary strategy would be to group the data by workingday and non-working day and then take an average on the casual and registered columns.

I’m thinking: add another column with boolean values for is_a_work_day and use that to filter your data and get the averages you see in the table.

Maybe??

FIRST EDIT: I see now that the dataset already has this column! (workingday) :man_facepalming:

So in the end, my strategy appears to have been sound! :laughing:

I’m not sure if you’ve reached the section on data aggregation yet but here is how DQ generated that table with one line of code:

print(bike_sharing.groupby('workingday').agg(np.mean)[['casual', 'registered']])

LAST EDIT:
Here is how you would find these averages without using data aggregation (ie by filtering the data and then finding the mean…which is code you have definitely seen so far)

print(bike_sharing[bike_sharing['workingday']==0][['casual', 'registered']].mean())
print(bike_sharing[bike_sharing['workingday']==1][['casual', 'registered']].mean())

Hope this satiates your curiousity! :smiley:

1 Like

Yess oh gosh, thank you @mathmike314!!

I was intuitively looking last night to use a boolean array to get this, I knew somehow it had to be a boolean array but where I stumbled myself was in which order should I pass the boolean.

My initial thoughts were:

bike_sharing[bike_sharing["workingday"] ==  0, "casual"].mean() 

But I was getting a boolean dtype list with indexes from 1 to 730.

How should one think about the chaining method? What comes first? When can I use it?

EDIT: Not sure that the code you provided (see below) can be considered boolean indexing, so I apologize for the wrong terminology

I believe this is still considered boolean indexing because bike_sharing["workingday"] == 0 does create a boolean array which we then use to index bike_sharing.

I’m not sure what the “correct answer” is here but I always like to filter the rows first (bike_sharing[bike_sharing['workingday']==0]) and then select the columns ([['casual', 'registered']]).

That’s a very good initial thought and it would have worked if you were to use .loc indexing like so:

bike_sharing.loc[bike_sharing["workingday"] ==  0, "casual"].mean() 
1 Like

Wow, that is a big changer!
Thanks a lot!

Obv., your code is faster.