# Query on logic behind correlation between categorical columns

Scatter Plots And Correlations | Dataquest

My query is regarding the example provided for correlation with a categorical column.
The workingday column has only 2 values, 1 and 0: 1 indicates a working day and 0 indicates a non-working day. It has a negative correlation with the casual column (-0.52) and a positive correlation with the registered column (+0.30). Even if the binary encoding is reversed, with 0 indicating a working day, only the direction of the correlation changes and my query remains the same.

The logic provided by DQ says: "... registered users tend to use the bikes more on working days (to commute to work probably), while casual (non-registered) users tend to rent the bikes more on the weekends and holidays"

My query is: how do you know that registered users are using the bikes on working days? If the correlation for registered users is positive (+0.30), can't it be because registered users are cycling during weekends as well? In other words, what is the thought process that links the positive correlation of registered users with '1' under the workingday column and not '0', or more with '1' and less with '0'? We can't assume that registered users are working class and casual users are unemployed people who shop on weekends only. If that was the background, I would have understood the logic!

Below is the example verbatim:

One example of a categorical column (also called categorical variable) is the workingday column. This column describes the type of day: a working day or a non-working day (weekend or holiday).

bike_sharing['workingday'].value_counts()
1 500
0 231
Name: workingday, dtype: int64
Although it's categorical, the workingday column is encoded with numbers (1 means a working day and 0 means a non-working day).

Because it's encoded with numbers, we can calculate correlations using Series.corr(). For instance, let's calculate its correlation with the casual and registered columns.

bike_sharing.corr()['workingday'][['casual', 'registered']]
casual -0.518044
registered 0.303907
Name: workingday, dtype: float64
We can see a negative correlation with the casual column (-0.52), and a positive correlation with the registered column (+0.30).

These values suggest that registered users tend to use the bikes more on working days (to commute to work probably), while casual (non-registered) users tend to rent the bikes more on the weekends and holidays (maybe to spend some leisure time).

That is possible. That's why they state:

These values suggest that registered users tend to use the bikes more on working days (to commute to work probably)

They state that the values suggest that registered users might be using bikes more on working days.

This has nothing to do with who is employed and who isn't, based on the content (I haven't checked any of the previous steps in detail to be sure of this, however).

The correlation values suggest that registered users might be using bikes more on working days. Based on that, it's a reasonable assumption that registered users might be using the bikes to commute. Is it the only reason? No, that's why the next Mission Step is on correlation vs causation.
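To make the link between the sign and the '1' group concrete: when one column only takes the values 0 and 1, the correlation is positive exactly when the average of the other column is higher on the '1' rows than on the '0' rows. Here is a minimal sketch of that, using made-up toy numbers (not the real bike_sharing data) and a hand-written `pearson` helper that I'm defining only for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def group_mean(flags, values, flag):
    """Average of `values` over the rows where `flags` equals `flag`."""
    picked = [v for f, v in zip(flags, values) if f == flag]
    return sum(picked) / len(picked)

# Made-up toy data for one week, NOT the real bike_sharing numbers.
workingday = [1, 1, 1, 1, 1, 0, 0]
registered = [4000, 4200, 3900, 4100, 4300, 2500, 2700]
casual = [600, 500, 700, 550, 650, 1800, 2000]

r_registered = pearson(workingday, registered)
r_casual = pearson(workingday, casual)

# Mean on working days (1) minus mean on non-working days (0).
diff_registered = group_mean(workingday, registered, 1) - group_mean(workingday, registered, 0)
diff_casual = group_mean(workingday, casual, 1) - group_mean(workingday, casual, 0)

print("registered:", r_registered, diff_registered)  # both positive
print("casual:", r_casual, diff_casual)              # both negative
```

Note that the toy registered users still ride on the '0' days. The positive sign only says their average is higher on '1' days, not that they never ride on weekends. On the real data, comparing `bike_sharing.groupby('workingday')['registered'].mean()` for the two groups should show the same kind of gap.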

(to commute to work probably)

Maybe there's data corresponding to what time registered users used bikes during the working days. If the time suggests that they used bikes more in peak hours (the time when most people might travel to and from work), then it is a reasonably strong assumption to make. But without more data and more testing (as the next Step talks about) we can't confirm that in any way.

Similarly, the correlation values suggest that casual users might be renting bikes more during weekends. Again, based on that, it's a reasonable assumption that casual renters might be using bikes more for leisure. And, again, it's an assumption and not a definitive statement.

Maybe the service has pricing tiers. For example, a one-time ride might cost $10. People don't want to travel to work and back on those bikes and end up paying something like $100 a week. So, maybe those people are the casual renters who only use the service now and then, on the weekends when they have time. That doesn't mean nobody rents casually during the weekdays; different people might be doing that. But people who do want to travel to work and back, or want to use the bikes more frequently, are the registered users, who might pay only $40 per week. That would explain the usage pattern better and also support the correlation values to an extent.

More data and analysis and testing always help to try and explain some pattern.

Either way, the next Mission Step on Correlation vs Causation should give you additional insight on how to interpret these.

Thanks. I think what you mentioned about additional testing is the key.

My other query is about the sign of the correlation, because it changes depending on whether a working day is coded as 0 or 1. Maybe it helps answer the original query as well. Do let me know if the below makes sense:

1. The casual and registered correlations have opposite signs, which probably hints that bike usage reverses for those two sets of people during the week. It could be either:
a) registered users cycle less as the weekend nears and least on weekends, while casual users do the opposite (this is what is suggested in the DQ example), OR
b) vice versa: registered users "can" cycle more towards the end of the week (maybe because they have cycling events planned), while casual users use bikes more during weekdays and less on weekends (let's say because they stay at home on weekends and run errands on weekdays). I am assuming b) is possible and is not completely wrong ... is that why more testing is required?

I got the answer ... only 1a) is valid and not 1b). More explanation is provided on screen 522-1. The scatter plot clearly shows that casual users are taking bikes out on '0' (i.e. non-working days) and registered users on '1'.
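On the earlier point about the sign changing with the 0/1 coding: a small sketch, again with made-up toy numbers rather than the real bike_sharing data, and an illustrative `pearson` helper of my own. Recoding working days as 0 instead of 1 flips the sign of the correlation but keeps its size, so the reading "more registered rides on working days" stays the same under either coding.

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up toy data, NOT the real bike_sharing numbers.
workingday = [1, 1, 1, 1, 1, 0, 0]      # 1 = working day
registered = [4000, 4200, 3900, 4100, 4300, 2500, 2700]

# Reverse the coding: now 1 = non-working day, 0 = working day.
non_workingday = [1 - w for w in workingday]

r_original = pearson(workingday, registered)
r_flipped = pearson(non_workingday, registered)

print(r_original)  # positive: more rides where workingday == 1
print(r_flipped)   # same size, negative: fewer rides where non_workingday == 1
```

Either way, the sign has to be read against whichever group is coded as 1. Which group actually has the higher counts does not depend on the coding.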

Bar Plots, Histograms, Distributions | Dataquest
