Screen Link:
Scatter Plots And Correlations | Dataquest
My query is about the example given for the correlation of a categorical column.
The workingday column has only two values, 1 and 0: 1 indicates a working day and 0 a non-working day. It has a negative correlation with the casual column (-0.52) and a positive correlation with the registered column (+0.30). Even if the binary encoding were reversed, with 0 indicating a working day, only the sign of the correlation would change, and my query would remain the same.
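To illustrate the point about reversing the encoding, here is a minimal sketch on made-up data (the `workingday` and `rentals` series below are hypothetical, not the real bike_sharing columns). It suggests that swapping 0 and 1 flips the sign of the Pearson correlation and leaves its magnitude unchanged:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical data: 1 = working day, 0 = non-working day.
workingday = pd.Series(rng.integers(0, 2, size=200))
# Made-up rental counts that are, on average, higher on working days.
rentals = pd.Series(3000 + 1000 * workingday + rng.normal(0, 800, size=200))

r_original = workingday.corr(rentals)
# Reverse the encoding: 0 = working day, 1 = non-working day.
r_reversed = (1 - workingday).corr(rentals)

print(r_original, r_reversed)
```

The two printed values should differ only in sign, which is why the choice of which category gets the 1 does not change the interpretation.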
The explanation provided by DQ says: "...registered users tend to use the bikes more on working days (to commute to work probably), while casual (non-registered) users tend to rent the bikes more on the weekends and holidays"
My query is: how do you know that registered users are using the bikes on working days? If the correlation for registered users is positive (+0.30), can't it be because registered users are cycling during weekends as well? In other words, what is the thought process that links the positive correlation of registered users with '1' in the workingday column and not '0', or more with '1' and less with '0'? We can't assume that registered users are working people and casual users are unemployed people who shop only on weekends. If that were the background, I would have understood the logic!
Below is the example verbatim:
One example of a categorical column (also called categorical variable) is the workingday column. This column describes the type of day: a working day or a non-working day (weekend or holiday).
bike_sharing['workingday'].value_counts()
1 500
0 231
Name: workingday, dtype: int64
Although it's categorical, the workingday column is encoded with numbers (1 means a working day and 0 means a non-working day).
Because it's encoded with numbers, we can calculate correlations using Series.corr(). For instance, let's calculate its correlation with the casual and registered columns.
bike_sharing.corr()['workingday'][['casual', 'registered']]
casual -0.518044
registered 0.303907
Name: workingday, dtype: float64
We can see a negative correlation with the casual column (-0.52), and a positive correlation with the registered column (+0.30).
These values suggest that registered users tend to use the bikes more on working days (to commute to work probably), while casual (non-registered) users tend to rent the bikes more on the weekends and holidays (maybe to spend some leisure time).
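One way to sanity-check this reading: for a column that only takes the values 0 and 1, the Pearson correlation has the same sign as the difference between the mean of the other column on the 1-days and its mean on the 0-days. So a positive value for registered says that the average registered count is higher on working days than on non-working days, not that registered users never ride on weekends. The sketch below illustrates this with made-up numbers (the `demo` frame and all of its values are assumptions for illustration, not the real bike_sharing data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for bike_sharing: 500 working days, 231 non-working days.
workingday = np.array([1] * 500 + [0] * 231)
# Made-up daily counts: registered higher on working days, casual lower.
registered = np.where(workingday == 1, 4000, 3000) + rng.normal(0, 1500, workingday.size)
casual = np.where(workingday == 1, 600, 1400) + rng.normal(0, 500, workingday.size)

demo = pd.DataFrame(
    {"workingday": workingday, "registered": registered, "casual": casual}
)

# Pearson correlation of each count column with the 0/1 column.
corr = demo[["registered", "casual"]].corrwith(demo["workingday"])

# Mean count per day type, and (working-day mean) - (non-working-day mean).
means = demo.groupby("workingday")[["registered", "casual"]].mean()
mean_diff = means.loc[1] - means.loc[0]

print(corr)
print(means)
print(mean_diff)
```

On the real data, the equivalent check would be to group bike_sharing by workingday and compare the mean casual and registered counts for 0 and 1. Both groups can have many rentals; the sign of the correlation only tells you which group has the higher average.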