Convert column to categorical data type and dummy coding

Hello everyone,

I have a general question:

I don’t understand why we first convert a column into a categorical data type and then convert it into a dummy column instead of directly creating dummy variables for that column.

For example:

# Select just the remaining text columns and convert them to categorical
text_cols = transform_df.select_dtypes(include=['object']).columns
for col in text_cols:
    transform_df[col] = transform_df[col].astype('category')

# Create dummy columns, then drop the original text columns
transform_df = pd.concat([
    transform_df,
    pd.get_dummies(transform_df.select_dtypes(include=['category']))
], axis=1).drop(text_cols, axis=1)

To improve performance: the category dtype saves memory and speeds up code, especially when a column has only a few distinct values.

:bulb: Tip: in pandas, it's good practice to cast categorical features to the category dtype, because operations on such columns are often much faster than on the object dtype.
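
To make the memory point concrete, here is a minimal sketch that compares the same column stored as object and as category. The data (`colors`, three distinct strings repeated many times) is made up for illustration, and the exact byte counts will depend on your pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 100,000 rows drawn from only three distinct strings
rng = np.random.default_rng(0)
colors = pd.Series(rng.choice(["red", "green", "blue"], size=100_000))

as_object = colors.astype("object")
as_category = colors.astype("category")

# deep=True counts the Python string objects, not just the pointers to them
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object:   {object_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")
```

The category version stores each distinct string once plus a small integer code per row, so the saving shrinks as the number of distinct values grows.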


A pandas categorical helps keep track of the categories it has seen, available in s.cat.categories. If the categories in the training and test sets don't match, you can be alerted, provided you code such a check. A mismatch causes either a different number of columns after applying one-hot encoding (OHE) to train vs. test, or the same number of columns with different meanings, which you can't detect if you only check the data frame's shape.
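
Here is a rough sketch of that idea. The frames `train` and `test` and the `color` column are invented for the example; the point is only that fixing the categories from the training data keeps the dummy columns aligned and makes an unseen-value check easy to write.

```python
import pandas as pd

# Hypothetical train/test split where test contains an unseen value
train = pd.DataFrame({"color": ["red", "green", "blue"]})
test = pd.DataFrame({"color": ["red", "purple"]})

# Encoding each frame independently gives different columns
naive_train = pd.get_dummies(train)
naive_test = pd.get_dummies(test)

# Fix the set of categories from the training data only
color_dtype = pd.CategoricalDtype(categories=sorted(train["color"].unique()))

# The coded check: which test values were never seen in training?
unseen = set(test["color"]) - set(color_dtype.categories)
if unseen:
    print(f"Unseen categories in test: {unseen}")

# Casting both frames to the same dtype yields identical dummy columns;
# the unseen value becomes NaN and gets zeros in every dummy column
safe_train = pd.get_dummies(train.astype({"color": color_dtype}))
safe_test = pd.get_dummies(test.astype({"color": color_dtype}))

print(list(naive_train.columns), list(naive_test.columns))
print(list(safe_train.columns), list(safe_test.columns))
```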

Besides better tracking of training-serving skew, the categorical type helps you label encode (if you don't want to do OHE): each unique text value seen in the training set gets an integer code (0 to n-1 for n unique values), and an unseen test item is encoded as -1 when the test column is cast with the training categories.
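
A small sketch of the label-encoding case, assuming made-up `train` and `test` series. The codes come from `.cat.codes`, and the -1 appears because a value outside the training categories becomes missing when cast.

```python
import pandas as pd

# Hypothetical training and test values
train = pd.Series(["cat", "dog", "fish"])
test = pd.Series(["dog", "hamster"])  # "hamster" was never seen in training

# Categories are taken from the training data only
animal_dtype = pd.CategoricalDtype(categories=sorted(train.unique()))

train_codes = train.astype(animal_dtype).cat.codes
test_codes = test.astype(animal_dtype).cat.codes

print(train_codes.tolist())  # [0, 1, 2]
print(test_codes.tolist())   # [1, -1]
```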


Technically, you don't have to. You can convert to dummies directly, but converting to a categorical type first is good practice.
@info.victoromondi mentioned one of the benefits; here are some others from the pandas user guide for Categorical data.

The categorical data type is useful in the following cases:

  • A string variable consisting of only a few different values. Converting such a string variable to a categorical variable will save some memory.
  • The lexical order of a variable is not the same as the logical order (“one”, “two”, “three”). By converting to a categorical and specifying an order on the categories, sorting and min/max will use the logical order instead of the lexical order.
  • As a signal to other Python libraries that this column should be treated as a categorical variable (e.g. to use suitable statistical methods or plot types).
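
The ordering point in the list above can be sketched like this, using the same "one", "two", "three" values. The series `s` is just an illustration.

```python
import pandas as pd

s = pd.Series(["three", "one", "two", "one"])

# Plain strings sort alphabetically, so "three" comes before "two"
lexical = s.sort_values().tolist()

# An ordered categorical sorts by the order given in the categories list
ordered_dtype = pd.CategoricalDtype(["one", "two", "three"], ordered=True)
s_ordered = s.astype(ordered_dtype)
logical = s_ordered.sort_values().tolist()

print(lexical)   # ['one', 'one', 'three', 'two']
print(logical)   # ['one', 'one', 'two', 'three']
print(s_ordered.min(), s_ordered.max())  # one three
```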


OK, thanks so much!

I had understood that converting to a categorical data type was only necessary when the column was numerical but needed to be encoded as categorical.
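
For what it's worth, that numerical case is easy to see in a small sketch. The frame and its `city` and `zip_group` columns are hypothetical; by default `pd.get_dummies` encodes text and category columns and passes numeric columns through untouched.

```python
import pandas as pd

# Hypothetical frame: one text column, one numeric column that is really a label
df = pd.DataFrame({
    "city": ["Paris", "Rome", "Paris"],
    "zip_group": [1, 2, 1],
})

# Text columns are dummy coded directly; the numeric column is left as-is
direct = pd.get_dummies(df)
print(sorted(direct.columns))

# After casting to category, the numeric column is dummy coded as well
df["zip_group"] = df["zip_group"].astype("category")
converted = pd.get_dummies(df)
print(sorted(converted.columns))
```

So for text columns the cast does not change which dummy columns you get; it matters for the numeric labels and for the tracking benefits mentioned earlier in the thread.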