What purpose does conversion to 'category' dtype serve?

Screen Link: https://app.dataquest.io/m/239/processing-and-transforming-features/3/dummy-coding

In any ML algorithm, categorical variables have to be encoded as dummy variables before fitting the model. However, get_dummies() can be used on string data (object dtype) as well.
My question is: what purpose does converting variables to ‘category’ dtype serve from a model-fitting perspective?
Unlike ‘R’, where variables declared as factors require no manual dummy encoding, what will the ‘category’ dtype help us with in Python?

It helps with fixing train-test skew. What if you see categories in train but not in test, or in test but not in train? The categorical type saves the categories seen in train, fills in NaN for values in test that were never seen in train, and drops extra columns from the test one-hot encoding that don’t appear in train. Applying pd.get_dummies to a categorical column is different from applying it to a string column.
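To make that last point concrete, here is a minimal sketch of the difference on a single Series (the values are just illustrative): on an object column, get_dummies only produces columns for the values actually present, while on a categorical column it produces one column per *category*, even for categories absent from the data.

```python
import pandas as pd

# Plain string (object) column: get_dummies only sees the values present.
s_str = pd.Series(['A'])
print(pd.get_dummies(s_str))   # only column A

# Categorical column with categories ['A', 'B']: get_dummies emits a column
# for every category, so B appears as an all-zero column even though no row
# contains it.
s_cat = pd.Series(['A'], dtype=pd.CategoricalDtype(['A', 'B']))
print(pd.get_dummies(s_cat))   # columns A and B
```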

Do you mean label encoding?
I’m not sure anything is “required”. If you wanted to feed factors into the algorithm, you would label encode; otherwise you could one-hot encode too. Some encoding schemes are better suited to decision-tree training,
so whether a language can do a certain type of encoding with less work isn’t a good reason to choose that encoding: it’s the problem and the analysis that inform the tools chosen, not vice versa.

@hanqi - Thanks for the response but I am still a bit confused about this statement -

Applying pd.get_dummies on categorical column is different than applying on string columns.

Question 1
Let us consider the train data frame shown below:
[image: train data frame]

If, in the above data frame, Cat_data is category dtype and text_data is object dtype, what would be different when I apply pd.get_dummies() to these two columns?

Question 2
In your response you also mention that

Categorical type saves the seen categories in train and fill in NaN if they don’t appear in test, and removes extra columns from test one hot encoding if they don’t appear in train.

Can you explain the above point using the test data shown below? (How are the unseen values ‘E’ and ‘five’ handled?)

[image: test data frame]

The difference I’m referring to is train-test skew (which requires two datasets), not just the result of pd.get_dummies on one dataset alone.

I’ll use a single-column Series to demonstrate, because it’s less typing and I can’t copy-paste data from your dataframe image.

import pandas as pd

s = pd.Series(['A','B'])
s_cat = s.astype('category')
train_categories = s_cat.cat.categories

s_test = pd.Series(['A'])   # B is in train but missing in test; the missing column B is added anyway in the output
pd.Categorical(s_test, categories=train_categories)

Output:
[A]
Categories (2, object): [A, B]
pd.get_dummies(pd.Categorical(s_test,categories=train_categories))  # B (in train but not in test) filled with 0

Output:
   A  B
0  1  0

The demo above shows a category in train but not in test; below is one in test but not in train.

pd.Categorical(pd.Series(['A','B','C']),categories=['A','B'])  # extra unique value C appeared in test

Output:
[A, B, NaN]
Categories (2, object): [A, B]
pd.get_dummies(pd.Categorical(pd.Series(['A','B','C']),categories=['A','B']))  # the number of resulting rows stays the same at 3

Output:
   A  B
0  1  0
1  0  1
2  0  0

I don’t think the string type can do such bookkeeping.

The columns after pd.get_dummies are a CategoricalIndex: CategoricalIndex(['A', 'B'], categories=['A', 'B'], ordered=False, dtype='category'). If you want to do further feature engineering, such as
summing across columns with s_cat_dummies['dummysum'] = s_cat_dummies.sum(axis=1), you will need to convert to a plain pd.Index to avoid TypeError: cannot insert an item into a CategoricalIndex that is not already an existing category.
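A minimal sketch of that workaround (the `dummysum` column name is just illustrative, matching the example above):

```python
import pandas as pd

s_cat = pd.Series(['A', 'B'], dtype='category')
s_cat_dummies = pd.get_dummies(s_cat)

# Replace the column index with a plain Index so new column labels can be
# inserted without the CategoricalIndex membership check.
s_cat_dummies.columns = pd.Index(list(s_cat_dummies.columns))
s_cat_dummies['dummysum'] = s_cat_dummies.sum(axis=1)
print(s_cat_dummies)
```

Each row has exactly one dummy set, so `dummysum` is 1 everywhere here.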

If there are NaNs in train interfering with the saved categories, you may have to convert them to a string or some sentinel value to prevent other TypeErrors.
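One way to sketch that: NaN is never stored as a category, so replacing it with a sentinel string (the name `'MISSING'` here is just an arbitrary choice) makes missingness an explicit category.

```python
import pandas as pd

s = pd.Series(['A', 'B', None])

# fillna with a sentinel before converting, so missing values become a
# real category instead of staying NaN outside the category list.
s_filled = s.fillna('MISSING').astype('category')
print(s_filled.cat.categories)   # includes 'MISSING'
```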

You can work with pd.Categorical directly, or access its methods through the Series.cat API: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.cat.html. There isn’t a DataFrame-level API because it doesn’t make sense to work on multiple columns at once.

Might as well introduce label-encoding since we’re on the topic of categoricals and encoding.

series.cat.codes, for example, is a convenient label encoder. Combined with categories saved from train, values in the test set not seen in train (NaN/None/new values) are encoded as -1. This may be bad because the new values are effectively ignored, so retraining with the new categories included may be needed if they are useful.
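A short sketch of that -1 behaviour, reusing the train/test pattern from the earlier demo:

```python
import pandas as pd

train = pd.Series(['A', 'B']).astype('category')
test = pd.Series(['A', 'C', None])

# Encode test using the categories learned from train; the unseen value 'C'
# and the missing value both get code -1.
codes = pd.Categorical(test, categories=train.cat.categories).codes
print(codes)   # [0, -1, -1]
```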

This kind of hacky stuff is hard to find in tutorials, and DQ may not have made it explicit. I learned it by reading the categorical docs and watching various PyData talks on YouTube.

To explain the big-picture idea of bookkeeping categories from train: if you just one-hot-encode the test set on whatever columns come out, the data will not be fed into the model properly. Assuming a regression, every coefficient is learned for a particular column. If you feed some other column from the test set into the model, it runs, but the result is meaningless. So even if you have the same number of columns in test after OHE as in train, and even the same unique categories, you must make sure they are in the same column order, because every run can generate different orderings depending on how PYTHONHASHSEED got initialized. Last year I spent months on this problem: Random_state after kernel restart
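If you don’t want to route everything through the category dtype, one alternative sketch of the same bookkeeping is to reindex the test dummy matrix onto the train columns, which fixes both the column set and the column order in one step:

```python
import pandas as pd

train_ohe = pd.get_dummies(pd.Series(['A', 'B']))
test_ohe = pd.get_dummies(pd.Series(['B', 'C']))

# Force the test matrix into exactly the columns (and order) seen in train:
# the train-only column A is filled with 0, the test-only column C is dropped.
aligned = test_ohe.reindex(columns=train_ohe.columns, fill_value=0)
print(aligned)
```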