The difference I’m referring to is train/test skew (which requires 2 datasets), not just the result of pd.get_dummies on 1 dataset alone.
I’ll use a single-column Series to demonstrate, because it’s less typing and I can’t copy-paste data from your dataframe image.
import pandas as pd
s = pd.Series(['A','B'])
s_cat = s.astype('category')
train_categories = s_cat.cat.categories
s_test = pd.Series(['A']) # 'B' is in train but missing from test; column B will still be added in the output
pd.Categorical(s_test,categories=train_categories)
Output:
[A]
Categories (2, object): [A, B]
pd.get_dummies(pd.Categorical(s_test,categories=train_categories)) # B (in train but not in test) filled with 0
Output:
   A  B
0  1  0
The demo above shows categories that are in train but not in test; below is the reverse, a category that appears in test but not in train.
pd.Categorical(pd.Series(['A','B','C']),categories=['A','B']) # extra unique value C appeared in test
Output:
[A, B, NaN]
Categories (2, object): [A, B]
pd.get_dummies(pd.Categorical(pd.Series(['A','B','C']),categories=['A','B'])) # the number of resulting rows stays the same at 3; the unseen value C gets an all-zero row
Output:
   A  B
0  1  0
1  0  1
2  0  0
I don’t think the plain string dtype can do such bookkeeping.
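For comparison, here is a small sketch of what happens with plain strings (the s_train_str/s_test_str names are my own): get_dummies only emits columns for values it actually sees, so train and test end up with different columns.
s_train_str = pd.Series(['A','B'])
s_test_str = pd.Series(['A'])
pd.get_dummies(s_train_str).columns # Index(['A', 'B'], dtype='object')
pd.get_dummies(s_test_str).columns # Index(['A'], dtype='object') -- column B is silently missing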
The columns after pd.get_dummies are a CategoricalIndex: CategoricalIndex(['A', 'B'], categories=['A', 'B'], ordered=False, dtype='category'). If you want to do further feature engineering like summing across columns, e.g. s_cat_dummies['dummysum'] = s_cat_dummies.sum(axis=1), you will need to convert the columns to a plain pd.Index to avoid TypeError: cannot insert an item into a CategoricalIndex that is not already an existing category.
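A minimal sketch of that conversion (s_cat_dummies and dummysum are names I made up for illustration):
s_cat_dummies = pd.get_dummies(s_cat) # columns are a CategoricalIndex
s_cat_dummies.columns = pd.Index(list(s_cat_dummies.columns)) # now a plain object Index
s_cat_dummies['dummysum'] = s_cat_dummies.sum(axis=1) # no more TypeError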
If there are NaNs in train that would end up in the saved categories, you may have to convert them to strings or some sentinel value first to prevent other TypeErrors.
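A possible sketch of that, using a made-up 'MISSING' sentinel before converting to categorical:
s_with_nan = pd.Series(['A', None, 'B'])
s_with_nan.fillna('MISSING').astype('category').cat.categories # Index(['A', 'B', 'MISSING'], dtype='object')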
You can work with pd.Categorical directly, or access its methods through the series.cat accessor API: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.cat.html. Yes, there isn’t a dataframe-level API, because it doesn’t make sense to work on multiple columns at once.
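For example, both routes below give the same categories (a tiny sketch):
pd.Categorical(['A','B','A']).categories # working with pd.Categorical directly
pd.Series(['A','B','A']).astype('category').cat.categories # same thing through the series.cat accessor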
Might as well introduce label-encoding since we’re on the topic of categoricals and encoding.
series.cat.codes, for example, is a convenient label encoder, outputting -1 for anything not in the saved categories, e.g. NaN/None/new values in the test set that were not seen in train. This may be bad because those new values are effectively ignored, so retraining with the new categories included may be needed if they are useful.
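A minimal sketch of that behaviour, reusing the train_categories saved earlier (the s_test_new name is just for illustration):
s_test_new = pd.Series(['A','C',None]) # 'C' and None were never seen in train
s_test_new.astype(pd.CategoricalDtype(categories=train_categories)).cat.codes # -> 0, -1, -1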
This kind of hacky stuff is hard to find in tutorials, and DQ may not have made it explicit. I learned it by reading the categorical docs and watching various PyData YouTube talks.
To explain the big-picture idea of bookkeeping the categories from train: if you just one-hot-encode the test set on whatever columns come out, the data will not be fed properly into the model. Assuming a regression, every coefficient is learned to be applied to a particular column. If you feed some other column from the test set into the model, yes it runs, but the result is meaningless. So even if you have the same number of columns in test after OHE as in train, and even if you have the same unique categories, you must make sure they are in the same column order, because every run can generate different orderings depending on how PYTHONHASHSEED got initialized. Last year I spent months on this problem: Random_state after kernel restart
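If you prefer not to rely on the categorical dtype alone, another way to force the same columns in the same order is to reindex the test dummies against the train dummies (a sketch; train_dummies/test_dummies are my own names, and reindex here is a separate alignment trick, not the categorical bookkeeping above):
train_dummies = pd.get_dummies(s_cat)
test_dummies = pd.get_dummies(pd.Series(['A'])) # plain strings: only column A comes out
test_aligned = test_dummies.reindex(columns=list(train_dummies.columns), fill_value=0) # same columns, same order as train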