One Hot Encoding in Scikit-Learn: How to feed the encoded data back into a data frame for building a machine learning model?

Hello,
I am using LabelEncoder and then OneHotEncoder to create dummy columns for the categorical variables of my data set. As an end result I get a numpy array with the encoded data.

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

I understand encoding and know how to apply it, but there is one thing I do not understand.

This data set consists of variables that are already numerical plus the categorical variables I want to encode. After encoding, I want to add the categorical variables back to the data set alongside the existing numerical variables. (I can do this for one variable by pd.concat-ing the array to the data frame, but not for many (10 or more) variables.)

I am not the only one. If you look at this post


then you see below similar questions like:
How does one feed the encoded data into a machine learning model? Especially now that you can't interpret the data or find which column to use as the label column.

Kind regards

I can do this for one variable by pd.concat-ing the array to the data frame, but not for many (10 or more) variables

I don’t understand this exactly. Are you thinking that OneHotEncoder can only be applied to a single column at a time rather than to the entire dataframe, so you have to pd.concat the encoded columns back together, and that such concatenation is too much extra code to write? This reminds me of a trick using reduce and pd.concat to conveniently concatenate dataframes by column. I forget where the source article is, but you can try to implement it as an exercise.
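A minimal sketch of that reduce + pd.concat trick (my reconstruction, not the original article's code): get_dummies each categorical column separately, then fold the pieces together column-wise, starting from the untouched numerical columns.

```python
from functools import reduce
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'],
                   'C': [1, 2, 3]})
cat_cols = ['A', 'B']

# One dummy frame per categorical column, prefixed with the source column name
encoded_parts = [pd.get_dummies(df[c], prefix=c) for c in cat_cols]

# reduce folds the list of frames into one via repeated column-wise concat,
# seeded with the numerical columns that should pass through untouched
combined = reduce(lambda left, right: pd.concat([left, right], axis=1),
                  encoded_parts,
                  df.drop(columns=cat_cols))
```

This way the amount of code stays constant no matter how many categorical columns there are.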

How about using pd.get_dummies? https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html
It allows you to work on the entire dataframe rather than feeding in column by column, and it leaves numerical columns alone while one-hot encoding the categorical columns.

df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'],
                   'C': [1, 2, 3]})

df

pd.get_dummies(df)       # Column C is left alone when other columns have strings
pd.get_dummies(df['C'])  # Column C is OHE when it's the only column used

You can invert a specific single column (recovering its pre-OHE values) from get_dummies output by boolean indexing with a 2D mask: the mask turns the zeros into NaN, which stack() then throws away.

df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'],
                   'C': [1, 2, 3]})
df.A
encoded = pd.get_dummies(df.A)
encoded
encoded[encoded == 1].stack().reset_index().drop(columns=[0, 'level_0'])

Similar to pd.get_dummies, OneHotEncoder allows working on the entire dataframe at once; the difference is that OneHotEncoder does not leave the numerical column alone.

from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'],
                   'C': [1, 2, 3]})
df
onehot_encoder = OneHotEncoder()
onehot_encoder.fit_transform(df)
onehot_encoder.categories_

ohe = OneHotEncoder(categories=[['a','b'],['a','b','c'],[1,2,3]])
ohe.fit_transform(df).todense()


# Removing 1 category for column B
ohe = OneHotEncoder(categories=[['a','b'],['a','b'],[1,2,3]], handle_unknown='ignore')  # the middle 3 columns for column B shrink to 2
ohe.fit_transform(df).todense()

# Adding extra category
ohe = OneHotEncoder(categories=[['a','b'],['a','b','c','d'],[1,2,3]])  
transformed = ohe.fit_transform(df).todense()
transformed

ohe.inverse_transform(transformed)

With the categories_ attribute, you can see what the encoder is working with, and with inverse_transform() you can recover the original predictors (albeit losing the column labels).

How does one feed the encoded data into a machine learning model?

I don’t exactly understand what the problem is here either.
All the sklearn estimators have a fit (or fit_predict) method where you can throw the 2D encoded data in to train the models.
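For example, with made-up labels (which you would keep in a separate structure from the predictors), the encoded frame feeds straight into any estimator:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({'A': ['a', 'b', 'a', 'b'],
                   'B': ['b', 'a', 'c', 'c'],
                   'C': [1, 2, 3, 4]})
y = [0, 1, 0, 1]  # made-up label column, held separately from the predictors

X = pd.get_dummies(df)              # 2D encoded predictors, still a DataFrame
clf = LogisticRegression().fit(X, y)
preds = clf.predict(X)
```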

Especially now that you can't interpret the data or find which column to use as the label column

The label column should be clearly defined by the modeler and placed into a separate data structure from the predictor columns before anything is fed into a model for training, so there should be no issue of not being able to identify the label column. For encoding of labels, there is LabelBinarizer: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html
and MultiLabelBinarizer: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html

Both are convenience classes that apply the LabelEncoder -> OneHotEncoder pipeline in a single step. You can of course use OneHotEncoder for the labels too. However, the reverse (using LabelBinarizer/MultiLabelBinarizer on the predictor columns as a substitute for OneHotEncoder) requires that all the columns have the same type (all str or all numerical).

All of LabelEncoder, OneHotEncoder, LabelBinarizer, and MultiLabelBinarizer have inverse_transform methods to recover the string names from before encoding. LabelBinarizer's inverse_transform even accepts predicted probabilities, so you can pipe the output of a linear model's decision_function method directly into inverse_transform. (Great feature!)
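A quick round trip with LabelBinarizer to illustrate (note that classes_ comes out sorted):

```python
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
Y = lb.fit_transform(['cat', 'dog', 'cat', 'bird'])  # one row per label, one column per class
lb.classes_                       # sorted: ['bird', 'cat', 'dog']
recovered = lb.inverse_transform(Y)  # back to the original string labels
```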

If I remember right, the sklearn API used to be unable to one-hot encode strings directly, so a LabelEncoder step to turn strings into integers was required first; now it can go straight from strings to one-hot encoded form.
Please help me clarify your question so I can give a more directed answer.

Hello hanqui,
Thanks a lot for your diligent answer.
My problem is, I think, related to:
OneHotEncoder does not leave that numerical column alone now.
However, there is a lot in it, and I have to go through it and test.
Thanks

If there are numeric columns you don’t want to be encoded, you can use pd.get_dummies. If you absolutely must use sklearn’s OneHotEncoder (to use sklearn.pipeline), you can remove the numeric columns first, do OHE, then concat them back. The private method df._get_numeric_data() is useful if you don’t care about possibly breaking code in the future when they change the API: https://stackoverflow.com/questions/25039626/how-do-i-find-numeric-columns-in-pandas