I can do this for one variable by pd.concat the array to the data frame, but not for many (10 or more) variables
I don’t understand this exactly. Are you thinking that OneHotEncoder can only be applied to a single column at a time rather than to the entire dataframe, so that you have to pd.concat the encoded columns back together, and that such concatenation is too much extra code to write? This reminds me of a trick using reduce and pd.concat to conveniently concatenate dataframes by column. I forgot where the source article is, but you can try to implement it as an exercise.
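Since I can’t point you to the article, here is a minimal sketch of the kind of thing I mean, assuming pd.get_dummies is acceptable for the per-column encoding; the column names are just from a toy example:
from functools import reduce
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'], 'C': [1, 2, 3]})
cat_cols = ['A', 'B']   # in your case this would be the 10-or-more categorical columns

# one encoded frame per column, then fold them together column-wise with reduce
encoded_parts = [pd.get_dummies(df[col], prefix=col) for col in cat_cols]
encoded = reduce(lambda left, right: pd.concat([left, right], axis=1), encoded_parts)

pd.concat([df.drop(columns=cat_cols), encoded], axis=1)   # glue back onto the untouched columns
The reduce call is only there to show the folding idea; a single pd.concat(encoded_parts, axis=1) does the same job.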
How about using pd.get_dummies? https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html
It lets you work on the entire dataframe rather than feeding it in column by column, and it leaves numerical columns alone while one-hot encoding the categorical ones.
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'], 'C': [1, 2, 3]})
df
pd.get_dummies(df)        # column C is left alone when other columns have strings
pd.get_dummies(df['C'])   # column C is OHE when it's the only column passed in
You can invert a specific single column (before OHE) back out of the get_dummies output by boolean indexing with a 2D mask, which turns the zeros into NaN so that stack() can throw them away.
df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'], 'C': [1, 2, 3]})
df.A
encoded = pd.get_dummies(df.A)
encoded
encoded[encoded == 1].stack().reset_index().drop(columns=[0, 'level_0'])   # 'level_1' holds the recovered categories
Similarly to pd.get_dummies, OneHotEncoder allows working on the entire dataframe at once; the difference is that OneHotEncoder does not leave the numerical column alone.
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'], 'C': [1, 2, 3]})
df
onehot_encoder = OneHotEncoder()
onehot_encoder.fit_transform(df)
onehot_encoder.categories_
ohe = OneHotEncoder(categories=[['a', 'b'], ['a', 'b', 'c'], [1, 2, 3]])
ohe.fit_transform(df).todense()
# Removing 1 category for column B
ohe = OneHotEncoder(categories=[['a','b'],['a','b'],[1,2,3]], handle_unknown='ignore')  # the 3 columns for B shrink to 2; the unseen 'c' encodes as all zeros
ohe.fit_transform(df).todense()
# Adding extra category
ohe = OneHotEncoder(categories=[['a','b'],['a','b','c','d'],[1,2,3]])
transformed = ohe.fit_transform(df).todense()
transformed
ohe.inverse_transform(transformed)
With the categories_ attribute you can see what the encoder is working with, and with inverse_transform() you can recover the original predictors (albeit losing the column labels).
How does one feed the encoded data into a machine learning model?
I don’t exactly understand what the problem is here, either.
All the sklearn algorithms have a fit or fit_predict method where you can throw in the 2D encoded data to train the models.
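For concreteness, here is a minimal sketch with LogisticRegression and a made-up label list; the y values are assumptions, not from your data:
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'], 'C': [1, 2, 3]})
y = ['yes', 'no', 'yes']                 # hypothetical labels, kept separate from the predictors

X = OneHotEncoder().fit_transform(df)    # sparse 2D matrix; most estimators accept it directly
clf = LogisticRegression().fit(X, y)     # fit takes the encoded 2D data as-is
clf.predict(X)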
Especially now that you can’t interpret the data or find which column to use as the Label Column
The label column should be clearly defined by the modeler and placed into a separate data structure from the predictor columns before anything is fed into a model for training, so there should be no issue of not being able to identify the label column. For encoding of labels, there is LabelBinarizer: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html
and MultiLabelBinarizer: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html
Both are convenience classes that apply the LabelEncoder -> OneHotEncoder pipeline in a single step. You can of course use OneHotEncoder for the labels too. However, the reverse (using LabelBinarizer/MultiLabelBinarizer on the predictor columns as a substitute for OneHotEncoder) requires that all the columns have the same type (all str or all numerical).
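A quick sketch of the two on made-up labels:
from sklearn.preprocessing import LabelBinarizer, MultiLabelBinarizer

lb = LabelBinarizer()
lb.fit_transform(['cat', 'dog', 'cat', 'bird'])          # one 0/1 column per class
lb.classes_

mlb = MultiLabelBinarizer()
mlb.fit_transform([{'cat', 'dog'}, {'bird'}, {'dog'}])   # each sample can carry several labels
mlb.classes_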
All of LabelEncoder, OneHotEncoder, LabelBinarizer and MultiLabelBinarizer have inverse_transform methods to recover the string names from before encoding. LabelBinarizer’s inverse_transform even accepts predicted probabilities, so you can pipe the output of a linear model’s decision_function method directly into inverse_transform (great feature!).
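A minimal sketch of that round trip, with toy data and LogisticRegression standing in for the linear model:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelBinarizer

X = [[0, 0], [1, 1], [2, 2], [3, 3]]
y = ['a', 'b', 'c', 'a']

lb = LabelBinarizer()
lb.fit_transform(y)                      # 0/1 matrix, one column per class

clf = LogisticRegression().fit(X, y)
scores = clf.decision_function(X)        # continuous scores, shape (n_samples, n_classes)
lb.inverse_transform(scores)             # picks the winning class per row, back to string labels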
If I remember right, the sklearn API used to not be able to one-hot encode strings directly, so a LabelEncoder step to turn the strings into integers was required first, but now OneHotEncoder can go straight from strings to one-hot encoded form.
Please help me clarify your question so I can give a more directed answer.