Categorical vs Dummy Variables


Mission Link: https://app.dataquest.io/m/89/introduction-to-decision-trees/3/converting-categorical-variables

Hi team,

In the Introduction to Decision Trees mission, why do we convert categorical columns to numerical columns instead of simply creating dummy variables for them? Wouldn’t that be much simpler and more efficient?

Thanks

4 Likes

Hi

I don’t want to create a new topic about this but I have the same doubts.

I don’t clearly understand what the difference is between using pd.Categorical and pd.get_dummies.

When should I use each one, or do they do the “same thing”?

import pandas as pd

col_name = ['age', 'workclass', 'fnlwgt', 'education', 'education_num',
            'marital_status', 'occupation', 'relationship', 'race', 'sex',
            'capital_gain', 'capital_loss', 'hours_per_week',
            'native_country', 'high_income']

income = pd.read_csv(
    'http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data',
    names=col_name)

# Columns with string (object) dtype
cat_cols = income.select_dtypes(include='object').columns

# Option 1: create dummy variables with pd.get_dummies
income_dummy = income.copy()
for col in cat_cols:
    dummies = pd.get_dummies(income_dummy[col], prefix=col)
    income_dummy = pd.concat([income_dummy, dummies], axis=1)

print(income_dummy.shape)

# Option 2: convert each column to integer codes with pd.Categorical
for col in cat_cols:
    cat = pd.Categorical(income[col])
    income[col] = cat.codes

income[cat_cols].head()

I see that with pd.get_dummies we created a lot of new columns… but does the model behave the same way when we fit this data (in this lesson, a decision tree)?

I have the same question
Unfortunately, nobody answered.

I’m wondering why no one has answered this yet. I don’t know when to use one or the other. Hopefully this comment will bump this thread?

1 Like

@spi

Let me refer you to this article on pd.Categorical: https://medium.com/swlh/categorical-data-in-pandas-9eaaff71e6f3

Let’s say you have a column or feature that describes room temperature as cold, warm, hot, and very hot.

Obviously, machine learning algorithms do not understand text, so we need to transform these to numbers. We can do so using pd.Categorical and pd.get_dummies.

With pd.Categorical, the data is transformed to the numerical values [0, 1, 2, 3]. This introduces order or rank to the data: 0 is less than 1, 1 is less than 2, and so on. In the original text there is no numerical difference between cold, warm, hot, and very hot, but the codes imply one.

To avoid this, pd.get_dummies is used. It is like one-hot encoding. It creates new features/columns [cold, warm, hot, very hot].

When the entry in the original room temperature is hot, the one-hot encoding is [0, 0, 1, 0].

The disadvantage of this method is that it increases the number of columns. However, it does not introduce rank or order to the data set.
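Here is a minimal sketch of both transforms on a made-up temperature column (the values are just for illustration):

import pandas as pd

# Hypothetical room-temperature column, just for illustration
temps = pd.Series(["cold", "warm", "hot", "very hot", "warm"])

# pd.Categorical: a single column of integer codes (note the default order
# is alphabetical, not cold < warm < hot < very hot)
print(pd.Categorical(temps).codes)   # [0 3 1 2 3]

# pd.get_dummies: one 0/1 column per category, no implied order
print(pd.get_dummies(temps))

Notice that the codes above follow alphabetical order by default, which is part of why the implied ranking can be misleading.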

1 Like

Thank you! I understand the concept, but have a hard time understanding when to use one or the other. I see that for decision trees we use pd.Categorical instead of creating new columns with pd.get_dummies, but I’m not sure why. And are there any other scenarios where we’d use one instead of the other?

1 Like

Hello @spi

Let me try to explain with this example. Imagine you have a series containing different types of blockchains. The data contains 10,000 rows:

from sys import getsizeof
import pandas as pd
import numpy as np

block_chain = np.random.choice(
    ["Bitcoin", "Ethereum", "EOSIO", "ZCash", "Corda", "Hyperledger", "Quorum"],
    size=10_000)

df = pd.Series(block_chain)

We have 7 unique members: ["Bitcoin", "Ethereum", "EOSIO", "ZCash", "Corda", "Hyperledger", "Quorum"], and the size of this series is:

print(f"Get Size of the original Series object: {getsizeof(df)} bytes")
Output:
Get Size of the original Series object: 637269 bytes

The original series has object dtype, that is, it contains strings. We need to convert these to numbers, which we can do with pd.Categorical or pd.get_dummies.

When deciding which one to use, you can look at cardinality, size, and rank. If the data has 10,000 rows and the number of unique values in a feature is close to 10,000, that is high cardinality. In this extreme case the feature has little predictive value and the column should be dropped.

Now assume the number of unique items is 50. That feature can have predictive value, so we keep it. If it is okay to introduce rank to the dataset [0 < 1, 1 < 2, ...], we use pd.Categorical. We may also have no other choice if the output of pd.get_dummies is so large that it crashes the program.
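As a rough sketch, you could check the cardinality of each string column before deciding (the thresholds below are arbitrary, and income is the adult DataFrame loaded earlier in the thread):

# income is the adult dataset loaded in the earlier post
obj_cols = income.select_dtypes(include='object').columns

for col in obj_cols:
    n_unique = income[col].nunique()
    if n_unique / len(income) > 0.9:
        print(f"{col}: nearly unique per row ({n_unique}) - consider dropping")
    elif n_unique > 50:
        print(f"{col}: high cardinality ({n_unique}) - pd.Categorical codes are safer")
    else:
        print(f"{col}: low cardinality ({n_unique}) - pd.get_dummies is fine")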

For our example, we have 7 unique items. The size when we use pd.Categorical is:

cat_df = pd.Categorical(df)
print(f"Get Size of the Categorical Series object: {getsizeof(cat_df)} bytes")
Output:
Get Size of the Categorical Series object: 10778 bytes

pd.Categorical converts the unique items to integer codes between 0 and 6. The size is significantly reduced compared to the original data.
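You can inspect the mapping yourself, continuing from the cat_df above:

print(cat_df.categories)    # the 7 unique blockchain names
print(cat_df.codes[:10])    # integer codes between 0 and 6 for the first 10 rows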

When we use pd.get_dummies, we get the following:

dummy_df = pd.get_dummies(df)
print(f"Get Size of the get dummy Series object: {getsizeof(dummy_df)} bytes")
Output:
Get Size of the get dummy Series object: 70160 bytes

The size is less than the original data but larger than when pd.Categorical is used.

To summarize:

  • Use pd.Categorical when you have a large number of unique items and pd.get_dummies might crash your program, or when it is okay to introduce rank to your data.
  • Use pd.get_dummies when the number of unique elements will not create a dataframe so large that it crashes your program, and when you do not want rank in your data.
1 Like

Hi @monorienaghogho

I agree with you that we need to be aware of the high cardinality problem, but I also heard recently that scikit-learn deals pretty well with high cardinality because it can use sparse matrices internally when doing machine learning. What is your opinion about that?

1 Like

Hi Spi,

I just went through this section, and what helped my understanding was thinking about how accurately you can estimate and assign numeric values to the categories. If that is unclear, you may be better off converting those categorical values into their own dummy variables.

An example of categorical values that are very clear would be a Likert scale, with values [strongly disagree, disagree, neither agree nor disagree, agree, strongly agree]. These values lie on a clear gradient and are relatively evenly spaced, so we could comfortably assign them the numerical values 1 through 5.

An example of categorical values that aren’t as clear, and may be better off converted to dummy variables, would be the type of flooring a living room has. How much better is spruce than pine? Tile vs. carpeting? You might cleverly assign a value based on pricing or some other measure, but it is objectively less clear than the Likert example.
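A small sketch of those two cases (the column values are made up): the Likert answers get explicit, ordered codes, while the flooring types get dummy columns so no fake ranking is introduced.

import pandas as pd

# Ordinal: Likert answers have a natural order, so ordered codes make sense
likert = pd.Series(["agree", "disagree", "strongly agree", "neither agree nor disagree"])
scale = ["strongly disagree", "disagree", "neither agree nor disagree",
         "agree", "strongly agree"]
print(pd.Categorical(likert, categories=scale, ordered=True).codes)   # [3 1 4 2]

# Nominal: flooring types have no natural order, so dummy columns are safer
flooring = pd.Series(["pine", "tile", "carpet", "spruce"])
print(pd.get_dummies(flooring, prefix="floor"))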

Categorical codes are usually better for the model if you can measure the categories accurately, but since that is difficult for many categorical variables, statisticians often use dummy variables instead: it keeps them from biasing the model with a subjective opinion of each category’s value, and from spending an exorbitant amount of time guessing and checking how the model changes with different scales.

Hope this somewhat helps.

1 Like

@WilfriedF
pd.get_dummies has a sparse parameter. When this is set to True, the size reduces to 50160 bytes, which is still very large.

I converted cat_df and dummy_df to sparse matrices. Both gave the same size of 64 bytes.

from scipy.sparse import csr_matrix

getsizeof(csr_matrix(pd.DataFrame(cat_df.codes)))     # pd.Categorical codes
getsizeof(csr_matrix(pd.DataFrame(dummy_df.values)))  # pd.get_dummies output

An algorithm that processes the data as a sparse matrix will be fast and efficient.

1 Like

Thanks for the response. If I understand correctly, the pandas sparse parameter is not compressing very well, right? So it’s probably preferable to use the scikit-learn transformer OneHotEncoder in this case (its sparse parameter is set to True by default), though I didn’t check the size reduction as you did.
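For reference, a minimal sketch with OneHotEncoder (note that newer scikit-learn versions renamed the sparse parameter to sparse_output, but the default output is still a sparse matrix):

from scipy import sparse
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# A tiny made-up frame standing in for the block_chain series above
X = pd.DataFrame({"block_chain": ["Bitcoin", "Ethereum", "EOSIO", "ZCash"]})

encoder = OneHotEncoder()                  # sparse output by default
encoded = encoder.fit_transform(X)

print(sparse.issparse(encoded))            # True: stored as a SciPy sparse matrix
print(encoded.shape)                       # (4, 4): one column per unique value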

Not an example with dummy variables, but here they compare sparse vs dense performance using coo_matrix: Lasso on dense and sparse data

1 Like

Pandas is built on top of NumPy. When I converted to a NumPy array/matrix, I got the same size.

As long as the output is a NumPy array, it does not make a difference. A SciPy sparse matrix does.

1 Like

This is strange because SciPy is also built on NumPy!

I am guessing it has to do with how the outputs are retrieved and stored.

For example, iterators are more efficient: they produce output only as it is needed.

They also have the same size.

1 Like

In summary, use one-hot encoding for linear models, and use label encoding for decision trees.

Decision trees do not try to minimize a distance or evaluate some mathematical function of the feature values, so they do not care about order. They just find the best point at which to split the data to get maximum purity.
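A tiny sketch of that idea (the data is made up): a tree fit on integer codes simply picks a split threshold on the codes; it never uses their magnitude in an arithmetic sense.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up data: the tree only needs split points, not a meaningful order
workclass = pd.Series(["Private", "State-gov", "Private", "Self-emp",
                       "State-gov", "Private"])
X = pd.DataFrame({"workclass_code": pd.Categorical(workclass).codes})
y = [0, 1, 0, 1, 1, 0]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["workclass_code"]))   # splits like "workclass_code <= 0.50"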

More on this…

Categorical variables can be nominal or ordinal. Use one-hot encoding for nominal categorical variables and label encoding for ordinal categorical variables.

2 Likes