Hello @spi
Let me try to explain with this example. Imagine you have a Series containing the names of different blockchains, with 10,000 rows:
```python
from sys import getsizeof
import pandas as pd
import numpy as np

block_chain = np.random.choice(
    ["Bitcoin", "Ethereum", "EOSIO", "ZCash", "Corda", "Hyperledger", "Quorum"],
    size=10_000,
)
df = pd.Series(block_chain)
```
We have 7 unique members: `["Bitcoin", "Ethereum", "EOSIO", "ZCash", "Corda", "Hyperledger", "Quorum"]`, and the size of this Series is:
```python
print(f"Get Size of the original Series object: {getsizeof(df)} bytes")
```

Output:

```
Get Size of the original Series object: 637269 bytes
```
The original Series has dtype `object`, i.e. it contains strings. Before feeding it to a model, these strings need to be converted to numbers, which we can do with `pd.Categorical` or `pd.get_dummies`.
When deciding which one to use, look at cardinality, size, and rank. If the data has 10,000 rows and the number of unique values of a feature is close to 10,000, that is high cardinality. Such an extreme case has no predictive value, and the column should be dropped.

Now assume the number of unique items is 50. This has predictive value and we can use the feature. If it is acceptable to introduce a rank to the dataset (`[0 < 1, 1 < 2, ...]`), we use `pd.Categorical`. We may also be left with no other option if `pd.get_dummies` produces a frame so large that the program crashes.
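A quick way to check cardinality before choosing is `Series.nunique()`. A minimal sketch (the 0.9 cut-off is an arbitrary illustration, not a fixed rule):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series(rng.choice(["Bitcoin", "Ethereum", "EOSIO", "ZCash"], size=10_000))

# Ratio of unique values to rows: close to 1.0 means high cardinality.
ratio = s.nunique() / len(s)
print(ratio)                 # 4 unique values / 10,000 rows = 0.0004
keep_feature = ratio < 0.9   # arbitrary cut-off, for illustration only
```

Here the ratio is tiny, so the feature is worth keeping and encoding.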
For our example, we have 7 unique items. The size when we use `pd.Categorical` is:

```python
cat_df = pd.Categorical(df)
print(f"Get Size of the Categorical Series object: {getsizeof(cat_df)} bytes")
```

Output:

```
Get Size of the Categorical Series object: 10778 bytes
```
`pd.Categorical` replaces each unique item with an integer code between 0 and 6. The size is significantly reduced compared to the original data.
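You can see that mapping directly: a Categorical exposes the unique values in `.categories` and the integer replacements in `.codes`. A small sketch with just a few values:

```python
import pandas as pd

cat = pd.Categorical(["Bitcoin", "Ethereum", "Bitcoin", "ZCash"])
print(list(cat.categories))  # ['Bitcoin', 'Ethereum', 'ZCash'] (sorted unique values)
print(list(cat.codes))       # [0, 1, 0, 2] -- each string replaced by a small integer
```

The codes are stored as `int8` here, one byte per row, which is why the memory drop is so large.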
When we use `pd.get_dummies`, we get the following:

```python
dummy_df = pd.get_dummies(df)
print(f"Get Size of the get_dummies DataFrame: {getsizeof(dummy_df)} bytes")
```

Output:

```
Get Size of the get_dummies DataFrame: 70160 bytes
```
`pd.get_dummies` creates one indicator column per unique value, so the result is smaller than the original string data but larger than with `pd.Categorical`.
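If the dummy frame is too large but you still need one-hot columns, `pd.get_dummies` accepts `sparse=True`, which stores only the nonzero entries. A sketch comparing memory use (the exact byte counts depend on your pandas version):

```python
import numpy as np
import pandas as pd

s = pd.Series(np.random.choice(
    ["Bitcoin", "Ethereum", "EOSIO", "ZCash", "Corda", "Hyperledger", "Quorum"],
    size=10_000,
))

dense = pd.get_dummies(s)
sparse = pd.get_dummies(s, sparse=True)  # columns become SparseArrays

dense_bytes = dense.memory_usage(deep=True).sum()
sparse_bytes = sparse.memory_usage(deep=True).sum()
print(dense_bytes, sparse_bytes)  # sparse stores only the single 1 per row
```

The more unique values there are, the bigger the saving, since each row still contains exactly one nonzero entry no matter how many columns exist.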
To summarize:

- Use `pd.Categorical` when the number of unique items is large enough that `pd.get_dummies` might crash your program, and when it is okay to introduce a rank to your data.
- Use `pd.get_dummies` when the number of unique values does not create a DataFrame too large for memory, and when you do not want a rank in your data.
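And if the rank is actually meaningful (for example small < medium < large), `pd.Categorical` can make it explicit with `ordered=True`. A minimal sketch with hypothetical size labels:

```python
import pandas as pd

sizes = pd.Series(pd.Categorical(
    ["small", "large", "medium"],
    categories=["small", "medium", "large"],  # explicit order, not alphabetical
    ordered=True,
))

print(list(sizes.cat.codes))   # [0, 2, 1] -- codes follow the declared order
print(list(sizes < "medium"))  # [True, False, False]
print(sizes.max())             # large
```

With `ordered=True`, comparisons and `min`/`max` respect the rank you declared instead of raising an error.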