Techniques to deal with high cardinality variables

Screen Link: https://app.dataquest.io/m/134/machine-learning-project-walkthrough%3A-preparing-the-features/7/categorical-columns

On this page, the addr_state column is dropped because of its high cardinality, with the following explanation:

Lastly, the addr_state column contains many discrete values and we’d need to add 49 dummy variable columns to use it for classification. This would make our Dataframe much larger and could slow down how quickly the code runs. Let’s remove this column from consideration.

However, this might not be desirable in practice, as we might come across variables that carry crucial information (predictive power) but have high cardinality. I would like to know some useful techniques for dealing with high cardinality in the features without making the dataset too wide (after creating dummy variables).


I would appreciate it if someone from the Dataquest team could help with this question. Thanks.

@Sahil This has been unanswered for a month now. Can you please help?

You can find some techniques here


Great question! Just a couple of thoughts on the topic:

In this instance, what would you expect to see if this high-cardinality variable did have high predictive value? I would guess that for some states (or even just one state) there would be a significantly different distribution of our target variable; for example, loans given to people in Michigan are 5x more likely to be charged off than anywhere else. Conversely, if Michigan and California have the same rate of loan repayment, then knowing which state a loan came from does not reduce the entropy of the target, so it gives us no information gain or predictive power. So, let's see if that is the case! First, here is a chart of loan origination by state. What information do you see here?

It looks like the top few states have almost as many loans as the rest of the country combined. So if we use this variable in our model as-is, it will be biased toward those states. Not necessarily applicable here, but something to consider in other cases.
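
If you want to check that concentration yourself, here is a quick sketch. I'm assuming the `loans` DataFrame from the walkthrough with its `addr_state` column; the variable name is my assumption, not something from the mission code.

```python
import pandas as pd

# Sketch only -- assumes a DataFrame `loans` with the mission's addr_state column.
state_counts = loans["addr_state"].value_counts()

# How concentrated are originations? Share of all loans held by the top 5 states.
top5_share = state_counts.head(5).sum() / state_counts.sum()
print(state_counts.head(10))
print(f"Top 5 states hold {top5_share:.0%} of the loans")
```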

So, what do the states' payoff rates look like? Well, I took waaaaay too long coming up with this, but here is a graph that shows the percent paid off by state. As we can see, the range is from about 80% to 95%, with a mean of 86%. While I would rather take a bet with a 95% chance of winning than one with an 80% chance, for the purposes of machine learning this doesn't seem to be enough variance to warrant inclusion in our model, especially given that some of those numbers are generated from only a handful of loans.
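
Here is roughly how you could compute those payoff rates, plus the information gain idea from above made concrete. A sketch only: I'm assuming a binary `loan_status` column where 1 = fully paid and 0 = charged off, which is how the mission encodes the target, but treat the names as assumptions.

```python
import numpy as np

def binary_entropy(p):
    # Entropy (in bits) of a Bernoulli variable with success probability p.
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

# Payoff rate and loan count per state.
by_state = loans.groupby("addr_state")["loan_status"].agg(["mean", "size"])
print(by_state["mean"].describe())   # compare with the ~80-95% spread described above

# Information gain from knowing the state: marginal entropy of the target
# minus the loan-count-weighted entropy of the target within each state.
h_target = binary_entropy(loans["loan_status"].mean())
weights = by_state["size"] / by_state["size"].sum()
h_given_state = (weights * binary_entropy(by_state["mean"])).sum()
print(h_target - h_given_state)      # close to zero -> little predictive power
```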

However, if there is one standout state (high or low), you could consider breaking it out into a binary dummy (e.g. 'CA' vs. 'not CA'). Or you could rank the states by percent paid off and divide them into, say, 3 performance tiers (see the sketch after this list). I think the key here is to consider a few of the following questions:

How much target variability is there across the categorical values?
How much bias is there in the distribution of the categorical values?
How many distinct values are there?
Can the values be combined or expressed another way?
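
Here is what those last two ideas (a single 'CA / not CA' dummy, or a handful of performance tiers) might look like in code. A sketch only: the `loans`, `addr_state`, and `loan_status` names are as above, and the three-way split is illustrative rather than something from the mission.

```python
import pandas as pd

# Option 1: one binary dummy for a single standout state.
loans["is_ca"] = (loans["addr_state"] == "CA").astype(int)

# Option 2: rank states by payoff rate, bucket them into three tiers, and
# dummy-encode the tier -- 2 dummy columns instead of 49.
rate_by_state = loans.groupby("addr_state")["loan_status"].mean()
state_tier = pd.qcut(rate_by_state, q=3, labels=["low", "mid", "high"])
loans["state_tier"] = loans["addr_state"].map(state_tier)
loans = pd.get_dummies(loans, columns=["state_tier"], drop_first=True)
```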

Hope this helps at least get this conversation started.

Hi,

There is a good way to deal quickly with categorical variables: statsmodels + patsy formulas and/or design matrices.

An example of formulas here: Fitting models using R-style formulas — statsmodels
An example of formulas + design matrices here: Getting started — statsmodels

You can also apply, for example, a polynomial transformation of order X to the categorical data (first you need to build the design matrix), which reduces the dimensionality: you end up with a matrix of X + 1 columns instead of 48.
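
A small sketch of what that could look like, assuming the `loans` DataFrame with `addr_state` and a binary `loan_status` target. The Poly contrast and the number of low-order terms kept are illustrative, and polynomial contrasts really assume an ordered category, so treat this as one option rather than the recommended approach.

```python
import statsmodels.formula.api as smf
from patsy import dmatrix

# C() dummy-encodes the categorical column inside the formula, so you never
# have to add the 49 dummy columns to the DataFrame yourself.
model = smf.logit("loan_status ~ C(addr_state)", data=loans).fit()

# The same encoding as a standalone design matrix.
X = dmatrix("C(addr_state)", data=loans, return_type="dataframe")
print(X.shape)

# Patsy also supports other contrast codings, e.g. orthogonal polynomial
# contrasts. The full Poly coding still produces (levels - 1) columns, so to
# end up with only X + 1 columns you would keep just the low-order terms.
X_poly = dmatrix("C(addr_state, Poly)", data=loans, return_type="dataframe")
X_reduced = X_poly.iloc[:, :4]   # intercept + first 3 polynomial terms (X = 3)
```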