Feature Selection - features with low variance

Hello,

In the lecture on feature selection, one strategy was to drop columns with low variance (since they don’t meaningfully contribute to the model’s predictive capability).
In the case of categorical variables, one can say that when a column has only a few unique values and more than 95% of its values belong to a single category, the column can be dropped.
But I am having trouble understanding how to set an appropriate cutoff value for determining which numerical columns have low variance.
Presumably such columns first need to be scaled (e.g. with min-max scaling), but then how can we decide the cutoff value for the corresponding variances?
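
For concreteness, here is roughly what I have in mind (the column names are made up, and pandas plus scikit-learn's MinMaxScaler are used just for illustration):

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy data frame; "color" and "height" are made-up columns for illustration
df = pd.DataFrame({
    "color": ["red"] * 97 + ["blue", "green", "blue"],   # 97% of rows in one category
    "height": [1.70, 1.65, 1.80] * 33 + [1.75],          # numerical column
})

# Categorical case: drop the column if one category covers more than 95% of the rows
dominant_share = df["color"].value_counts(normalize=True).max()
drop_color = dominant_share > 0.95

# Numerical case: min-max scale first, then look at the variance...
scaled_height = MinMaxScaler().fit_transform(df[["height"]])
print(dominant_share, drop_color, scaled_height.var())
# ...but what variance value should count as "low" here?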

Thank you in advance for your help!
Jessica


Hey @jessica.lanini,

If the variance is low or close to zero, then a feature is approximately constant and will not improve the performance of the model. In that case, it should be removed.

Or if only a handful of observations differ from a constant value, the variance will also be very low.

This situation, where a feature has been poorly evaluated or brings little information because it is (almost) constant, can be a justification for removing the column.

Otherwise, you have to set an arbitrary variance threshold to determine which features to remove, and then use the accuracy of the predictions after removal to check that the removal was justified. Basically, it is trial and error.
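
As a sketch of that trial-and-error approach (the threshold grid and LogisticRegression below are just illustrative choices, not a recommendation), you can cross-validate the model for a few candidate thresholds and keep the largest one that does not hurt accuracy:

from sklearn import datasets
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = datasets.load_iris(return_X_y=True)

# The candidate thresholds below are arbitrary; adapt them to the scale of your data
for threshold in [0.0, 0.1, 0.2, 0.5, 1.0]:
    pipe = Pipeline([
        ("selector", VarianceThreshold(threshold=threshold)),
        ("model", LogisticRegression(max_iter=1000)),
    ])
    score = cross_val_score(pipe, X, y, cv=5).mean()
    print(threshold, round(score, 3))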

What counts as a sensible variance threshold depends on the probability distribution of the feature. For example, for a normally distributed feature you would reason about its usual (normal) variance, while for boolean features the Bernoulli variance applies (see the scikit-learn example below).

Given your problem statement, for a feature where 95% or more of the values fall into a single category, the variance is very close to zero. Hence, the feature will not help the model predict the target and should be removed.
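
To see the connection, if a binary-encoded column has 95% of its values in one category, its variance is only 0.95 * 0.05 = 0.0475, i.e. close to zero. A quick check with a synthetic column:

import numpy as np

# Synthetic 0/1 column where 95% of the values are 0
column = np.array([0] * 95 + [1] * 5)
print(column.var())  # 0.0475 = 0.95 * 0.05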

Variance threshold for feature selection:

  • low-variance features contain little information
  • calculate the variance of each feature, then drop the features whose variance falls below some threshold (see the sketch just after this list)
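
Done by hand, that recipe looks roughly like this (a minimal sketch using pandas; the threshold of 0.5 is arbitrary, and ddof=0 matches the population variance that VarianceThreshold computes):

import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Population variance (ddof=0) of each feature
variances = df.var(ddof=0)
print(variances)

# Keep only the columns whose variance exceeds the (arbitrary) threshold
threshold = 0.5
df_selected = df.loc[:, variances > threshold]
print(df_selected.columns.tolist())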

Example using the iris dataset:

from sklearn import datasets
from sklearn.feature_selection import VarianceThreshold
# Load iris data
iris = datasets.load_iris()

# Create features and target
data = iris.data
target = iris.target
# Create VarianceThreshold object with a threshold of 0.5
thresholder = VarianceThreshold(threshold=.5)

# Conduct variance thresholding
data_high_variance = thresholder.fit_transform(data)

# View the first five rows of the remaining high-variance features
data_high_variance[:5]
array([[ 5.1,  1.4,  0.2],
       [ 4.9,  1.4,  0.2],
       [ 4.7,  1.3,  0.2],
       [ 4.6,  1.5,  0.2],
       [ 5. ,  1.4,  0.2]])
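
One thing worth adding: fit_transform returns a plain array, so if you also want to know which columns survived, you can ask the fitted selector for its support mask (get_support is part of scikit-learn's feature-selector API):

# Indices of the columns that were kept (here 0, 2 and 3: sepal length, petal length, petal width)
thresholder.get_support(indices=True)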

Example from scikit-learn:
You can also follow the variance-threshold example in the scikit-learn documentation, reproduced below.

As an example, suppose that we have a dataset with boolean features, and we want to remove all features that are either one or zero (on or off) in more than 80% of the samples. Boolean features are Bernoulli random variables, and the variance of such variables is given by

Var[X] = p * (1 - p)

where p = P(X = 1) is the probability that the feature equals 1, and 1 - p the probability that it equals 0.

So we can select using the threshold .8 * (1 - .8) = 0.16:

from sklearn.feature_selection import VarianceThreshold
X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
sel.fit_transform(X)
array([[0, 1],
       [1, 0],
       [0, 0],
       [1, 1],
       [1, 0],
       [1, 1]])

As expected, VarianceThreshold has removed the first column, which has a probability p = 5/6 > 0.8 of containing a zero.

Hello @alvinctk,

Thanks a lot for your detailed answer! Now it is much clearer!

Have a nice day,
Jessica
