Often, datasets contain features that are irrelevant to the current problem. Feature selection is the process of reducing the number of features in your dataset by keeping only the ones that matter. With fewer features, the dataset is smaller, which shortens both training and prediction time, and accuracy often improves because the model no longer fits noise from irrelevant features.
For recursive feature elimination, the scikit-learn package provides two implementations: RFE, which requires you to specify the number of features to select, and RFECV, which tunes the number of features automatically through cross-validation.
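As a minimal sketch of the variant where you fix the number of features yourself, the snippet below keeps exactly three features. The dataset sizes and the choice of three features are assumptions made for illustration, and the code assumes a recent scikit-learn release.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# Small synthetic dataset (sizes chosen for illustration only):
# 10 total features, 3 of them informative
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
    n_redundant=0, n_repeated=0, n_classes=2, random_state=0)

# Keep exactly 3 features, eliminating one feature per iteration
featureSelector = RFE(estimator=SVC(kernel='linear'),
    n_features_to_select=3, step=1)
featureSelector.fit(X, y)

# Boolean mask of the features that were kept
print(featureSelector.support_)

# Reduce the dataset to the kept features
X_reduced = featureSelector.transform(X)
print(X_reduced.shape)
```

Because the number of features is fixed in advance, this variant needs no cross-validation, but you have to know or guess a sensible value.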
The following example is based on http://scikit-learn.org/dev/auto_examples/plot_rfe_with_cross_validation.html
# Synthesize a classification dataset with 25 total features,
# 3 informative features, 2 redundant features
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=25, n_informative=3,
    n_redundant=2, n_repeated=0, n_classes=8, n_clusters_per_class=1,
    random_state=0)
# Select features
from sklearn.feature_selection import RFECV
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
featureSelector = RFECV(estimator=SVC(kernel='linear'), step=1,
    cv=StratifiedKFold(n_splits=2), scoring='accuracy')
featureSelector.fit(X, y)
# Look at fitted parameters of featureSelector
[x for x in dir(featureSelector) if not x.startswith('_') and x.endswith('_')]
# Check the number of features
len(X[0])
# Look at a specific sample of features
X[0]
# Look at how the features have been ranked
featureSelector.ranking_
# Get a boolean index array marking which features were selected
featureSelector.support_
# Count the number of features that were selected (rank 1)
print(sum(featureSelector.ranking_ == 1))
print(sum(featureSelector.support_))
print(featureSelector.n_features_)
# Look at how the performance of the classifier changes as
# features are included in the dataset in order of informative rank;
# note that the cross-validation score is the mean accuracy across folds
# because we chose scoring='accuracy'
# (cv_results_ requires scikit-learn 1.0 or later)
meanScores = featureSelector.cv_results_['mean_test_score']
print(meanScores)
# Plot the above information;
# note that after including the third feature,
# the performance of the classifier should no longer improve noticeably
import matplotlib.pyplot as plt
plt.figure()
plt.title('Cross-validation scores after recursive feature elimination')
plt.xlabel('Number of features selected')
plt.ylabel('Mean cross-validated accuracy')
plt.plot(range(1, len(meanScores) + 1), meanScores)
plt.show()
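Once the selector has been fitted, the selection can be applied to the data itself. The sketch below is self-contained and uses a smaller dataset than the example above so that it runs quickly; the dataset sizes are assumptions made for illustration, and the code assumes a recent scikit-learn release.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

# Small synthetic dataset (sizes chosen for illustration only)
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
    n_redundant=2, n_repeated=0, n_classes=2, n_clusters_per_class=1,
    random_state=0)

# Let cross-validation choose how many features to keep
featureSelector = RFECV(estimator=SVC(kernel='linear'), step=1,
    cv=StratifiedKFold(n_splits=2), scoring='accuracy')
featureSelector.fit(X, y)

# Keep only the selected columns
X_selected = featureSelector.transform(X)

# Equivalent: index the columns with the boolean mask
X_masked = X[:, featureSelector.support_]

print(X_selected.shape)
```

Both forms produce the same reduced array; transform is convenient when the selector is part of a longer processing chain, while the mask is handy when you also need to look up the names of the selected columns.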