
Feature Selection


Feature selection (also known as subset selection) is a process commonly used in machine learning, wherein a subset of the features available from the data is selected for application of a learning algorithm. The best subset contains the least number of dimensions that most contribute to accuracy; we discard the remaining, unimportant dimensions. This is an important stage of pre-processing and is one of two ways of avoiding the curse of dimensionality (the other is feature extraction). There are two approaches:

Forward selection: start with no variables and add them one by one, at each step adding the one that decreases the error the most, until any further addition does not significantly decrease the error.

Backward selection: start with all the variables and remove them one by one, at each step removing the one whose removal decreases the error the most (or increases it only slightly), until any further removal increases the error significantly.

In either case, to reduce overfitting, the error referred to above is measured on a validation set that is distinct from the training set.
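To make the forward-selection procedure concrete, here is a minimal sketch of the greedy loop described above, assuming scikit-learn is available; the breast-cancer dataset, the logistic-regression scorer, and the 70/30 validation split are illustrative choices, not part of the definition.

    # Minimal sketch of greedy forward selection (illustrative; assumes scikit-learn).
    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

    def validation_error(features):
        """Error on the validation set using only the given feature indices."""
        model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
        model.fit(X_train[:, features], y_train)
        return 1.0 - model.score(X_val[:, features], y_val)

    selected = []                        # start with no variables
    remaining = list(range(X.shape[1]))
    best_error = np.inf

    while remaining:
        # Try adding each remaining feature; keep the one that lowers the error most.
        err, best_f = min((validation_error(selected + [f]), f) for f in remaining)
        if err >= best_error:            # no further decrease in validation error: stop
            break
        best_error = err
        selected.append(best_f)
        remaining.remove(best_f)

    print("selected features:", selected, "validation error:", round(best_error, 3))

The sketch stops as soon as no candidate feature lowers the validation error; a tolerance could be added to require a "significant" decrease instead, and backward selection is the mirror image, starting from the full set and greedily removing features.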

"There are two main methods for reducing dimensionality: feature selection and feature extraction. In feature selection, we are interested in finding k of the d dimensions that give us the most information and we discard the other (d - k) dimensions. We are going to discuss subset delection as a feature selection method.
[...]
In subset selection, we are interested in finding the best subset of the set of features. The best subset contains the least number of dimensions that most contribute to accuracy. We discard the remaining, unimportant dimensions. Using a suitable error function, this can be used in both regression and classification problems. There are 2^d possible subsets of d variables, but we cannot test for all of them unless d is small and we employ heuristics to get a reasonable (but not optimal) solution in reasonable (polynomial) time.
There are two approaches: In forward selection, we start with no variables and add them one by one, at each step adding the one that decreases the error the most, until any further addition does not decrease the error (or decreases it only sightly (sic)). In backward selection, we start with all variables and remove them one by one, at each step removing the one that decreases the error the most (or increases it only slightly), until any further removal increases the error significantly. In either case, checking the error should be done on a validation set distinct from the training set because we want to test the generalization accuracy. With more features, generally we have lower training error, but not necessarily lower validation error.
[...]"
Alpaydin (2004), p 106
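Alpaydin's point that the error should be checked on data distinct from the training set is what library implementations of sequential selection do by default. As a hedged usage sketch, scikit-learn's SequentialFeatureSelector runs the same greedy search in either direction with cross-validated scoring; the diabetes dataset, linear-regression estimator, and subset size of 5 below are arbitrary choices, not taken from the quoted text.

    # Sketch of forward and backward subset selection via scikit-learn's
    # SequentialFeatureSelector; dataset, estimator and parameters are arbitrary choices.
    from sklearn.datasets import load_diabetes
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.linear_model import LinearRegression

    X, y = load_diabetes(return_X_y=True)

    for direction in ("forward", "backward"):
        sfs = SequentialFeatureSelector(
            LinearRegression(),
            n_features_to_select=5,   # fix the subset size for the sketch
            direction=direction,
            cv=5,                     # score on held-out folds, not on the training error
        )
        sfs.fit(X, y)
        print(direction, "->", sfs.get_support(indices=True))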

"An important issue that often confronts data miners in practice is the problem of having too many variables. Simply put, not all variables that are measured are likely to be necessary for accurate discrimination and including them in the classification model may in fact lead to a worse model than if they were removed. Consider the simple example of building a system to discriminate between images of male and female faces (a task that humans perform effortlessly and relatively accurately but that is quite challenging for an image classification algorithm), The colors of a person's eyes, hair, or skin are hardly likely to be useful in this discriminative context. These are variables that are easy to measure (and indeed are general characteristics of a person's appearance) but carry little information as to the class identity in this particular case."
Hand, Mannila and Smyth (2001), p 362

"One of the central issues in induction concerns the selection of useful features. Although most learning methods attempt to either select attributes or assign them degrees of importance, both theoretical analyses and experimental studies indicate that many algorithms scale poorly to domains with large numbers of irrelevant features. For example, the number of training cases needed for simple nearest neighbor [...] to reach a given level of accuracy appears to grow exponentially with the number of irrelevant features, independent of the target concept. Even methods for inducing univariate decision trees, which explicitly select some attributes in favor of others, exhibit this behavior for some target concepts. And some techniques, like the naive Bayesian classifier [...], can be very sensitive to domains with correlated attributes. This suggests the need for additional methods to select a useful subset of features when many are available."
Langley (1996), p 233, p 253

"Feature selection, also known as subset selection or variable selection, is a process commonly used in machine learning, wherein a subset of the features available from the data are selected for application of a learning algorithm. Feature selection is necessary either because it is computationally infeasible to use all available features, or because of problems of estimation when limited data samples (but a large number of features) are present. The latter problem is related to the so-called curse of dimensionality."
Wikipedia (2006)
