Feature selection (also known as subset selection) is a process commonly used in machine learning, wherein a subset of the features available from the data is selected for application of a learning algorithm. The best subset contains the least number of dimensions that most contribute to accuracy; we discard the remaining, unimportant dimensions. This is an important stage of pre-processing and is one of two ways of avoiding the curse of dimensionality (the other is feature extraction).
There are two approaches:
\begin{description}
\item [forward selection] Start with no variables and add them one by one, at each step adding the one that decreases the error the most, until any further addition does not significantly decrease the error.
\item [backward selection] Start with all the variables and remove them one by one, at each step removing the one that decreases the error the most (or increases it only slightly), until any further removal increases the error significantly.
\end{description}
In either case, to reduce overfitting, the error referred to above should be measured on a validation set distinct from the training set.
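As a concrete (and hedged) illustration, the forward-selection loop can be sketched as follows, assuming a generic error function \texttt{error(subset)} that trains on the training set and returns the error of the resulting model on the validation set; the function name and the tolerance parameter are illustrative, not taken from any of the sources cited below.

\begin{verbatim}
# Greedy forward selection (sketch): add one feature at a time, always the
# one that lowers validation error the most, and stop when no addition helps.
def forward_selection(all_features, error, min_decrease=1e-4):
    """error(subset) trains on the training set and returns the
    validation-set error obtained with that subset of features."""
    selected = []
    best_err = float("inf")
    while True:
        candidates = [f for f in all_features if f not in selected]
        if not candidates:
            break
        # Try each remaining feature and keep the best one-step extension.
        trial = {f: error(selected + [f]) for f in candidates}
        best_f = min(trial, key=trial.get)
        if best_err - trial[best_f] < min_decrease:
            break  # no further addition decreases the error significantly
        selected.append(best_f)
        best_err = trial[best_f]
    return selected, best_err
\end{verbatim}

Backward selection is the mirror image: start from the full feature set and, at each step, remove the feature whose removal decreases (or least increases) the validation error, stopping when any further removal increases the error significantly.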
"There are two main methods for reducing dimensionality: feature selection and feature extraction. In feature selection, we are interested in finding k of the d dimensions that give us the most information and we discard the other (d - k) dimensions. We are going to discuss subset delection as a feature selection method.
[...]
In subset selection, we are interested in finding the best subset of the set of features. The best subset contains the least number of dimensions that most contribute to accuracy. We discard the remaining, unimportant dimensions. Using a suitable error function, this can be used in both regression and classification problems. There are $2^d$ possible subsets of $d$ variables, but we cannot test for all of them unless $d$ is small, and we employ heuristics to get a reasonable (but not optimal) solution in reasonable (polynomial) time.
There are two approaches: In forward selection, we start with no variables and add them one by one, at each step adding the one that decreases the error the most, until any further addition does not decrease the error (or decreases it only sightly (sic)). In backward selection, we start with all variables and remove them one by one, at each step removing the one that decreases the error the most (or increases it only slightly), until any further removal increases the error significantly. In either case, checking the error should be done on a validation set distinct from the training set because we want to test the generalization accuracy. With more features, generally we have lower training error, but not necessarily lower validation error.
[...]"
Alpaydin (2004), p 106
"An important issue that often confronts data miners in practice is the problem of having too many variables. Simply put, not all variables that are measured are likely to be necessary for accurate discrimination and including them in the classification model may in fact lead to a worse model than if they were removed. Consider the simple example of building a system to discriminate between images of male and female faces (a task that humans perform effortlessly and relatively accurately but that is quite challenging for an image classification algorithm), The colors of a person's eyes, hair, or skin are hardly likely to be useful in this discriminative context. These are variables that are easy to measure (and indeed are general characteristics of a person's appearance) but carry little information as to the class identity in this particular case."
Hand, Mannila and Smyth (2001), p 362
"One of the central issues in induction concerns the selection of useful features. Although most learning methods attempt to either select attributes or assign them degrees of importance, both theoretical analyses and experimental studies indicate that many algorithms scale poorly to domains with large numbers of irrelevant features. For example, the number of training cases needed for simple nearest neighbor [...] to reach a given level of accuracy appears to grow exponentially with the number of irrelevant features, independent of the target concept. Even methods for inducing univariate decision trees, which explicitly select some attributes in favor of others, exhibit this behavior for some target concepts. And some techniques, like the naive Bayesian classifier [...], can be very sensitive to domains with correlated attributes. This suggests the need for additional methods to select a useful subset of features when many are available."
Langley (1996), p 233, p 253
"Feature selection, also known as subset selection or variable selection, is a process commonly used in machine learning, wherein a subset of the features available from the data are selected for application of a learning algorithm. Feature selection is necessary either because it is computationally infeasible to use all available features, or because of problems of estimation when limited data samples (but a large number of features) are present. The latter problem is related to the so-called curse of dimensionality."
Wikipedia (2006)
"This paper is a comparative study of feature selection methods in statistical learning of text categorization. The focus is on aggresive dimensionality reduction. Five methods were evaluated, including term selection based on document frequency (DF), information gain (IG), mutual information (MI), a $\chi^2$-test (CHI), and term strength (TS). We found IG and CHI most effective in our experiments.[...]"
@inproceedings{YangPedersen97,
author = {Yiming Yang and Jan O. Pedersen},
title = {A Comparative Study of Feature Selection in Text Categorization},
booktitle = {ICML '97: Proceedings of the Fourteenth International Conference on Machine Learning},
year = {1997},
editor = {},
pages = {412--420},
organization = {},
publisher = {Morgan Kaufmann Publishers Inc.},
address = {San Francisco, CA, USA},
month = {},
note = {},
key = {},
abstract = {This paper is a comparative study of feature selection methods in statistical learning of text categorization. The focus is on aggressive dimensionality reduction. Five methods were evaluated, including term selection based on document frequency (DF), information gain (IG), mutual information (MI), a $\chi^2$-test (CHI), and term strength (TS). We found IG and CHI most effective in our experiments. Using IG thresholding with a k-nearest neighbor classifier on the Reuters corpus, removal of up to 98\% of unique terms actually yielded an improved classification accuracy (measured by average precision). DF thresholding performed similarly. Indeed we found strong correlations between the DF, IG and CHI values of a term. This suggests that DF thresholding, the simplest method with the lowest cost in computation, can be reliably used instead of IG or CHI when the computation of these measures are too expensive. TS compares favorably with the other methods with up to 50\% vocabulary reduction but is not competitive at higher vocabulary reduction levels. In contrast, MI had relatively poor performance due to its bias towards favoring rare terms, and its sensitivity to probability estimation errors.}
}
In a comparative study of feature selection methods in statistical learning of text categorization (with a focus on aggressive dimensionality reduction), \citeasnoun{YangPedersen97} evaluated document frequency (DF), information gain (IG), mutual information (MI), a $\chi^2$-test (CHI) and term strength (TS), and found IG and CHI to be the most effective.
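To make the flavour of these term-scoring criteria concrete, the $\chi^2$ statistic for a term/category pair can be computed from a two-by-two table of document counts, roughly as sketched below; the variable names follow the usual contingency-table convention and are not taken from the paper itself.

\begin{verbatim}
# Chi-square score of a term t for a category c, from a 2x2 contingency table
# of document counts (standard formulation; a sketch, not the paper's code):
#   a  = docs in c containing t       b = docs outside c containing t
#   c_ = docs in c without t          d = docs outside c without t
def chi_square(a, b, c_, d):
    n = a + b + c_ + d
    num = n * (a * d - c_ * b) ** 2
    den = (a + c_) * (b + d) * (a + b) * (c_ + d)
    return num / den if den else 0.0
\end{verbatim}

Terms are then ranked by, for example, their maximum or average score over all categories, and only the top-scoring terms are retained.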
"In the feature subset selection problem, a learning algorithm is faced with the problem of selecting a relevant subset of features upon which to focus its attention, while ignoring the rest. To achieve the best possible performance with a particular learning algorithm on a particular training set, a feature subset selection method should consider how the algorithm and the training set interact. We explore the relation between optimal feature subset selection and relevance. Our wrapper method searches for an optimal feature subset tailored to a particular algorithm and a domain. We study the strengths and weaknesses of the wrapper approach and show a series of improved designs. We compare the wrapper approach to induction without feature subset selection and to Relief, a filter approach to feature subset selection. Significant improvement in accuracy is achieved for some datasets for the two families of induction algorithms used: decision trees and Naive-Bayes."
@article{KohaviJohn97,
author = {Ron Kohavi and George H. John},
title = {Wrappers for Feature Subset Selection},
journal = {Artificial Intelligence},
year = {1997},
volume = {97},
number = {1--2},
pages = {273--324},
month = {December},
note = {},
key = {},
abstract = {In the feature subset selection problem, a learning algorithm is faced with the problem of selecting a relevant subset of features upon which to focus its attention, while ignoring the rest. To achieve the best possible performance with a particular learning algorithm on a particular training set, a feature subset selection method should consider how the algorithm and the training set interact. We explore the relation between optimal feature subset selection and relevance. Our wrapper method searches for an optimal feature subset tailored to a particular algorithm and a domain. We study the strengths and weaknesses of the wrapper approach and show a series of improved designs. We compare the wrapper approach to induction without feature subset selection and to Relief, a filter approach to feature subset selection. Significant improvement in accuracy is achieved for some datasets for the two families of induction algorithms used: decision trees and Naive-Bayes.}
}
\citeasnoun{KohaviJohn97} introduced wrappers for feature subset selection. Their approach searches for an optimal feature subset tailored to a particular learning algorithm and a particular training set.
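The core of the wrapper idea can be sketched as a subset evaluator that treats the induction algorithm as a black box and scores candidate subsets by cross-validation on the training data; any search strategy (forward, backward, or the best-first search Kohavi and John actually use) can then call it. The snippet below assumes scikit-learn and is only a sketch, not their implementation.

\begin{verbatim}
# Wrapper-style subset evaluation (sketch): score a candidate feature subset
# by cross-validated accuracy of the very learner we intend to deploy.
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def wrapper_score(X, y, subset, estimator=None, cv=5):
    """Mean cross-validated accuracy of `estimator` restricted to `subset`."""
    estimator = estimator if estimator is not None else DecisionTreeClassifier()
    return cross_val_score(estimator, X[:, list(subset)], y, cv=cv).mean()
\end{verbatim}

Because the score depends on the chosen estimator, the subset found is tailored to that particular algorithm and training set, which is precisely the point of the wrapper approach.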
@article{GuyonElisseeff03,
author = {Isabelle Guyon and Andr{\'e} Elisseeff},
title = {An Introduction to Variable and Feature Selection},
journal = {Journal of Machine Learning Research},
year = {2003},
volume = {3},
number = {},
pages = {1157--1182},
month = {March},
note = {},
key = {},
abstract = {Variable and feature selection have become the focus of much research in areas of application for which datasets with tens or hundreds of thousands of variables are available. These areas include text processing of internet documents, gene expression array analysis, and combinatorial chemistry. The objective of variable selection is three-fold: improving the prediction performance of the predictors, providing faster and more cost-effective predictors, and providing a better understanding of the underlying process that generated the data. The contributions of this special issue cover a wide range of aspects of such problems: providing a better definition of the objective function, feature construction, feature ranking, multivariate feature selection, efficient search methods, and feature validity assessment methods.}
}
\citeasnoun{GuyonElisseeff03} gave an introduction to variable and feature selection. They recommend using a linear predictor of one's choice (e.g. a linear SVM) and selecting variables in either of two ways: (1) with a variable ranking method using a correlation coefficient or mutual information; (2) with a nested subset selection method performing forward or backward selection or with multiplicative updates.
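The first of these routes, variable ranking, is the simplest: score each feature independently and keep the top-ranked ones. A hedged sketch, assuming numeric features and using absolute Pearson correlation as the ranking criterion:

\begin{verbatim}
# Variable ranking by absolute Pearson correlation with the target (sketch).
import numpy as np

def rank_by_correlation(X, y):
    """Return feature indices ordered from most to least correlated with y."""
    scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    return np.argsort(scores)[::-1]
\end{verbatim}

Mutual information can be substituted for the correlation coefficient when nonlinear dependencies matter.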
JOHN, George H., R. KOHAVI and Karl PFLEGER, 1994. Irrelevant features and the subset selection problem, Machine Learning: Proceedings of the Eleventh International Conference, edited by William W. Cohen and Haym Hirsh, pages 121-129. [Cited by 669] (54.54/year)
"We address the problem of finding a subset of features that allows a supervised induction algorithm to induce small high-accuracy concepts. We examine notions of relevance and irrelevance, and show that the definitions used in the machine learning literature do not adequately partition the features into useful categories of relevance. We present definitions for irrelevance and for two degrees of relevance. These definitions improve our understanding of the behavior of previous subset selection algorithms, and help define the subset of features that should be sought. The features selected should depend not only on the features and the target concept, but also on the induction algorithm. We describe a method for feature subset selection using cross-validation that is applicable to any induction algorithm, and discuss experiments conducted with ID3 and C4.5 on artificial and real datasets."
@inproceedings{JohnKohaviPfleger94,
author = {George H. John and Ron Kohavi and Karl Pfleger},
title = {Irrelevant Features and the Subset Selection Problem},
booktitle = {Machine Learning: Proceedings of the Eleventh International Conference},
year = {1994},
editor = {William W. Cohen and Haym Hirsh},
pages = {121--129},
organization = {},
publisher = {Morgan Kaufmann Publishers},
address = {San Francisco, CA},
month = {},
note = {},
key = {},
abstract = {We address the problem of finding a subset of features that allows a supervised induction algorithm to induce small high-accuracy concepts. We examine notions of relevance and irrelevance, and show that the definitions used in the machine learning literature do not adequately partition the features into useful categories of relevance. We present definitions for irrelevance and for two degrees of relevance. These definitions improve our understanding of the behavior of previous subset selection algorithms, and help define the subset of features that should be sought. The features selected should depend not only on the features and the target concept, but also on the induction algorithm. We describe a method for feature subset selection using cross-validation that is applicable to any induction algorithm, and discuss experiments conducted with ID3 and C4.5 on artificial and real datasets.}
}
\citeasnoun{JohnKohaviPfleger94} addressed irrelevant features and the subset selection problem. They presented definitions for irrelevance and for two degrees of relevance (weak and strong), and stated that the features selected should depend not only on the features and the target concept, but also on the induction algorithm. Further, they claimed that the filter model approach to subset selection should be replaced with the wrapper model.
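Paraphrasing those relevance definitions in our own notation (so the exact wording may differ from the paper's): let $Y$ be the target, $X_i$ a feature, and $S_i$ the set of all features other than $X_i$. Feature $X_i$ is \emph{strongly relevant} if there exist values $x_i$, $y$ and $s_i$ with $P(X_i = x_i, S_i = s_i) > 0$ such that
\[ P(Y = y \mid X_i = x_i, S_i = s_i) \neq P(Y = y \mid S_i = s_i). \]
Feature $X_i$ is \emph{weakly relevant} if it is not strongly relevant but there exists a subset $S_i' \subseteq S_i$ and values $x_i$, $y$, $s_i'$ with $P(X_i = x_i, S_i' = s_i') > 0$ such that
\[ P(Y = y \mid X_i = x_i, S_i' = s_i') \neq P(Y = y \mid S_i' = s_i'). \]
Features that are neither strongly nor weakly relevant are irrelevant.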
@article{BlumLangley97,
author = {Avrim L. Blum and Pat Langley},
title = {Selection of Relevant Features and Examples in Machine Learning},
journal = {Artificial Intelligence},
year = {1997},
volume = {97},
number = {1--2},
pages = {245--271},
month = {December},
note = {},
key = {},
abstract = {In this survey, we review work in machine learning on methods for handling data sets containing large amounts of irrelevant information. We focus on two key issues: the problem of selecting relevant features, and the problem of selecting relevant examples. We describe the advances that have been made on these topics in both empirical and theoretical work in machine learning, and we present a general framework that we use to compare different methods. We close with some challenges for future work in this area.}
}
\citeasnoun{BlumLangley97} focussed on two key issues: the problem of selecting relevant features and the problem of selecting relevant examples.
@article{JainZongker97,
author = {Anil Jain and Douglas Zongker},
title = {Feature Selection: Evaluation, Application, and Small Sample Performance},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
year = {1997},
volume = {19},
number = {2},
pages = {153--158},
month = {February},
note = {},
key = {},
abstract = {A large number of algorithms have been proposed for feature subset selection. Our experimental results show that the sequential forward floating selection algorithm, proposed by Pudil et al. (1994), dominates the other algorithms tested. We study the problem of choosing an optimal feature set for land use classification based on SAR satellite images using four different texture models. Pooling features derived from different texture models, followed by a feature selection results in a substantial improvement in the classification accuracy. We also illustrate the dangers of using feature selection in small sample size situations.}
}
\citeasnoun{JainZongker97} considered various feature subset selection algorithms and found that the sequential forward floating selection algorithm, proposed by \citeasnoun{PudilNovovicovaKittler94}, dominated the other algorithms tested.
LIU, Huan and Hiroshi MOTODA, 1998. Feature Selection for Knowledge Discovery and Data Mining. Kluwer Academic Publishers Norwell, MA, USA. [Cited by 329] (39.81/year)
@book{LiuMotoda98,
author = {Huan Liu and Hiroshi Motoda},
title = {Feature Selection for Knowledge Discovery and Data Mining},
chapter = {},
pages = {},
publisher = {Kluwer Academic Publishers},
year = {1998},
volume = {},
series = {},
address = {},
edition = {},
month = {},
note = {},
key = {}
}
\citeasnoun{LiuMotoda98} wrote a book on feature selection that offers an overview of the methods developed since the 1970s and provides a general framework for examining and categorizing these methods.
KOLLER, D. and M. SAHAMI, 1996. Toward optimal feature selection, Proceedings of the Thirteenth International Conference on Machine Learning, Pages 284-292. [Cited by 363] (35.36/year)
@inproceedings{KollerSahami96,
author = {Daphne Koller and Mehran Sahami},
title = {Toward Optimal Feature Selection},
booktitle = {Proceedings of the Thirteenth International Conference on Machine Learning},
year = {1996},
editor = {},
pages = {284--292},
organization = {},
publisher = {Morgan Kaufmann},
address = {},
month = {July},
note = {},
key = {},
abstract = {In this paper, we examine a method for feature subset selection based on Information Theory. Initially, a framework for defining the theoretically optimal, but computationally intractable, method for feature subset selection is presented. We show that our goal should be to eliminate a feature if it gives us little or no additional information beyond that subsumed by the remaining features. In particular, this will be the case for both irrelevant and redundant features. We then give an efficient algorithm for feature selection which computes an approximation to the optimal feature selection criterion. The conditions under which the approximate algorithm is successful are examined. Empirical results are given on a number of data sets, showing that the algorithm effectively handles datasets with a very large number of features.}
}
\citeasnoun{KollerSahami96} examined a method for feature subset selection based on Information Theory: they presented a theoretically justified model for optimal feature selection based on using cross-entropy to minimize the amount of predictive information lost during feature elimination.
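Schematically, and in our own notation (which may differ from the paper's), the criterion prefers a reduced feature set $G \subseteq F$ whose class posterior stays close, in expected Kullback--Leibler divergence, to the posterior given the full feature set $F$:
\[ \delta(G) \;=\; \sum_{x} P(x)\, D_{\mathrm{KL}}\bigl( P(C \mid F = x) \,\big\|\, P(C \mid G = x_G) \bigr), \]
where $x_G$ denotes the projection of $x$ onto $G$. Their algorithm approximates this by greedily removing features that have an (approximate) Markov blanket among the remaining features, since removing such features leaves $\delta$ essentially unchanged.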
@inproceedings{Weston-etal00,
author = {Jason Weston and Sayan Mukherjee and Olivier Chapelle and Massimiliano Pontil and Tomaso Poggio and Vladimir Vapnik},
title = {Feature Selection for {SVMs}},
booktitle = {Advances in Neural Information Processing Systems 13},
year = {2001},
editor = {Todd K. Leen and Thomas G. Dietterich and Volker Tresp},
pages = {668--674},
organization = {},
publisher = {The MIT Press},
address = {Cambridge, MA},
month = {April},
note = {},
key = {},
abstract = {We introduce a method of feature selection for Support Vector Machines. The method is based upon finding those features which minimize bounds on the leave-one-out error. This search can be efficiently performed via gradient descent. The resulting algorithms are shown to be superior to some standard feature selection algorithms on both toy data and real-life problems of face recognition, pedestrian detection and analyzing DNA microarray data.},
conclusion = {In this article we have introduced a method to perform feature selection for SVMs. This method is computationally feasible for high dimensional datasets compared to existing wrapper methods, and experiments on a variety of toy and real datasets show superior performance to the filter methods tried. This method, amongst other applications, speeds up SVMs for time critical applications (e.g pedestrian detection), and makes possible feature discovery (e.g gene discovery). Secondly, in simple experiments we showed that SVMs can indeed suffer in high dimensional spaces where many features are irrelevant. Our method provides one way to circumvent this naturally occuring, complex problem.}
}
\citeasnoun{Weston-etal00} introduced a method of feature selection for SVMs which is based upon finding those features which minimize bounds on the leave-one-out error. The method was shown to be superior to some standard feature selection algorithms on the data sets tested.
@article{DashLiu97,
author = {M. Dash and H. Liu},
title = {Feature Selection for Classification},
journal = {Intelligent Data Analysis},
year = {1997},
volume = {1},
number = {1-4},
pages = {131--156},
month = {},
note = {},
key = {},
abstract = {Feature selection has been the focus of interest for quite some time and much work has been done. With the creation of huge databases and the consequent requirements for good machine learning techniques, new problems arise and novel approaches to feature selection are in demand. This survey is a comprehensive overview of many existing methods from the 1970's to the present. It identifies four steps of a typical feature selection method, and categorizes the different existing methods in terms of generation procedures and evaluation functions, and reveals hitherto unattempted combinations of generation procedures and evaluation functions. Representative methods are chosen from each category for detailed explanation and discussion via example. Benchmark datasets with different characteristics are used for comparative study. The strengths and weaknesses of different methods are explained. Guidelines for applying feature selection methods are given based on data types and domain characteristics. This survey identifies the future research areas in feature selection, introduces newcomers to this field, and paves the way for practitioners who search for suitable methods for solving domain-specific real-world applications.}
}
\citeasnoun{DashLiu97} gave a survey of feature selection methods for classification.
@article{PudilNovovicovaKittler94,
author = {P. Pudil and J. Novovi{\v{c}}ov{\'a} and J. Kittler},
title = {Floating Search Methods in Feature Selection},
journal = {Pattern Recognition Letters},
year = {1994},
volume = {15},
number = {11},
pages = {1119--1125},
month = {November},
note = {},
key = {},
abstract = {Sequential search methods characterized by a dynamically changing number of features included or eliminated at each step, henceforth ``floating'' methods, are presented. They are shown to give very good results and to be computationally more effective than the branch and bound method.}
}
\citeasnoun{PudilNovovicovaKittler94} presented ``floating'' search methods in feature selection. These are sequential search methods characterized by a dynamically changing number of features included or eliminated at each step. They were shown to give very good results and to be computationally more effective than the branch and bound method.
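A hedged sketch of the sequential forward floating selection (SFFS) idea: after each forward inclusion, conditionally exclude features for as long as doing so improves on the best subset of that size found so far. The criterion \texttt{score} (larger is better) and the other names are our own; this is a simplification, not the authors' algorithm statement.

\begin{verbatim}
# Sequential forward floating selection (SFFS), sketched: one forward step,
# then "floating" backward steps while they beat the best subset of that size.
def sffs(all_features, score, k):
    """Return an approximately best subset of size k, where score(subset)
    is any criterion with larger values being better."""
    best = {}                      # size -> (score, subset)
    subset = []
    while len(subset) < k:
        # Inclusion: add the single most significant feature.
        f = max((x for x in all_features if x not in subset),
                key=lambda x: score(subset + [x]))
        subset = subset + [f]
        s = score(subset)
        if s > best.get(len(subset), (float("-inf"),))[0]:
            best[len(subset)] = (s, list(subset))
        # Conditional exclusion: drop the least significant feature while
        # that improves on the best subset of the reduced size.
        while len(subset) > 2:
            g = max(subset, key=lambda x: score([y for y in subset if y != x]))
            reduced = [y for y in subset if y != g]
            s_red = score(reduced)
            if s_red > best.get(len(reduced), (float("-inf"),))[0]:
                best[len(reduced)] = (s_red, reduced)
                subset = reduced
            else:
                break
    return best[k][1]
\end{verbatim}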
@article{YangHonavar98,
author = {Jihoon Yang and Vasant Honavar},
title = {Feature Subset Selection Using a Genetic Algorithm},
journal = {IEEE Intelligent Systems},
year = {1998},
volume = {13},
number = {2},
pages = {44--49},
month = {March/April},
note = {},
key = {},
abstract = {Practical pattern-classification and knowledge-discovery problems require the selection of a subset of attributes or features to represent the patterns to be classified. The authors' approach uses a genetic algorithm to select such subsets, achieving multicriteria optimization in terms of generalization accuracy and costs associated with the features.}
}
\citeasnoun{YangHonavar98} used a genetic algorithm for feature subset selection.
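A toy sketch of the genetic-algorithm formulation: chromosomes are feature bit-masks and the fitness function is any subset score, for instance cross-validated accuracy minus a cost term for the features used. The parameter values and operator choices below are illustrative assumptions, not Yang and Honavar's settings.

\begin{verbatim}
# Toy genetic algorithm over feature bit-masks (sketch, not the authors' code).
import random

def ga_feature_selection(n_features, fitness, pop_size=30, generations=50,
                         crossover_rate=0.8, mutation_rate=0.02, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(pop, key=fitness, reverse=True)
        nxt = [list(c) for c in ranked[:2]]        # elitism: keep the two best
        while len(nxt) < pop_size:
            p1, p2 = rng.sample(ranked[:pop_size // 2], 2)  # truncation selection
            if rng.random() < crossover_rate:               # one-point crossover
                cut = rng.randrange(1, n_features)
                child = p1[:cut] + p2[cut:]
            else:
                child = list(p1)
            child = [1 - b if rng.random() < mutation_rate else b for b in child]
            nxt.append(child)
        pop = nxt
    return max(pop, key=fitness)                   # best bit-mask found
\end{verbatim}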
@article{Forman03,
author = {George Forman},
title = {An Extensive Empirical Study of Feature Selection Metrics for Text Classification},
journal = {Journal of Machine Learning Research},
year = {2003},
volume = {3},
number = {},
pages = {1289--1305},
month = {March},
note = {},
key = {},
abstract = {Machine learning for text classification is the cornerstone of document categorization, news filtering, document routing, and personalization. In text domains, effective feature selection is essential to make the learning task efficient and more accurate. This paper presents an empirical comparison of twelve feature selection methods (e.g. Information Gain) evaluated on a benchmark of 229 text classification problem instances that were gathered from Reuters, TREC, OHSUMED, etc. The results are analyzed from multiple goal perspectives---accuracy, F-measure, precision, and recall---since each is appropriate in different situations.\\
The results reveal that a new feature selection metric we call `Bi-Normal Separation' (BNS), outperformed the others by a substantial margin in most situations. This margin widened in tasks with high class skew, which is rampant in text classification problems and is particularly challenging for induction algorithms.\\
A new evaluation methodology is offered that focuses on the needs of the data mining practitioner faced with a single dataset who seeks to choose one (or a pair of) metrics that are most \textit{likely} to yield the best performance. From this perspective, BNS was the top single choice for all goals except precision, for which Information Gain yielded the best result most often. This analysis also revealed, for example, that Information Gain and Chi-Squared have correlated failures, and so they work poorly together. When choosing optimal pairs of metrics for each of the four performance goals, BNS is consistently a member of the pair---e.g., for greatest recall, the pair BNS + F1-measure yielded the best performance on the greatest number of tasks by a considerable margin.}
}
\citeasnoun{Forman03} presented an empirical comparison of twelve feature selection metrics for text classification. The results revealed that a new metric, `Bi-Normal Separation' (BNS), outperformed the others by a substantial margin in most situations.
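Bi-Normal Separation scores a term by how far apart its true-positive rate and false-positive rate sit under the inverse of the standard Normal CDF. A small sketch assuming SciPy; the clipping constant is an implementation safeguard of this sketch rather than part of the definition.

\begin{verbatim}
# Bi-Normal Separation (BNS) for a term: |F^{-1}(tpr) - F^{-1}(fpr)|, where
# F^{-1} is the inverse standard Normal CDF (sketch; clipping keeps it finite).
from scipy.stats import norm

def bns(tp, fp, pos, neg, eps=0.0005):
    """tp, fp: positive/negative documents containing the term;
    pos, neg: total numbers of positive/negative documents."""
    tpr = min(max(tp / pos, eps), 1 - eps)
    fpr = min(max(fp / neg, eps), 1 - eps)
    return abs(norm.ppf(tpr) - norm.ppf(fpr))
\end{verbatim}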
@inproceedings{XingJordanKarp01,
author = {Eric P. Xing and Michael I. Jordan and Richard M. Karp},
title = {Feature Selection for High-Dimensional Genomic Microarray Data},
booktitle = {ICML '01: Proceedings of the Eighteenth International Conference on Machine Learning},
year = {2001},
editor = {},
pages = {601--608},
organization = {},
publisher = {Morgan Kaufmann},
address = {San Francisco, CA, USA},
month = {},
note = {},
key = {},
abstract = {We report on the successful application of feature selection methods to a classification problem in molecular biology involving only 72 data points in a 7130 dimensional space. Our approach is a hybrid of filter and wrapper approaches to feature selection. We make use of a sequence of simple filters, culminating in Koller and Sahami's (1996) Markov Blanket filter, to decide on particular feature subsets for each subset cardinality. We compare between the resulting subset cardinalities using cross validation. The paper also investigates regularization methods as an alternative to feature selection, showing that feature selection methods are preferable in this problem.}
}
\citeasnoun{XingJordanKarp01} successfully applied feature selection methods (using a hybrid of filter and wrapper approaches) to a classification problem in molecular biology involving only 72 data points in a 7130 dimensional space. They also investigated regularization methods as an alternative to feature selection, and showed that feature selection methods were preferable in the problem they tackled.
@inproceedings{KiraRendell92,
author = {Kenji Kira and Larry A. Rendell},
title = {A Practical Approach to Feature Selection},
booktitle = {ML92: Proceedings of the Ninth International Conference on Machine Learning},
year = {1992},
editor = {Derek H. Sleeman and Peter Edwards},
pages = {249--256},
organization = {},
location = {Aberdeen, Scotland, United Kingdom},
publisher = {Morgan Kaufmann Publishers Inc.},
address = {San Francisco, CA, USA},
month = {},
note = {},
key = {},
abstract = {}
}
\citeasnoun{KiraRendell92} described a statistical feature selection algorithm called RELIEF that uses instance based learning to assign a relevance weight to each feature.
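The weight update at the heart of RELIEF can be sketched as below for numeric features and two classes; the distance metric, scaling and sample size are implementation choices of this sketch, not necessarily those of Kira and Rendell.

\begin{verbatim}
# RELIEF sketch (numeric features, two classes): sample instances, find each
# one's nearest hit (same class) and nearest miss (other class), and reward
# features that separate misses more than hits.
import numpy as np

def relief(X, y, n_samples=100, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    span = X.max(axis=0) - X.min(axis=0)   # scale per-feature differences
    span[span == 0] = 1.0
    w = np.zeros(d)
    for i in rng.integers(0, n, size=n_samples):
        dists = np.abs(X - X[i]).sum(axis=1).astype(float)
        dists[i] = np.inf                  # exclude the instance itself
        same, other = (y == y[i]), (y != y[i])
        hit = np.where(same)[0][np.argmin(dists[same])]
        miss = np.where(other)[0][np.argmin(dists[other])]
        w += ((X[i] - X[miss]) / span) ** 2 - ((X[i] - X[hit]) / span) ** 2
    return w / n_samples   # features with large weights are deemed relevant
\end{verbatim}

Features whose averaged weight exceeds a chosen relevance threshold are kept.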
KIRA, K. and L.A. RENDELL, 1992. The feature selection problem: Traditional methods and a new algorithm. AAAI-92: Proceedings of the 10th National Conference on Artificial Intelligence [Cited by 279] (19.54/year)
@inproceedings{KiraRendell,
author = {K. Kira and L. A. Rendell},
title = {The feature selection problem: Traditional methods and a new algorithm},
booktitle = {AAAI-92: Proceedings of the 10th National Conference on Artificial Intelligence},
year = {1992},
editor = {W. Swartout},
pages = {129--134},
organization = {},
publisher = {AAAI Press/The MIT Press},
address = {},
month = {August},
note = {},
key = {},
abstract = {}
}
COUVREUR, C. and Y. BRESLER, 2000. On the optimality of the backward greedy algorithm for the subset selection problem. SIAM Journal on Matrix Analysis and Applications. [Cited by 35] (5.59/year)
DOAK, J., 1992. An Evaluation of Feature Selection Methods and Their Application to Computer Security. University of California. [Cited by 52] (3.65/year)
FUKUNAGA, K. and W. KOONTZ, 1970. Application of the Karhunen-Loeve expansion to feature selection and ordering. IEEE Trans. Computers. [Cited by 96] (2.65/year)
NARENDRA, P.M. and K. FUKUNAGA, 1977. A Branch and Bound Algorithm for Feature Subset Selection. IEEE Transactions on Computers. [Cited by 262] (8.95/year)
YANG, Y. and J.O. PEDERSEN, 1996. Feature Selection in Statistical Learning of Text Categorization. Center for Machine Translation, Carnegie Mellon University. [Cited by 74] (7.21/year)