When trying to make data-driven decisions, we’re often faced with datasets that contain many more features than what we actually need for decision-making. As a consequence, building models becomes more computationally demanding, and even worse, model performance can suffer, as heavily parameterized models can lead to overfitting. How then, can we pare down our data to the most relevant set of features? In this post, we will discuss a solution proposed by Stanford scientists Ron Kohavi (now partner-level architect at Microsoft) and George H. John (now CEO at Rocket Fuel) in this article, in fact one of the the top 300 most-cited articles according to CiteSeerX.
The Basic Idea
Let’s consider a simple case where the data are composed of three features. We can denote each possible feature subset as a three digit binary string, where the value of each digit denotes whether the corresponding feature is included in the subset. For example, the string ‘101’ represents the subset which includes the first and third features from the original set of three. Our goal therefore, is to find which subset gives the best model performance. For any set of n features, there exist 2^n different subsets, so clearly an exhaustive search is out of the question for any data containing more than a handful of features.
The Advanced Idea
Now, let’s say each of these feature subsets is a node in a state space, and we form connections between those nodes that differ by only a single digit, i.e. the addition or deletion of a single feature. Our challenge therefore is to explore this state space in a manner which will quickly arrive at the most desirable feature subset. To accomplish this, we will employ the Best-First Search algorithm. To direct the search algorithm, we need a method of scoring each node in the state space. We will define the score as the mean accuracy after n-fold cross validation of models trained with the feature subset, minus a penalty of 0.1% per feature. This penalty is introduced to mitigate overfitting. As an added bonus, it also favors models which are quicker to build and evaluate. The above figure illustrates how a single node is evaluated using 3-fold cross validation. Lastly, we need to decide where in the state space to begin the search. Most of the branching will occur in the nodes encountered earlier in the search, so we will begin at the subset containing zero features, so that less time is spent building and evaluating models to explore these early nodes.
Testing the Idea
We’ll try this approach using the credit-screening dataset from the UC Irvine Machine Learning Repository, which contains 15 categorical and numerical features. Building a model using all the features, we get an average accuracy of 81.74% from five-fold cross validation. Not bad, but maybe we can do better… BigML’s Python bindings make it simple and straightforward to implement the search algorithm described above. You can find such an implementation using scikit-learn to create kfold cross validation data sources here. After running the feature subset search, we arrive at feature subset containing just 3 features, which achieves an average cross-validation accuracy of 86.37%. Moreover, we arrived at this result after evaluating only 117 different feature subsets, a far cry from 32768 which would be required for an exhaustive search.