Follow your data’s inner voice! Evaluation-guided techniques for Machine Learning

Spring has come, and the steady work of gardening data is starting to bloom in BigML. We’ve repeatedly stressed in our blog posts the importance of listening to what data has to tell us through evaluations. A couple of months ago, we published a post explaining how you could achieve accuracy improvements in your models by carefully selecting subsets of features used to build them. In BigMLer we stick to our evaluations to show us the way, and we’d like to introduce you to the ready-to-use evaluation-guided techniques that we’ve recently added to BigMLer’s wide list of capabilities: smart feature selection and node threshold selection. Both are available via a new BigMLer subcommand, analyze, which gives access to these new features through two new command line options. Now, the power of evaluation-directed modeling starts in BigMLer, your command line ML tool of choice.

k-fold cross-validation

To measure the performance of the models used in both procedures, we have included k-fold cross-validation in BigMLer. In k-fold cross-validation, the training dataset is divided into k subsets of equal size. One of the subsets is reserved as holdout data to evaluate the model built from the rest of the training data. The procedure is repeated for each of the k subsets and, finally, the cross-validation result is computed as the average of the k evaluations. The syntax for k-fold cross-validation in BigMLer is:

bigmler analyze --dataset dataset/536653050af5e86d9c01549e \
--cross-validation --k-folds 3

where dataset/536653050af5e86d9c01549e is the id of the training dataset (if you need help creating a dataset in BigML with BigMLer, you can read our previous posts).
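
To make the procedure concrete, here is a minimal Python sketch of k-fold cross-validation. It is an illustration only, not BigMLer’s implementation: `k_fold_indices`, `cross_validate`, `train_majority` and `accuracy` are hypothetical names, and the folds are built by simple interleaving, whereas a real tool would normally shuffle or sample the rows.

```python
from collections import Counter
from statistics import mean


def k_fold_indices(n_rows, k):
    """Split row indices 0..n_rows-1 into k folds of (nearly) equal size."""
    folds = [[] for _ in range(k)]
    for i in range(n_rows):
        folds[i % k].append(i)  # interleaved split; an assumption, not BigMLer's sampling
    return folds


def cross_validate(rows, k, train, evaluate):
    """Return the average of the k hold-out scores.

    `train(rows) -> model` and `evaluate(model, rows) -> float` are
    caller-supplied placeholders, not BigMLer functions.
    """
    scores = []
    for held_out in k_fold_indices(len(rows), k):
        held = set(held_out)
        training = [row for i, row in enumerate(rows) if i not in held]
        holdout = [rows[i] for i in held_out]
        model = train(training)          # build the model on the other k-1 folds
        scores.append(evaluate(model, holdout))  # score it on the reserved fold
    return mean(scores)


# Toy usage: a "model" that always predicts the most frequent training label.
def train_majority(rows):
    return Counter(label for _, label in rows).most_common(1)[0][0]


def accuracy(model, rows):
    return sum(label == model for _, label in rows) / len(rows)


rows = [(i, "spam" if i % 5 == 0 else "ham") for i in range(20)]
print(cross_validate(rows, 4, train_majority, accuracy))
```

Every row is used for evaluation exactly once, which is why the averaged score is a steadier estimate than a single train/test split.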

Smart feature selection

Those who read our previous post on this matter will remember that, following the article by Ron Kohavi and George H. John, the idea was to find the subset of available features that will produce a model with better performance. The key thing here is being clever about the way of searching for this feature subset, because the number of possibilities grows exponentially with the number of features. Kohavi and John use an algorithm that finds a shorter path by starting with one-feature models and scoring their performance to select the best one. Then, new subsets are repeatedly created by adding each remaining feature to the best subset, and their scores are used to select the best one again. The process ends when the score stops improving or you end up with the entire feature set. In that post, we implemented a command-line tool using BigML’s Python bindings to help you carry out that process using the accuracy (or r-squared for regressions) of BigML’s evaluations, but now BigMLer has been extended to include it as a regular option.
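
The search described above can be sketched in a few lines of Python. This is a simplified greedy version written only to illustrate the idea, not the code BigMLer runs: `forward_selection` and `toy_score` are hypothetical names, and in practice `score` would be something like the cross-validated accuracy of a model built on the candidate subset.

```python
def forward_selection(features, score):
    """Greedy forward search over feature subsets.

    `score(subset) -> number` is a caller-supplied placeholder, e.g. the
    cross-validated accuracy of a model trained on `subset`.
    """
    best_subset, best_score = [], float("-inf")
    remaining = list(features)
    while remaining:
        # Try adding each remaining feature to the current best subset.
        candidates = [(score(best_subset + [f]), f) for f in remaining]
        top_score, top_feature = max(candidates)
        if top_score <= best_score:
            break  # the score stopped improving
        best_subset.append(top_feature)
        best_score = top_score
        remaining.remove(top_feature)
    return best_subset, best_score


# Toy scoring function (made up for the example): "a" and "b" help,
# every other feature slightly hurts.
def toy_score(subset):
    gains = {"a": 30, "b": 20}
    return 50 + sum(gains.get(f, -5) for f in subset)


print(forward_selection(["a", "b", "c", "d"], toy_score))
```

With n features this evaluates at most n + (n - 1) + ... + 1 subsets instead of all 2^n, which is what makes the search affordable.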

So let’s say you have a dataset in BigML and you want to select the dataset’s features that produce better models. You just type:

bigmler analyze --features \
--dataset dataset/535f5ecc37203f272e0001b4
and the magic begins. BigMLer:

  • Creates a 5-fold split of the dataset to be used in evaluations.
  • Uses the smart feature selection algorithm to choose the subsets of features for model construction.
  • Creates the 5-fold evaluation of the models created with each subset of features and uses its accuracy to score the results, choosing the best subset.
  • Outputs the optimal feature subset and its accuracy.

You probably realize that this procedure generates a large number of resources in BigML. Each k-fold evaluation generates k datasets and, for each feature subset that is being tested, k more models and evaluations are created. Thus, you will probably want to tune the number of folds, as well as other parameters like the penalty per feature (used to avoid overfitting) or the number of iterations with no score improvement that causes the algorithm to stop. This is easy:

bigmler analyze --features \
--dataset dataset/535f5ecc37203f272e0001b4 \
--penalty 0.002 --staleness 3 --k-folds 2

will use a penalty of 0.002 per feature, evaluate with 2-fold cross-validation, and stop the search the third time the score fails to improve. You can even speed up the computation by parallelizing the creation of the k models (and evaluations). Using

bigmler analyze --features \
--dataset dataset/535f5ecc37203f272e0001b4 \
--k-folds 2 --max-parallel-models 2 \
--max-parallel-evaluations 2
all models and evaluations will be created in parallel.
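
The effect of the penalty and staleness settings described earlier can be pictured with the following Python sketch. It rests on two assumptions that the post does not spell out: that the penalty is subtracted from the score once per feature in the subset, and that staleness counts iterations without improvement. `penalized_search` and `toy_accuracy` are hypothetical names, not part of BigMLer.

```python
def penalized_search(features, accuracy, penalty=0.002, staleness=3):
    """Forward search that tolerates up to `staleness` non-improving steps.

    `accuracy(subset) -> float` is a caller-supplied placeholder.
    Assumption: score = accuracy - penalty * number_of_features.
    """
    best_subset, best_score = [], float("-inf")
    current, remaining, stale = [], list(features), 0
    while remaining and stale < staleness:
        scored = [
            (accuracy(current + [f]) - penalty * (len(current) + 1), f)
            for f in remaining
        ]
        top_score, top_feature = max(scored)
        current = current + [top_feature]  # keep growing even if not better
        remaining.remove(top_feature)
        if top_score > best_score:
            best_subset, best_score = current, top_score
            stale = 0
        else:
            stale += 1  # one more iteration without improvement
    return best_subset, best_score


# Toy accuracy (made up): "a" is worth 0.1, any other feature only 0.001.
def toy_accuracy(subset):
    extra = sum(1 for f in subset if f != "a")
    return 0.7 + (0.1 if "a" in subset else 0.0) + 0.001 * extra


print(penalized_search(list("abcde"), toy_accuracy))             # penalty discourages marginal features
print(penalized_search(list("abcde"), toy_accuracy, penalty=0))  # no penalty: marginal gains are accepted
```

Under these assumptions, a feature is only worth keeping when it improves the score by more than the penalty, which is how the penalty guards against overfitting.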

But sometimes good accuracy in evaluations will not lead you to a really good model for your Machine Learning problem. Such is the case for spam or anomaly detection problems, where you are interested in detecting the instances that correspond to very sparse classes. The trivial classifier that always predicts false for this class will have high accuracy in these cases, but of course won’t be of much help. That’s why it may sometimes be better to focus on precision, recall or composite metrics, such as phi or f-measure. BigMLer is ready to help here too:

bigmler analyze --features \
--dataset dataset/535f5ecc37203f272e0001b4 \
--maximize recall

will extract the subset of features that maximizes the evaluations’ recall.
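
A small worked example shows why accuracy can mislead on sparse classes. The Python sketch below computes the standard binary metrics from a confusion matrix; the data (10 positives in 1,000 instances) and the two classifiers are made up for illustration, and `binary_metrics` is a hypothetical helper, not a BigML function.

```python
from math import sqrt


def binary_metrics(actual, predicted):
    """Accuracy, precision, recall, f-measure and phi for boolean labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)
    tn = sum(1 for a, p in pairs if not a and not p)
    fp = sum(1 for a, p in pairs if not a and p)
    fn = sum(1 for a, p in pairs if a and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    phi = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f_measure": f_measure, "phi": phi}


actual = [True] * 10 + [False] * 990        # 1% of instances are positive
trivial = [False] * 1000                    # always predicts "false"
detector = [True] * 8 + [False] * 2 + [True] * 20 + [False] * 970

print(binary_metrics(actual, trivial))
print(binary_metrics(actual, detector))
```

The trivial classifier reaches 99% accuracy with zero recall, while the detector finds 8 of the 10 positives at a slightly lower accuracy of 97.8%. Ranking by accuracy would pick the useless model; ranking by recall, f-measure or phi would not.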

Node threshold selection

Another parameter that can be used to tune your models is their node threshold (that is, the maximum number of nodes in the final tree). Decision tree models usually grow a large number of nodes to fit the training data by maximizing some sort of information gain on each split. Sometimes, though, you may be interested in growing them only until they maximize a different evaluation measure. BigMLer offers the

bigmler analyze --nodes \
--dataset dataset/535f5ecc37203f272e0001b4
command, which looks for the node threshold that leads to the best score. The command will:

  • Generate a 5-fold split of a dataset to be used in evaluations
  • Grow models using node thresholds going from 3 (the minimum allowed value) to 2000, in steps of 100
  • Create the 5-fold evaluation of the models and use its accuracy as score
  • Stop when scores don’t improve or the node threshold limit is reached, and output the optimal node threshold and its accuracy

As in the previous section, the associated parameters (--k-folds, --min-nodes, --max-nodes and --nodes-step) can be configured to your liking. The --maximize option is also available to choose the evaluation metric that you prefer as the score. The full-fledged version looks like this:

bigmler analyze --nodes \
--dataset dataset/535f5ecc37203f272e0001b4 \
--min-nodes 5 --max-nodes 25 --nodes-step 5 \
--k-folds 2 --maximize phi

where you can choose the value of every parameter.
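
The node threshold search can be pictured as a simple sweep. The Python sketch below is an illustration under assumptions, not BigMLer’s code: `best_node_threshold` and `toy_score` are hypothetical names, `score` stands for the cross-validated metric of a model grown with the given threshold, and stopping at the first non-improving step is a simplification.

```python
def best_node_threshold(score, min_nodes=3, max_nodes=2000, nodes_step=100):
    """Sweep node thresholds and keep the best-scoring one.

    `score(threshold) -> number` is a caller-supplied placeholder, e.g. the
    k-fold phi of a model limited to `threshold` nodes.
    """
    best_threshold, best_score = None, float("-inf")
    for threshold in range(min_nodes, max_nodes + 1, nodes_step):
        current = score(threshold)
        if current <= best_score:
            break  # scores stopped improving (assumed stopping rule)
        best_threshold, best_score = threshold, current
    return best_threshold, best_score


# Toy score (made up) that peaks at 15 nodes.
def toy_score(threshold):
    return -(threshold - 15) ** 2


# Same limits as the command above: --min-nodes 5 --max-nodes 25 --nodes-step 5
print(best_node_threshold(toy_score, min_nodes=5, max_nodes=25, nodes_step=5))
```

A smaller step finds the optimum more precisely, at the cost of building and evaluating more models per fold.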

These are the first evaluation-based tools available in BigMLer to tend to your data, but we plan to add more to help you get the most out of your models. Spring has only just begun, and more of BigML’s buds are about to burst into blossom, providing colorful new features for you and your data. Stay tuned!
