Last time we talked about BigMLer, we saw that the list of BigML resources manageable from the command line included sources, datasets, models, predictions and evaluations. Since then, we’ve been working hard on BigMLer to bring even more cloud-based Machine Learning power to the comfort of your local computer.
In this post we introduce the three latest additions to BigMLer: dataset splitting, cross-validation and ensembles.
Splitting your data: train and test datasets
As we saw in previous BigMLer blog posts, the --sample-rate flag lets you control the fraction of data that will be used to build models and, consequently, the percentage left for evaluations. However, for finer control over your test data you may want to permanently set aside a group of instances from your dataset. OK, says BigMLer, let's split your dataset:
bigmler --train data/iris.csv --test-split 0.2 --evaluate
With this single line you will:
- create a dataset with the entire data
- split the dataset into a test one, holding 20% of the instances, and a training one with the remaining 80%
- create a model with the training dataset
- evaluate this model with the test dataset
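The hold-out split itself is easy to picture. Below is a minimal Python sketch of the same 80/20 idea run locally on plain rows; the function name and toy data are our own illustration, not part of BigMLer:

```python
import random

def train_test_split(rows, test_split=0.2, seed=42):
    """Shuffle a copy of the rows and hold out a fraction as the test
    set -- a local sketch of what --test-split 0.2 does server-side."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_split)
    return shuffled[n_test:], shuffled[:n_test]  # train, test

rows = list(range(150))  # stand-in for the 150 iris instances
train, test = train_test_split(rows, 0.2)
print(len(train), len(test))  # 120 30
```

Note that the two subsets are disjoint and together cover the original data, just like the dataset_train and dataset_test resources BigMLer creates.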
Your training and test data will then be accessible as regular dataset objects that can be recovered independently or even split again. If you have a look at the command’s output, you’ll see dataset_train and dataset_test files, which contain the ids for the generated splits. Let’s say that your training dataset is
dataset/5188ddfa37203f085c000008 and your test dataset is
dataset/5188ddfd37203f085c00000b. Then if you want to try the same data but in a model built with statistical pruning:
bigmler --dataset dataset/5188ddfa37203f085c000008 \
    --pruning statistical
will build the new model
bigmler --dataset dataset/5188ddfd37203f085c00000b \
    --model model/518954b437203f1a6f000000 --evaluate
will use the same test dataset for the evaluation. Then you’ll be able to compare your models and choose the one that performs best.
While splitting a dataset once for evaluating a model is good, sometimes you want to create a handful of different splits to avoid basing your conclusions on a biased sample. Here again, BigMLer can help.
Tuning your models: cross-validation
Usually, when you build a model you can adjust some parameters to improve its performance. For instance, BigML lets you choose among several pruning modes. Each choice yields a different decision tree, but how do you know which one works best for your data? Cross-validation can help. As you may know, cross-validation is a technique for estimating the validity of your model using samples of your training data. Several types of cross-validation are in common use, but all of them proceed by building many models from samples of your training data and testing them on the data held out during sampling.
In BigMLer we’ve chosen a Monte Carlo variant of the cross-validation algorithm, which repeatedly splits the dataset at random into training and test subsets, builds the corresponding models, evaluates them, and averages the results. In this kind of algorithm, there’s no a priori relation between the size of the evaluation sample and the number of evaluations you run to validate (other than that both should be large enough to ensure good coverage of your data). Never mind the details; you just have to say:
bigmler --train data/iris.csv --cross-validation-rate 0.1
and the tool will take care of the job, namely:
- create a dataset with all of the training data
- hold out a random sample of 10% of the dataset to run evaluations
- use the remaining 90% to build a partial model
- evaluate the model with the held out data
- repeat the previous steps with a different random sample 2 * n times, where n is the percentage of held-out data (in this case, 20 runs), to reduce variance
- finally, average all the partial models’ evaluations to get a close estimate of the complete model’s performance
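The steps above can be sketched in a few lines of Python. Everything here is a toy stand-in (the evaluate function just scores a majority-class guess) meant only to illustrate the repeated-random-subsampling loop, not BigMLer’s actual internals:

```python
import random
import statistics
from collections import Counter

def evaluate(train, test):
    """Toy stand-in for model building + evaluation: predict the
    majority label seen in the training subset, score accuracy on test."""
    majority = Counter(lbl for _, lbl in train).most_common(1)[0][0]
    return sum(1 for _, lbl in test if lbl == majority) / len(test)

def monte_carlo_cv(rows, holdout=0.1, seed=0):
    """Repeated random subsampling: run 2 * n evaluations, where n is
    the held-out percentage (here 10% -> 20 runs), then average them."""
    n_runs = 2 * int(holdout * 100)
    rng = random.Random(seed)
    scores = []
    for _ in range(n_runs):
        shuffled = rows[:]
        rng.shuffle(shuffled)           # a fresh random split each run
        n_test = int(len(shuffled) * holdout)
        scores.append(evaluate(shuffled[n_test:], shuffled[:n_test]))
    return statistics.mean(scores)

# toy labeled data: two classes, "a" about twice as common as "b"
data = [(i, "a" if i % 3 else "b") for i in range(100)]
print(round(monte_carlo_cv(data), 3))
```

Averaging over many random splits is what makes the final figure a more stable estimate than any single evaluation.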
Or, if you want to choose the number of evaluations yourself, use the --number-of-evaluations flag:
bigmler --train data/iris.csv --cross-validation-rate 0.1 \
    --number-of-evaluations 20
BigMLer will store the results in an output directory (see BigMLer’s last blog post for details) where you will find cross_validation.json and cross_validation.txt files, which will contain the average of all the models’ evaluations.
Of course, sometimes a single model can fail to perform well for your data. Again, BigMLer has a solution for you.
Models working together: ensembles
Ensembles are groups of models built by sampling a single dataset. Thus, the models in an ensemble are all different yet similar, since they are built from overlapping samples of the same data. As we saw in detail in previous posts, an ensemble’s predictions are usually more accurate than a single model’s because the ensemble’s diversity smooths out small variations in the individual models while reinforcing their shared features. If you’ve read our first BigMLer post you’ll already be familiar with ensembles, as they were available in BigMLer long before they made their appearance in the new version of BigML’s Python bindings. Nevertheless, now that ensembles are first-class citizens in the BigML API, BigMLer handles them as one of your regular resources.
For example, say you created
ensemble/51630c4e37203f2292000082 and want to use it to generate your predictions locally. Just let BigMLer do it:
bigmler --ensemble ensemble/51630c4e37203f2292000082 \
    --test data/test_iris.csv
Using this command, the ensemble information is downloaded and the model predictions are combined into a final prediction stored in predictions.csv.
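For classification, combining the models’ predictions typically comes down to a plurality vote: each model casts one vote and the most frequent class wins. A minimal Python sketch (the function name and sample votes are our own illustration, not BigMLer’s code):

```python
from collections import Counter

def combine_predictions(votes):
    """Plurality combiner: count each model's vote and return the
    class with the most votes."""
    return Counter(votes).most_common(1)[0][0]

votes = ["Iris-setosa", "Iris-virginica", "Iris-setosa"]
print(combine_predictions(votes))  # Iris-setosa
```

With three models voting as above, "Iris-setosa" wins 2 to 1 and would be the row written to predictions.csv.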
Similarly, evaluating the ensemble with test data amounts to:
bigmler --ensemble ensemble/51630c4e37203f2292000082 \
    --test data/iris.csv --evaluate
and you’ll find the usual evaluation.json and evaluation.txt in the output directory. How does that sound to you?
To sum up, BigMLer now includes new features like dataset splitting, model cross-validation, and ensemble predictions and evaluations to help you get the best out of your data. Want something else? Let us know and stay tuned!
Great work you are doing on simplifying machine learning. I tried your example with the 20-fold cross-validation. The accuracies you report in the cross-validation text files: are those from the test or the training set? Maybe I overlooked that both are present somewhere. Also, would it be possible to include standard deviations along with the averages of your performance measures?
Best wishes and keep up the great work!
Hi Andreas, we appreciate your kind words. They keep us rolling!
Answering your questions: to obtain the evaluation measures that are eventually averaged, we proceed as follows. The whole of your training data is divided into two disjoint subsets by random sampling. One subset is used to train a model (let’s call it the train subset) and the other (the test subset) to test it and evaluate its performance. This is done repeatedly (that’s why this is not k-fold cross-validation but repeated random subsampling) and the evaluations are finally averaged. Thus, each evaluation corresponds to a particular test subset, but each time the test subset is chosen randomly from your entire training data.
As to the results saved in the cross-validation file, we’ve reproduced the same structure defined in our evaluation results file, but extending it to include standard deviation measures is perfectly possible and would certainly be an improvement. We’ll add it to the wish list.
Thanks for the feedback!