A month ago we presented BigMLer, an open-source command line tool that enables the creation of BigML datasets, models and predictions in the Cloud in the twinkling of an eye. We have been working hard on improvements to this tool, putting more capabilities on your command line to manage the Machine Learning Cloud.
Recently, a new member joined the BigML family: evaluations. Everything is being adapted to welcome the newcomer, so we now present an evolved version of BigMLer that extends and improves our existing features. It makes managing your ML resources easier than flying a kite: child’s play!
It’s raining evaluations. Hallelujah!
BigMLer sits on the shoulders of BigML’s API Python bindings to keep you connected to your cloud resources. A new version of the bindings has just been released which includes a REST API interface for evaluations similar to the ones already available for other resources.
Evaluations will tell you how your model performs, showing its accuracy, precision, recall, F-measure and Phi coefficient. To create a new evaluation two main ingredients are needed: a model to be evaluated and a dataset to feed the model and evaluate it, but the only thing you need to start working is a bunch of data that seems to be related to some property you would like to predict.
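To make those measures concrete, here is a minimal sketch of how they can be derived from a two-class confusion matrix. BigML computes them for you remotely; the function name `binary_metrics` and the example counts are ours, chosen only for illustration, and multi-class evaluations would need per-class averaging that is not shown here.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Compute common evaluation measures from a 2x2 confusion matrix.

    tp, fp, fn, tn are the counts of true positives, false positives,
    false negatives and true negatives (illustrative inputs).
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-measure: harmonic mean of precision and recall.
    f_measure = (2 * precision * recall / (precision + recall)
                 if (precision + recall) else 0.0)
    # Phi coefficient (also known as Matthews correlation coefficient).
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    phi = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "phi": phi,
    }


# Made-up counts for a test set of 100 instances.
metrics = binary_metrics(tp=40, fp=10, fn=5, tn=45)
```

With these counts the accuracy is 0.85 and the precision is 0.8, while the Phi coefficient stays between -1 and 1 and rewards models that get both classes right.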
Usually, the data is split into a training dataset, which contains the bigger part (say 80%) and is used to build the model, and a test dataset, which holds the remaining data (20%) that was not used in model construction. Then you should build the model with the training dataset, create a dataset from the test data, and then the eval… Wait! This simple command
bigmler --train data/iris.csv --evaluate
does the job for you:
- creates a source from your data file
- creates the corresponding dataset
- creates the associated model using only 80% of the data
- creates an evaluation of the model using the 20% that was originally left out, and saves the results both in a printable text format and in JSON.
What do you think of it? Easy, huh?
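If you are curious about what an 80/20 split means in practice, the idea can be sketched locally in a few lines. This is only an illustration of the concept: BigMLer performs the sampling on the BigML side, and the helper `split_rows`, its `seed` parameter and the 150 dummy rows are our own assumptions.

```python
import random


def split_rows(rows, train_rate=0.8, seed=42):
    """Shuffle the rows and split them into training and test parts."""
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_rate)
    return shuffled[:cut], shuffled[cut:]


# 150 dummy row ids (the iris data happens to have 150 instances).
rows = list(range(150))
train, test = split_rows(rows)
```

The two parts never overlap, so the evaluation only sees data that the model was not trained on.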
Of course, you can follow the traditional way with BigMLer too: first create the model and the test dataset (for more info on how to create a model or a dataset using BigMLer please refer to our first blog post) and then generate an evaluation using the corresponding identifiers
bigmler --model model/50a1f43deabcb404d3000079 \
        --dataset dataset/50a1f441035d0706d9000371 --evaluate
or, if you still haven’t created a test dataset, do it in just one call
bigmler --model model/50a1f43deabcb404d3000079 \
        --test data/test_iris.csv --evaluate
That command generates a BigML source and dataset from data/test_iris.csv, then performs an evaluation of the given model using the generated dataset.
At this moment you may be thinking: but where can I find these identifiers? And where are evaluations stored? As we mentioned in the first post, each new BigMLer call stores its result files in a separate folder. You can choose the folder, but if you don’t, a new one is created and named using the date and time (e.g. TueJan2213_182243). There you will find the prediction or evaluation results. And what about the rest of the objects it creates on the way? They can also be found there, but here BigMLer has evolved too.
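The example folder name looks like a compact timestamp: weekday, month, day, two-digit year, then hours, minutes and seconds. The following sketch reproduces that pattern. The format string is inferred from the single example above, not taken from BigMLer's code, and `default_output_dir` is a hypothetical helper name.

```python
from datetime import datetime


def default_output_dir(moment):
    """Build a folder name such as TueJan2213_182243 from a datetime.

    Note: %a and %b depend on the active locale; the English
    abbreviations shown here assume the default "C" locale.
    """
    return moment.strftime("%a%b%d%y_%H%M%S")


# Tuesday, January 22nd 2013 at 18:22:43.
name = default_output_dir(datetime(2013, 1, 22, 18, 22, 43))
```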
Tell me more, tell me more
In its first incarnation, BigMLer stored the ids of the objects it created under a particular folder, going about its laborious business without a word. The current version has become more communicative, and we can see the evolution of our resources as we go. For instance, if you issue a typical prediction command:
bigmler --train data/iris.csv --test data/test_iris.csv
BigMLer will show you the steps it goes through to fulfill your request. No more holding your breath between every blink of the cursor. You will see what’s happening instantly and can inspect live each newly created remote resource using its link.
[2013-01-22 18:22:43] Creating source.
[2013-01-22 18:22:46] Source created: https://bigml.com/dashboard/source/50fecae337203f19ca0032ed
[2013-01-22 18:22:46] Creating dataset.
[2013-01-22 18:22:47] Dataset created: https://bigml.com/dashboard/dataset/50fecae637203f19ca0032f1
[2013-01-22 18:22:47] Creating model.
[2013-01-22 18:22:48] Model created: https://bigml.com/dashboard/model/50fecae737203f19ca0032f4.
[2013-01-22 18:22:48] Retrieving model. https://bigml.com/dashboard/model/50fecae737203f19ca0032f4
[2013-01-22 18:22:48] Creating local predictions.

Generated files:

TueJan2213_182243
 ├─source
 ├─predictions.csv
 ├─models
 ├─bigmler_sessions
 └─dataset
As you can see, there’s also a summary of the generated files at the end. In the example, the newly created TueJan2213_182243 folder contains a file for each created resource, another for the local predictions and a bigmler_sessions file that stores the same info shown in the console.
Of course, you can set --verbosity 0 and keep it as silent and peaceful as it was.
Come together, right now, to predict
Modern Machine Learning techniques rely on ensembles of many models to achieve the best predictions. We already explained how BigMLer can help you build predictions using ensembles of models. Each of the models issues a prediction, and the predictions are then weighted and combined to produce the final ensemble prediction. With BigMLer, you basically ask for the number of models you want to generate and BigMLer does the rest. But what if you could generate predictions by combining the models of two ensembles already on your local computer? BigMLer can. Suppose you used a first ensemble of 20 models to predict a test file
bigmler --train data/iris.csv --number-of-models 20 \
        --sample-rate 0.8 --test data/test_iris.csv \
        --output my_dir1/predictions.csv
and you’d like to double the number of models involved in the ensemble. Instead of starting all over again with 40 models, you could simply create a second 20-model ensemble
bigmler --train data/iris.csv --number-of-models 20 \
        --sample-rate 0.8 --test data/test_iris.csv \
        --output my_dir2/predictions.csv
and combine their predictions to achieve a 40-model ensemble prediction
bigmler --combine-votes my_dir1,my_dir2
Or maybe your data evolves: say you build one ensemble per week and want to predict using the models of the last three weeks. You could create last week’s ensemble with the new data and combine its votes with those of the two previously existing ones to reach your goal.
What’s more, BigML supports different voting methods. Each model’s prediction can be considered as one vote (plurality), the confidence of the prediction can be used as a weight (“confidence weighted”), or the probability of the prediction according to the training data in the prediction node can be used as a weight (“probability weighted”). The default method used to combine the votes is plurality, but it can be changed to the one that suits you best by simply adding the --method flag
bigmler --combine-votes my_dir1,my_dir2 --method "confidence weighted"
and the information stored in each model’s local prediction file is used with the new voting algorithm. No latency, no remote connection: everything you need is stored on your computer and ready to use.
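To see how the choice of method can change the outcome, here is a small sketch of plurality and confidence-weighted voting over a handful of model predictions. It illustrates the idea only and is not BigMLer's implementation: the function `combine_votes`, the (category, confidence) tuples and the example confidences are our own assumptions, and “probability weighted” is left out because it needs each node's class distribution.

```python
from collections import defaultdict


def combine_votes(predictions, method="plurality"):
    """Return the winning category for a list of model predictions.

    predictions: list of (category, confidence) tuples, one per model.
    method: "plurality" (one vote per model) or "confidence weighted"
            (each vote weighs as much as the model's confidence).
    """
    totals = defaultdict(float)
    for category, confidence in predictions:
        if method == "plurality":
            totals[category] += 1.0
        elif method == "confidence weighted":
            totals[category] += confidence
        else:
            raise ValueError("Unsupported method: %s" % method)
    # On ties, the category seen first wins.
    return max(totals, key=totals.get)


# Three models vote on the same test instance (made-up confidences).
votes = [
    ("Iris-setosa", 0.30),
    ("Iris-setosa", 0.35),
    ("Iris-virginica", 0.90),
]
plurality_winner = combine_votes(votes)
weighted_winner = combine_votes(votes, method="confidence weighted")
```

Here two hesitant models outvote one confident model under plurality, while the confidence-weighted method lets the confident one win (0.90 against 0.65).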
Don’t give up
As you certainly know, reality can be a drag and many things (network failures, to name one) can disturb BigMLer’s connection to the ML-Cloud. When our kite’s rope breaks, BigMLer’s process stops, and this can be rather disappointing, especially when you are about to finish the 999th model of your 1000-model ensemble.
To help you in such cases, the --resume flag comes to your rescue.
This will recover the last issued command and check its steps to resume work right after the last completed one. Again, in the summary of the resume process you can see the point at which normal processing restarts:
Resuming command:
bigmler --train data/iris.csv

[2013-01-22 20:06:49] Retrieving source. https://bigml.com/dashboard/source/50fee34637203f19ca003317
[2013-01-22 20:06:49] Dataset not found. Resuming.
[2013-01-22 20:06:49] Creating dataset.
[2013-01-22 20:06:49] Dataset created: https://bigml.com/dashboard/dataset/50fee34937203f19ca00331e
[2013-01-22 20:06:49] Creating model.
[2013-01-22 20:06:50] Model created: https://bigml.com/dashboard/model/50fee34937203f19ca003321.

Generated files:

TueJan2213_200645
 ├─source
 ├─models
 ├─bigmler_sessions
 └─dataset
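The log above suggests a simple pattern: walk through the steps in order, reuse whatever was already created and pick up at the first missing resource. The sketch below shows that pattern in isolation. It is not BigMLer's code: `STEPS`, `first_pending_step`, `resume` and the `create` callback are hypothetical names, and the freshly created ids are placeholders.

```python
# Ordered steps of a plain "bigmler --train" run (simplified assumption).
STEPS = ["source", "dataset", "model"]


def first_pending_step(completed):
    """Return the index of the first step that has no stored id yet."""
    for index, step in enumerate(STEPS):
        if step not in completed:
            return index
    return len(STEPS)


def resume(completed, create):
    """Run only the missing steps, recording the new ids in `completed`."""
    log = []
    for step in STEPS[first_pending_step(completed):]:
        log.append("Creating %s." % step)
        completed[step] = create(step)
    return log


# The source survived the interrupted run; dataset and model did not.
completed = {"source": "source/50fee34637203f19ca003317"}
log = resume(completed, create=lambda step: "%s/placeholder-id" % step)
```

After the call, `log` lists only the dataset and model steps, and the stored source id is left untouched.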
So many flags, so little time, what can I do?
With great power comes… a great number of flags! And maybe their standard default values are not the ones you use most often, so you have to type your own defaults again and again… No more of that! BigMLer will use your own defaults.
If you place a bigmler.ini configuration file in your working directory, BigMLer will read its configuration parameters from there. Of course, any flag value that you add to the BigMLer command prevails over the user defaults. The syntax of the file should be:
[BigMLer]
dev = true
resources_log = ./my_log.log
where the flag values to be used are placed, one per line, under a [BigMLer] section. As you can see, the dashes in the flag names are translated to underscores.
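The precedence rule (command-line values win over the user defaults) and the dash-to-underscore translation can be sketched with Python's standard `configparser` module. This is an illustration under our own assumptions, not BigMLer's actual parsing code: `load_defaults` and `effective_options` are hypothetical helpers, and the file content is passed in as a string to keep the example self-contained.

```python
import configparser

# Same content as the bigmler.ini example above.
INI_TEXT = """
[BigMLer]
dev = true
resources_log = ./my_log.log
"""


def load_defaults(ini_text):
    """Read the [BigMLer] section into a plain dict of strings."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    if not parser.has_section("BigMLer"):
        return {}
    return dict(parser.items("BigMLer"))


def effective_options(user_defaults, command_line_flags):
    """Merge both sources; values given on the command line prevail."""
    options = dict(user_defaults)
    for flag, value in command_line_flags.items():
        # "--resources-log" becomes "resources_log".
        options[flag.lstrip("-").replace("-", "_")] = value
    return options


defaults = load_defaults(INI_TEXT)
options = effective_options(defaults, {"--resources-log": "./other.log"})
```

Note that `configparser` returns every value as a string, so a real tool would still need to convert entries such as `dev = true` into booleans.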
Each working directory can have its own bigmler.ini file. Thus, you could have a ~/dev folder to work with development resources using the example user defaults, and another ~/release folder with no bigmler.ini file to generate the final complete resources. Just by changing directories you would generate the resources in the correct environment, with no need to bother adding the required flags. This is just a glimpse of what you could do, but consider the gain if you have to build models with a 0.7 sample rate, statistical pruning, logging all the predictions… You’ll probably make good use of bigmler.ini.
We’ve seen that BigMLer’s scope now includes evaluations, and that the tool has become more transparent, more robust and better overall. Want anything else? Let us know and stay tuned!