Fly your ML-Cloud like a kite with BigMLer: the command-line tool for Machine Learning

A month ago we presented BigMLer, an open-source command-line tool that creates BigML datasets, models and predictions in the Cloud in the twinkling of an eye. We have been working hard on improvements to this tool, putting more capabilities on your command line to manage the Machine Learning Cloud.

Recently, a new member has joined the BigML family: evaluations, and everything is being adapted to welcome the newcomer. That’s why we now present an evolved version of BigMLer that extends and improves our existing features. It makes managing your ML resources easier than flying a kite: child’s play!

It’s raining evaluations. Hallelujah!

BigMLer sits on the shoulders of BigML’s API Python bindings to keep you connected to your cloud resources. A new version of the bindings has just been released, which includes an interface to the REST API for evaluations similar to the ones already available for other resources.

Evaluations tell you how your model performs, showing its accuracy, precision, recall, F-measure and Phi coefficient. Creating an evaluation takes two main ingredients: a model to be evaluated and a dataset to feed to the model. To start working, though, all you need is a bunch of data that seems to be related to some property you would like to predict.
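
To make those five measures concrete, here is a minimal Python sketch that computes them from the four counts of a two-class confusion matrix. The function name `binary_metrics` is hypothetical, and the sketch only covers the binary case; how BigML aggregates the measures across several classes is not described in this post.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Compute common evaluation measures from binary confusion-matrix counts.

    tp/fp/fn/tn are the numbers of true positives, false positives,
    false negatives and true negatives. Hypothetical helper, not BigML code.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-measure: harmonic mean of precision and recall.
    f_measure = (2 * precision * recall / (precision + recall)
                 if (precision + recall) else 0.0)
    # Phi coefficient: correlation between predicted and actual classes.
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    phi = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "phi": phi,
    }


if __name__ == "__main__":
    print(binary_metrics(tp=8, fp=2, fn=2, tn=8))
```

Unlike accuracy, the Phi coefficient ranges from -1 to 1 and stays near 0 for a model that guesses at random, which makes it a useful companion when classes are unbalanced.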

Usually, the data is split into a training dataset, which contains the biggest part of it (say 80%) and is used to build the model, and a test dataset, which holds the remaining data that was not used in model construction (20%). Then you should build the model with the training dataset, create a second dataset from the test data and then the eval… Wait! This simple command

bigmler --train data/iris.csv --evaluate

does the job for you:

  • creates a source from your data file
  • creates the corresponding dataset
  • creates the associated model using only 80% of the data
  • creates an evaluation of the model using the 20% that was originally left out, and saves the results both in a printable text format and in JSON.

What do you think of it? Easy, huh?
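
For readers who want to see what the split-and-evaluate steps listed above amount to, here is a rough local sketch in Python. Everything in it is an assumption made for illustration: `split_rows` and `accuracy` are hypothetical helpers, the shuffle-and-cut split is just one simple way to hold out 20% of the rows, and the toy `predict` function stands in for a real model. BigML does this work remotely and its sampling may differ.

```python
import random


def split_rows(rows, train_fraction=0.8, seed=0):
    """Shuffle rows reproducibly and cut them into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


def accuracy(predict, test_rows):
    """Fraction of held-out rows whose label the model predicts correctly."""
    if not test_rows:
        return 0.0
    hits = sum(1 for features, label in test_rows if predict(features) == label)
    return hits / len(test_rows)


if __name__ == "__main__":
    # Ten toy rows: one numeric feature and a label derived from it.
    rows = [((i,), "big" if i >= 5 else "small") for i in range(10)]
    train, test = split_rows(rows)

    # Stand-in for a model built from `train` (hypothetical).
    def predict(features):
        return "big" if features[0] >= 5 else "small"

    print(len(train), len(test), accuracy(predict, test))
```

The point of holding rows out is that the evaluation only counts predictions on data the model never saw while it was being built.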

Of course, you can follow the traditional way with BigMLer too: first create the model and the test dataset (for more info on how to create a model or a dataset using BigMLer, please refer to our first blog post), then generate an evaluation using the corresponding identifiers

bigmler --model model/50a1f43deabcb404d3000079 \
        --dataset dataset/50a1f441035d0706d9000371 --evaluate

or, if you still haven’t created a test dataset, do it in just one call

bigmler --model model/50a1f43deabcb404d3000079 \
        --test data/test_iris.csv --evaluate

That command generates a BigML source and dataset from data/test_iris.csv, then evaluates the given model using the generated dataset.

At this moment you may be thinking: but where can I find these identifiers? And where are evaluations stored? As we mentioned in the first post, each new BigMLer call stores its result files in a separate folder. You can choose the folder, but if you don’t, a new one is created and named using the date and time of the call (e.g. TueJan2213_182243). There you will find the prediction or evaluation results. And what about the rest of the objects it creates on the way? They can also be found there, but here BigMLer has evolved too.
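
The example folder name reads as weekday, month, day and two-digit year, followed by the time of day. The Python sketch below reproduces a name of that shape. The format string is inferred from that single example and is an assumption, as is the helper name `default_output_dir`; BigMLer's own naming code is not shown in this post.

```python
from datetime import datetime


def default_output_dir(moment):
    """Build a folder name shaped like the example TueJan2213_182243.

    Assumed layout: weekday, month, day, two-digit year, underscore,
    then hours, minutes and seconds. %a and %b depend on the locale;
    English abbreviations are assumed here.
    """
    return moment.strftime("%a%b%d%y_%H%M%S")


if __name__ == "__main__":
    print(default_output_dir(datetime.now()))
```

A name built this way changes every second, so two calls issued at different times never write into the same folder.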

Tell me more, tell me more

In its first incarnation, BigMLer stored the ids of the objects it created under a particular folder, going about its laborious business without a word. The current version has become more communicative, so we can watch our resources evolve as we go. For instance, if you issue a typical prediction command:

bigmler --train data/iris.csv --test data/test_iris.csv

BigMLer will show you the steps it goes through to fulfill your request. No more holding your breath between every blink of the cursor. You will see what’s happening instantly and can inspect live each newly created remote resource using its link.

[2013-01-22 18:22:43] Creating source.
[2013-01-22 18:22:46] Source created:
[2013-01-22 18:22:46] Creating dataset.
[2013-01-22 18:22:47] Dataset created:
[2013-01-22 18:22:47] Creating model.
[2013-01-22 18:22:48] Model created:
[2013-01-22 18:22:48] Retrieving model.
[2013-01-22 18:22:48] Creating local predictions.

Generated files:


As you can see, there’s also a summary of the generated files at the end. In the example, the newly created TueJan2213_182243 folder contains a file for each created resource, another for the local predictions and a bigmler_sessions file that stores the same info shown in the console.

Of course, you can set --verbosity to 0 and keep it as silent and peaceful as it was.

Come together, right now, to predict

Modern Machine Learning techniques rely on ensembles of many models to achieve the best predictions. We already explained how BigMLer can help you build predictions using ensembles of models. Each of the models issues a prediction, and the predictions are then weighted and combined to produce the final ensemble prediction. With BigMLer, you basically ask for the number of models you want to generate and BigMLer does the rest. But what if you could generate predictions by combining the models of two ensembles already on your local computer? BigMLer can. Suppose you used a first ensemble of 20 models to predict a test file

bigmler --train data/iris.csv --number-of-models 20 \
        --sample-rate 0.8 --test data/test_iris.csv \
        --output my_dir1/predictions.csv

and you’d like to double the number of models involved in the ensemble. Instead of starting all over again with 40 models, you could simply create a second 20-model ensemble

bigmler --train data/iris.csv --number-of-models 20 \
        --sample-rate 0.8 --test data/test_iris.csv \
        --output my_dir2/predictions.csv

and combine their predictions to achieve a 40-model ensemble prediction

bigmler --combine-votes my_dir1,my_dir2

Or maybe your data evolves: say you build an ensemble per week and want to predict using the models of the last three weeks. You could create last week’s ensemble with the new data and combine its votes with those of the two previously existing ones to reach your goal.

What’s more, BigML supports different voting methods. Each model’s prediction can be counted as one vote (plurality), the confidence of the prediction can be used as its weight (“confidence weighted”), or the probability of the prediction according to the training data in the prediction node can be used as its weight (“probability weighted”). The default method used to combine the votes is plurality, but it can be changed to the one that suits you best by simply adding the --method flag

bigmler --combine-votes my_dir1,my_dir2 \
        --method "confidence weighted"

and the information stored in each model’s local prediction file is used with the new voting algorithm. No latency, no remote connection: everything you need is stored on your computer and ready to use.
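
The three voting methods can be illustrated with a small Python sketch. It is not BigMLer's implementation: the `combine_votes` function, the dictionary keys and the tie-breaking rule are all assumptions chosen to keep the example short. Each vote is taken to be one model's prediction for the same test row.

```python
from collections import defaultdict

# Which field of a vote is used as its weight for each method (assumed names).
WEIGHT_KEYS = {
    "plurality": None,                     # every vote counts as 1
    "confidence weighted": "confidence",
    "probability weighted": "probability",
}


def combine_votes(votes, method="plurality"):
    """Return the winning label among the votes cast for a single row."""
    if method not in WEIGHT_KEYS:
        raise ValueError("unknown method: %r" % method)
    if not votes:
        raise ValueError("no votes to combine")
    weight_key = WEIGHT_KEYS[method]
    totals = defaultdict(float)
    for vote in votes:
        weight = 1.0 if weight_key is None else vote[weight_key]
        totals[vote["prediction"]] += weight
    # Ties go to the label that was seen first (an arbitrary choice).
    return max(totals, key=totals.get)


if __name__ == "__main__":
    votes = [
        {"prediction": "Iris-setosa", "confidence": 0.3, "probability": 0.5},
        {"prediction": "Iris-setosa", "confidence": 0.3, "probability": 0.5},
        {"prediction": "Iris-virginica", "confidence": 0.9, "probability": 0.95},
    ]
    for method in WEIGHT_KEYS:
        print(method, "->", combine_votes(votes, method))
```

With these made-up votes, two lukewarm models outvote one under plurality, while a single highly confident model wins under confidence weighting, which is why the choice of method can change the final prediction.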

Don’t give up

As you certainly know, reality can be a drag and many things (network failures, to name one) can disturb BigMLer’s connection to the ML-Cloud. When our kite’s string breaks, BigMLer’s process stops, and this can be rather disappointing, especially when you are about to finish the 999th model of your 1000-model ensemble.

To help you in such cases, the --resume flag comes to your rescue.

bigmler --resume

This will recover the last issued command and check its steps to resume work from the last completed one onwards. Again, in the summary of the resume process you can see the point at which normal processing restarts:

Resuming command:
bigmler --train data/iris.csv

[2013-01-22 20:06:49] Retrieving source.
[2013-01-22 20:06:49] Dataset not found. Resuming.
[2013-01-22 20:06:49] Creating dataset.
[2013-01-22 20:06:49] Dataset created:
[2013-01-22 20:06:49] Creating model.
[2013-01-22 20:06:50] Model created:

Generated files:
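
The idea behind resuming can be sketched in a few lines of Python: walk through the steps in order, reuse what was already finished, and start creating from the first missing step. This is only an illustration under assumptions; the `resume` function, its log messages and the resource ids in the demo are hypothetical and do not reproduce BigMLer's internals.

```python
def resume(steps, completed, create):
    """Re-run a pipeline, reusing the steps that finished before the failure.

    steps:     ordered step names, e.g. ["source", "dataset", "model"]
    completed: dict mapping finished step names to their resource ids
    create:    callable that builds the resource for a step and returns its id
    """
    resources = {}
    log = []
    resuming = False
    for step in steps:
        if not resuming and step in completed:
            # Already done in the interrupted run: just reuse it.
            resources[step] = completed[step]
            log.append("Retrieving %s." % step)
            continue
        # From the first missing step on, everything is built again,
        # because later steps depend on the earlier ones.
        resuming = True
        log.append("Creating %s." % step)
        resources[step] = create(step)
        log.append("%s created." % step.capitalize())
    return resources, log


if __name__ == "__main__":
    # Hypothetical ids: only the source survived the interrupted run.
    finished = {"source": "source/abc"}
    resources, log = resume(["source", "dataset", "model"], finished,
                            create=lambda step: step + "/new")
    print("\n".join(log))
```

Keeping a record of each finished step is what makes this possible, which is one more reason the per-call output folder is useful.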


So many flags, so little time, what can I do?

With great power comes… great numbers of flags! And maybe their standard default values are not the ones you use most often, so you have to type your own values again and again… No more of that! BigMLer will use your own defaults.

If you place a bigmler.ini configuration file in your working directory, BigMLer will read its configuration parameters from there. Of course, any flag value that you add to the BigMLer command prevails over the user defaults. The syntax of the file should be:

[BigMLer]
dev = true
resources_log = ./my_log.log

where the flag values to be used are placed, one per line, under a [BigMLer] section. As you can see, the dashes in flag names are translated to underscores.
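
As an illustration of how such a file can be read, and of how command-line values take precedence over user defaults, here is a minimal sketch using Python's standard configparser module. The helper names are hypothetical and the sketch makes no claim about how BigMLer itself parses the file.

```python
import configparser


def load_user_defaults(ini_text, section="BigMLer"):
    """Return the options found under the given section as a plain dict."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    if not parser.has_section(section):
        return {}
    return dict(parser.items(section))


def as_flag(option_name):
    """Translate an option name from the file back into its flag form."""
    return "--" + option_name.replace("_", "-")


def effective_options(user_defaults, command_line):
    """Merge both sources; values given on the command line win."""
    merged = dict(user_defaults)
    merged.update(command_line)
    return merged


if __name__ == "__main__":
    ini_text = "[BigMLer]\ndev = true\nresources_log = ./my_log.log\n"
    defaults = load_user_defaults(ini_text)
    print(defaults)
    print([as_flag(name) for name in defaults])
    print(effective_options(defaults, {"resources_log": "./other.log"}))
```

Note that configparser returns every value as a string, so a real tool would still need to convert entries such as dev = true into booleans.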

Each working directory can have its own bigmler.ini file. Thus, you could have a ~/dev folder to work with development resources using the example user defaults and another ~/release folder with no bigmler.ini file to generate the final complete resources. Just by changing directories you would generate the resources in the correct environment, with no need to add the required flags each time. This is just a glimpse of what you could do, but consider the gain if you have to build models with a 0.7 sample rate, statistical pruning, logging all the predictions… You’ll probably make good use of bigmler.ini.

We’ve seen that BigMLer’s scope now includes evaluations, and that it has become more transparent, safer, and better overall. Want anything else? Let us know and stay tuned!

