
BigML Winter 2017 Release Webinar Video is Here!

As announced in our latest blog posts, Boosted Trees is the new supervised learning technique that BigML offers to help you solve your classification and regression problems. It is now up and running as part of our set of ensemble-based strategies available through the BigML Dashboard and our REST API.

If you missed the webinar broadcast yesterday, here is another chance to catch up on our latest addition. In fact, you can play it anytime you wish, since it’s available on the BigML Youtube channel.

Please visit our dedicated Winter 2017 Release page for more learning resources, including:

  • The Boosted Trees documentation to learn how to create, interpret and make predictions with this algorithm, from both the BigML Dashboard and the BigML API.
  • The series of six blog posts that guide you through the Boosted Trees journey step by step: starting with the basic concepts of this algorithm and how it differs from the other ensembles we offer; continuing with a use case and several examples of how to use Boosted Trees through the Dashboard or the API, and how to automate the workflows with WhizzML and the Python bindings; and finally wrapping up with the more technical side of how Boosted Trees work behind the scenes.

Many thanks for your attention, your questions, and the positive feedback after the webinar. We cannot wait to announce the next release!

The Down Low on Boosting

If you’ve been following our blog posts recently, you know that we’re about to release another variety of ensemble learner, Boosted Trees. Specifically, we’ve implemented a variation called gradient boosted trees.

Let’s quickly review our existing ensemble methods. Decision Forests take a dataset with either a categorical or numeric objective field and build multiple independent tree models using samples of the dataset (and/or fields). At prediction time, each model gets to vote on the outcome. The hope is that the mistakes of each tree will be independent from one another. Therefore, in aggregate, their predictions will come to the correct answer. In ML parlance, this is a way to reduce the variance.
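As a toy illustration of that voting step, here is a sketch of plurality voting for classification; the per-tree predictions below are made up:

```python
from collections import Counter

# Toy sketch of Decision Forest voting: each tree in the ensemble votes
# independently, and the plurality class wins.
tree_votes = ["churn", "stay", "churn", "churn", "stay"]
winner, count = Counter(tree_votes).most_common(1)[0]
print(winner)  # the plurality class among the trees
```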


With Boosted Trees the process is significantly different. The trees are built in series, and each tree tries to correct the mistakes of the previous ones. When we make a prediction for a regression problem, the outputs of the individual boosted trees are summed to find the final prediction. For classification, we sum up pseudo-probabilities for each class and run those sums through softmax to create the final class probabilities.
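For classification, that final aggregation step can be sketched like this; the class names and summed scores below are made up for illustration:

```python
import math

def softmax(scores):
    """Convert summed per-class scores into probabilities that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical per-class pseudo-probability sums after all iterations:
summed_scores = {"red": 2.0, "green": 1.0, "blue": 0.5}
probs = dict(zip(summed_scores, softmax(list(summed_scores.values()))))
# probs now holds the final class probabilities; the largest summed
# score yields the largest probability.
```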


Each iteration makes our boosted meta-model more complex. That additional complexity can really pay off for datasets with nuanced interactions between the input fields and the objective. It’s more powerful, but with that power comes the danger of overfitting, as boosting can be quite sensitive to noise in the data.

Many of the parameters for boosting are tools for balancing that power against the risk of overfitting. Sampling at each iteration (BigML’s ‘Ensemble Sample’), the learning rate, and the early holdout parameters are all tools to help find that balance. That’s why boosting has a lot of parameters, and the need to tune them is one of the downsides of the technique. Luckily, we have a solution on the way. We’ll be connecting Boosted Trees to our Bayesian parameter optimization library (a variant of SMAC), and then we’ll describe how to automatically pick boosting parameters in a future blog post.

Another downside to Boosted Trees is that they’re a black box. It’s pretty easy to inspect a decision tree in one of our classic ensembles and understand how it splits up the training data. With boosting, each tree fits a residual of the previous trees, making them nearly impossible to interpret individually in a meaningful way. However, just like our other tree methods, you can get a feel for what the Boosted Trees are doing by inspecting the field importance measurements. As part of BigML’s prediction service, not only do we build global field importance measures, we also report which fields were most important on a per-prediction basis.


On the advantageous side, BigML’s Boosted Trees support the missing data strategies available with our other tree techniques. If you have data that contains missing values and if those have inherent meaning (e.g. someone decided to leave ‘age’ unanswered in a personals ad), then you may explicitly model the missing values regardless of the field’s type (numeric, categorical, etc.). But if missing values don’t have any meaning, and just mean ‘unknown’, you can use our proportional prediction technique to ignore the impact of the missing fields. This technique is what we use when building our Partial Dependence Plots (or PDPs), which evaluate the Boosted Trees right in your browser to help visualize the impact of the various input fields on your predictions.


We think our Boosted Trees are already a strong addition to the BigML toolkit, but we’ll continue expanding the service to make it even more interpretable via fancier PDPs, easy to use with parameter optimization, and more powerful thanks to customized objective functions.

Want to know more about Boosted Trees?

We recommend that you visit the dedicated release page for more documentation on how to create Boosted Trees, interpret them, and predict with them through the BigML Dashboard and the API; as well as the six blog posts of this series, the slides of the webinar, and the webinar video.

Boosted Trees with WhizzML and Python Bindings


In this fifth post about Boosted Trees, we adopt the point of view of a user who feels comfortable using a programming language. If you follow this blog, you probably know about WhizzML and our bindings, which allow for programmatic usage of all of the BigML platform’s resources.


In order to easily automate the use of BigML’s Machine Learning resources, we maintain a set of bindings, which allow users to work with the platform in their favorite language. Currently, there are 9 bindings for popular languages like Java, C#, Objective-C, PHP, and Swift. In addition, last year we released WhizzML to help developers create sophisticated Machine Learning workflows and execute them entirely in the cloud, thus avoiding network problems, memory issues, or lack of computing capacity, while taking full advantage of WhizzML’s built-in parallelization. In the past, we wrote about using WhizzML to perform Gradient Boosting, and now we are making it even easier with our Winter 2017 release.

In this post, we will show how to use Boosted Trees through both the bindings and WhizzML. For our bindings example, we will use our popular Python binding, but the operations described here are available in all the bindings. Let’s wrap up the preamble and see how to create Boosted Trees without specifying any particular option, just with all default settings. We need to start from an existing Dataset to create any kind of model in BigML, so our call to the API will need to include the ID of the dataset we want to use. In addition, we’ll need to provide the boosting-related parameters. For now, let’s just use the default ones. This is achieved by setting the boosting attribute to an empty map in JSON. We would do that in WhizzML as below,
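A minimal sketch of that WhizzML call, assuming `ds1` holds the dataset ID provided as the script input:

```whizzml
;; Create a Boosted Trees ensemble with all boosting defaults:
;; an empty "boosting" map is what turns boosting on.
(define boosted-ensemble
  (create-ensemble {"dataset" ds1 "boosting" {}}))
```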

where ds1 should be a dataset ID. This ID should be provided as input to execute the script.

This is the same way you would create a decision tree ensemble, with the difference being the addition of the “boosting” parameter.

In the Python bindings, the equivalent code is:
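A minimal sketch using the `bigml` Python package; the dataset ID is a placeholder, and your credentials are assumed to be set in the environment:

```python
from bigml.api import BigML

api = BigML()  # reads BIGML_USERNAME and BIGML_API_KEY from the environment
ensemble = api.create_ensemble(
    "dataset/58c051f6983efc2710001302",  # replace with your dataset ID
    {"boosting": {}})                    # empty map = default boosting settings
api.ok(ensemble)                         # wait until the ensemble is finished
```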


Let’s now see how to customize the options of Boosted Trees. For a list of all the properties that BigML offers to customize the gradient boosting algorithm, please visit the ensembles page in the API documentation section. In a WhizzML script, the code should include the settings we want to use in a map format. For instance, if we want to adjust several of the available properties, the code would be:
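A sketch using the boosting options mentioned in this series; the values are illustrative, not recommendations:

```whizzml
;; Boosted Trees with customized boosting options plus iteration sampling.
(define boosted-ensemble
  (create-ensemble {"dataset" ds1
                    "boosting" {"iterations" 300
                                "learning_rate" 0.1
                                "early_out_of_bag" true}
                    "ensemble_sample" {"rate" 0.65
                                       "replacement" false
                                       "seed" "bigml"}}))
```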


The equivalent code in the Python bindings would read:
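The same sketch with the `bigml` Python package; the dataset ID and option values are placeholders:

```python
from bigml.api import BigML

api = BigML()
ensemble = api.create_ensemble(
    "dataset/58c051f6983efc2710001302",  # replace with your dataset ID
    {"boosting": {"iterations": 300,
                  "learning_rate": 0.1,
                  "early_out_of_bag": True},
     "ensemble_sample": {"rate": 0.65,
                         "replacement": False,
                         "seed": "bigml"}})
api.ok(ensemble)
```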


Creation arguments in Python bindings are structured as a dictionary. They are consistent with the natural dictionary representation of JSON objects in the language.

When we discussed creating Boosted Trees, we explained some of the parameters that can help you improve your results through proper tuning. It’s very easy to evaluate your Boosted Trees either through WhizzML or the Python bindings: you just need to set the ensemble to evaluate and the test dataset to be used for the evaluation.
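In WhizzML, a sketch of that evaluation, assuming `test-ds` holds the test dataset ID:

```whizzml
;; Evaluate the Boosted Trees ensemble against a held-out test dataset.
(define boosted-evaluation
  (create-evaluation {"ensemble" boosted-ensemble
                      "dataset" test-ds}))
```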


Similarly, we can use the Python syntax as follows:
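A sketch with the Python bindings; the ensemble and dataset IDs below are placeholders taken from the examples in this series:

```python
from bigml.api import BigML

api = BigML()
evaluation = api.create_evaluation(
    "ensemble/58c05480983efc2710001306",   # your ensemble ID
    "dataset/58c0543e983efc2702000c51")    # your test dataset ID
api.ok(evaluation)  # the finished resource holds the evaluation metrics
```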


Next up, let’s see how to obtain single predictions from our Boosted Trees once we are past the evaluation stage. For this, we need the ensemble ID and some input data that should be provided with the “input_data” parameter. Here’s an example:
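A WhizzML sketch of a single prediction; the input field name and value are hypothetical and depend on your dataset:

```whizzml
;; Single prediction from the ensemble for one new instance.
(define boosted-prediction
  (create-prediction {"ensemble" boosted-ensemble
                      "input_data" {"BAFTA winner" "yes"}}))
```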

The equivalent code in the Python bindings would be:
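A sketch with the Python bindings; the ensemble ID and input field are placeholders:

```python
from bigml.api import BigML

api = BigML()
prediction = api.create_prediction(
    "ensemble/58c05480983efc2710001306",  # your ensemble ID
    {"BAFTA winner": "yes"})              # hypothetical input data
api.ok(prediction)
```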


In addition to this prediction, calculated and stored on BigML’s servers, the Python bindings allow you to instantly create single local predictions on your computer. The ensemble information is downloaded to your computer the first time it is used, and as predictions are computed on your machine, there are no additional costs or latency involved. Here is the straightforward code snippet for that:
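A sketch of a local prediction with the bindings’ `Ensemble` class; the ID and input data are placeholders:

```python
from bigml.api import BigML
from bigml.ensemble import Ensemble

api = BigML()
# Downloads the ensemble information the first time it is used; after
# that, predictions are computed entirely on your machine.
local_ensemble = Ensemble("ensemble/58c05480983efc2710001306", api=api)
local_prediction = local_ensemble.predict({"BAFTA winner": "yes"})
```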


You can create batches of local predictions by using the predict method in a loop. Alternatively, you can upload to BigML the new dataset you want predictions for. In this case, results will be stored in the platform when the batch prediction process finishes. Let’s see how to realize this latter option, first in Python:
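A sketch with the Python bindings; the ensemble and dataset IDs, and the output filename, are placeholders:

```python
from bigml.api import BigML

api = BigML()
batch_prediction = api.create_batch_prediction(
    "ensemble/58c05480983efc2710001306",  # your ensemble ID
    "dataset/58c0543e983efc2702000c51",   # dataset of new instances
    {"all_fields": True})                 # include input fields in the output
api.ok(batch_prediction)
# Optionally download the results as a CSV file:
api.download_batch_prediction(batch_prediction,
                              filename="my_predictions.csv")
```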


The equivalent code to complete this batch prediction by using WhizzML can be seen below:
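A WhizzML sketch of the same batch prediction, assuming `test-ds` holds the dataset of new instances:

```whizzml
;; Batch prediction stored in the platform; the results can be
;; downloaded once the process finishes.
(define batch-prediction
  (create-batchprediction {"ensemble" boosted-ensemble
                           "dataset" test-ds}))
```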


A batch prediction comes with configuration options related to the input format, such as fields_map, which can be used to map the dataset fields to the ensemble fields when they are not identical. Other options affect the output format, like header or separator. You can provide any of these arguments at creation time following the appropriate syntax described in the API documents. We recommend that our readers check out all the batch prediction options in the corresponding API documentation section.

We hope this post has further encouraged you to start using WhizzML or some of our bindings to more effectively analyze and take action with your data in BigML. We are always open to community contributions to our existing bindings or to any new ones that you think we may not yet support.

Don’t miss our next post if you would like to find out what’s happening behind the scenes of BigML’s Boosted Trees.

To learn more about Boosted Trees, or to direct your questions about WhizzML or the bindings to us, please visit the dedicated release page for more documentation on how to create Boosted Trees, interpret them, and predict with them through the BigML Dashboard and the API; as well as the six blog posts of this series, the slides of the webinar, and the webinar video.

Programming Boosted Trees

In this, the fourth of our blog posts for the Winter 2017 release, we will explore how to use Boosted Trees from the API. Boosted Trees are the latest supervised learning technique in BigML’s toolbox. As we have seen, they differ from more traditional ensembles in that no tree tries to make a correct prediction on its own, but rather is designed to nudge the overall ensemble towards the correct answer.

This post will be very similar to our second post about using Boosted Trees in the BigML Dashboard. Anything that can be done from the Dashboard can be done with our API. Resources created using the BigML API can all be seen in the Dashboard view as well so you can take full advantage of our visualizations.


If you have never used the API before, you will need to go through a quick setup. Simply set the environment variables BIGML_USERNAME, BIGML_API_KEY and BIGML_AUTH. BIGML_USERNAME is just your username. Your BIGML_API_KEY can be found in the Dashboard by clicking on your username to pull up the Account page, and then clicking on API Key. BIGML_AUTH is set as a combination of the two:
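A sketch of that setup for a bash-like shell; the username and API key below are placeholders:

```shell
# Set these once (e.g. in your .bashrc); the values are placeholders.
export BIGML_USERNAME=my_username
export BIGML_API_KEY=4d992b1807f2f7e6575a6f558af979e83cf25a8e
export BIGML_AUTH="username=$BIGML_USERNAME;api_key=$BIGML_API_KEY"
```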


1. Upload Your Data

Just as with the Dashboard, your first step is uploading some data to be processed. You can point to a remote source, or upload directly from your computer in a variety of popular file formats.

To do this, you can use the terminal with curl, or any other command-line tool that can make HTTPS requests. In this example, we are uploading ‘oscars.csv’, the local file we used in our last blog post.

curl "https://bigml.io/source?$BIGML_AUTH" \
       -F file=@oscars.csv

2. Create a Dataset

A BigML dataset resource is a serialized form of your data, with some simple statistics already calculated and ready to be processed by Machine Learning algorithms. To create a dataset from your uploaded data, use:

curl "https://bigml.io/dataset?$BIGML_AUTH" \
       -X POST \
       -H 'content-type: application/json' \
       -d '{"source": "source/58c05080983efc27100012fd"}'

In order to know whether we are creating a meaningful Boosted Trees model, we need to split this dataset into two parts: a training dataset to create the model, and a test dataset to evaluate how the model is doing. We will need two more commands to do just that:

curl "https://bigml.io/dataset?$BIGML_AUTH" \
       -X POST \
       -H 'content-type: application/json' \
       -d '{"origin_dataset": "dataset/58c051f6983efc2710001302",
            "sample_rate": 0.8, "seed": "foo"}'
curl "https://bigml.io/dataset?$BIGML_AUTH" \
       -X POST \
       -H 'content-type: application/json' \
       -d '{"origin_dataset": "dataset/58c051f6983efc2710001302",
            "sample_rate": 0.8, "out_of_bag": true, "seed": "foo"}'

This is pretty similar to how we created our dataset, with some key differences. First, since we are creating these datasets from another dataset, we need to use “origin_dataset”. We are sampling at a rate of 80% for the first training dataset, and then setting “out_of_bag” to true to get the other 20% for the second test dataset. The seed is arbitrary, but we need to use the same one for each dataset.

3. Create Your Boosted Trees

Using the training dataset, we will now make an ensemble. A BigML ensemble will construct Boosted Trees if it is passed a “boosting” parameter, a map of boosting options. In the example below, “boosting” will use ten iterations with a learning rate of 10%. BigML automatically picks the last field of your dataset as the objective field. If this is incorrect, you will want to explicitly pass the objective field id.

curl "https://bigml.io/ensemble?$BIGML_AUTH" \
       -X POST \
       -H 'content-type: application/json' \
       -d '{"dataset": "dataset/58c053ac983efc2708000bbf",
            "objective_field": "000013",
            "boosting": {"iterations": 10, "learning_rate": 0.10}}'

Some other parameters for Boosting include:

  • early_holdout: The portion of the dataset that will be held out for testing at the end of every iteration. If no significant improvement is made on the holdout, Boosting will stop early. The default is zero.
  • early_out_of_bag: Whether Out of Bag samples are tested after every iteration and may result in an early stop if no significant improvement is made. To use this option, an “ensemble_sample” must also be requested. The default is true.
  • ensemble_sample: The portion of the input dataset to be sampled for each iteration in the ensemble. The default rate is 1, with replacement true.

For example, we will try setting “early_out_of_bag” to true. To do this, we will also have to set an “ensemble_sample”, say to 65%. This looks like:

curl "https://bigml.io/ensemble?$BIGML_AUTH" \
       -X POST \
       -H 'content-type: application/json' \
       -d '{"dataset": "dataset/58c053ac983efc2708000bbf",
            "objective_field": "000013",
            "boosting": {"iterations": 10, "learning_rate": 0.10,
                         "early_out_of_bag": true},
            "ensemble_sample": {"rate": 0.65, "replacement": false, "seed": "foo"}}'

4. Evaluate your Boosted Trees

In order to see how well your model is performing, you will need to evaluate it against some test data. This will return an evaluation resource with a result object. For classification models, this will include accuracy, average_f_measure, average_phi, average_precision, average_recall, and a confusion_matrix for the model. So that we can be sure the model is making useful predictions, we include these same statistics for two simplistic alternative predictors: one that picks random classes and one that always picks the most common class. For regression models, we include the average_error, mean_squared_error, and r_squared. Similarly we compare regression models to a random predictor and a predictor which always chooses the mean.

curl "https://bigml.io/evaluation?$BIGML_AUTH" \
       -X POST \
       -H 'content-type: application/json' \
       -d '{"dataset": "dataset/58c0543e983efc2702000c51",
            "ensemble": "ensemble/58c05480983efc2710001306"}'

5. Make Predictions

Once you are satisfied with your evaluation, you can create one last Boosted Trees model with your entire dataset. Now it is ready to make predictions on some new data. This is done in similar fashion to other BigML models.

curl "https://bigml.io/batchprediction?$BIGML_AUTH" \
       -X POST \
       -H 'content-type: application/json' \
       -d '{"ensemble": "ensemble/58c05480983efc2710001306",
            "dataset": "dataset/58c0543e983efc2702000c51"}'

In our next post of the series, we will see how to automate these steps with WhizzML, BigML’s domain-specific scripting language, and the Python Bindings.

To find out exactly how Boosted Trees work, please visit the dedicated release page for more documentation on how to create Boosted Trees, interpret them, and predict with them through the BigML Dashboard and the API; as well as the six blog posts of this series, the slides of the webinar, and the webinar video. 

Boosting the Oscars

In this blog post, the third one in our six post series on Boosted Trees, we will bring the power of Boosted Trees to a specific example. As we have seen in our previous post, Boosted Trees are a form of supervised learning that combine every tree in the ensemble additively to answer classification and regression problems. With BigML’s simple and beautiful dashboard visualizations, we’ll revisit our answer to who will win the Oscar for Best Actor.

The Data

Already engineered for our recent Oscar predictions post, we took data from many sources, particularly including many related awards, to see if we can answer one of the biggest questions in Hollywood: who will win at the Oscars this year? We did generally well with our Random Decision Forests. Of the eight categories we attempted, we got five correct and another two were knife’s edge calls between the winner and our picks. But can we do even better with Boosted Trees?

The Chart

One major way Boosted Trees differ from Random Decision Forests is that there are more parameters that can be changed. This is both powerful, as we can tune the trees to exactly what we want, but also intimidating, as there are so many knobs to turn! In a future blog post, we will show how to automatically choose those parameters. In this example, however, we will be working with the iterations slider.

As we have seen, Boosted Trees work by using every iteration to improve on the previous one. It may seem like more iterations are always better; however, this is not always the case. In some cases, we could slowly be stepping toward some optimal answer, but our improvements are so slight with each iteration that they’re not worth the time invested in them. So how do we know when to stop? That’s what early stopping does for us. BigML has two forms of early stopping, Holdout and Out of Bag. Holdout reserves some subset of the training data to evaluate how far we have come with each iteration. If the improvement is minimal, the ensemble stops building. It then reruns using all of the data for the chosen number of iterations. Out of Bag uses some of the training data that is not currently being used to build this iteration to gauge the improvement. It is faster than Holdout early stopping, in general, but because it reuses data that was used for training in earlier iterations, it is not as clean a test.

In this example, we chose just 10 iterations with a learning rate of 30%. In general, lower learning rates can help find the best solutions, but need more iterations to get there. Our example also uses the Out of Bag early stopping option.


With the Ensemble Summary Report we can see that the two most important fields to this decision are the number of Oscar Categories Nominated and whether it had a Best Actor Nomination.

With the field importance chart, we can also see what other categories are important: Reviews, BAFTA winner, Screen Actors Guild winner, and LA Film Critics Association nominee. We can already see an aberration with this model; clearly an actor must be nominated for best actor to win the award. So we’d expect that to be the most important field, not the second.

Looking at the PDP, we see it is broken into four main sections. The two bluish sections are where the probability is greatest that the movie doesn’t win a Best Actor award, while the red sections are where the probability is that it does. Again, something strange is going on here. The upper right quadrant is coded red which means the model believes an actor could win the award even without a nomination!

Let’s create a different Boosted Trees ensemble, this time with 500 iterations and a 10% learning rate. As before, we will employ tree sampling at 65%, building each iteration on a subset of the total training data. For classification problems, there is one tree per class per iteration; for regression problems, just one tree per iteration.


Already we see an improvement. Whether the film is nominated for a Best Actor Oscar is now the most important field. The other top fields include whether it won a Screen Actors Guild award for Best Actor, User Reviews, and its overall rating. This is very different from our first example, which relied heavily on other awards. We also see, as we expect, that movies that didn’t get nominated will not get a best actor award.


But what exactly do our Boosted Trees predict? Looking just at the more promising second model, we can create a Batch Prediction with the movies data just from 2016.


In order to get the probabilities of each row, we will go under Configure, and then Output Settings to select the percent sign icon. This will add two columns to our output dataset, one for each class in our objective field: the probability that the movie wins a Best Actor Oscar and the probability that it does not. This way, we can see not only whether the model predicts a win, but also by how much.


Our Boosted Trees predict… drumroll please… four different actors might win the Oscar! That is, four different actors have a very good chance of winning. Let’s see who we have: Ryan Gosling in La La Land, Denzel Washington in Fences, Andrew Garfield in Hacksaw Ridge, and finally Casey Affleck in Manchester by the Sea.

Looking at the normalized probabilities, all four of these candidates are within a few percent of each other, with Mr. Affleck perhaps the furthest behind. No wonder our model picked four winners! And no wonder we had such a hard time predicting the win with our Random Decision Forest. The race was simply too close to call until the big night.

In the next post, we will see how to create Boosted Trees from the BigML API.

Would you like to know more about Boosted Trees? Please visit the dedicated release page for more documentation on how to create Boosted Trees, interpret them, and predict with them through the BigML Dashboard and the API; as well as the six blog posts of this series, the slides of the webinar, and the webinar video. 

The Six Steps to Boosted Trees

BigML is bringing Boosted Trees to our ever-growing suite of supervised learning techniques. Boosting is a variation on ensembles that aims to reduce bias, potentially leading to better performance than Bagging or Random Decision Forests.

In our first blog post of this series of six posts about Boosted Trees, we saw a gentle introduction to Boosted Trees to get some context about what this new resource is and how it can help you solve your classification and regression problems. This post will take us further, into the detailed steps of how to use boosting with BigML.


Step 1: Import Your Data

To learn from our data, we must first upload it. There are several ways to upload your data to BigML. The easiest is to navigate to the Dashboard and click on the sources tab on the far left. From there you can create a source by importing from Google Drive, Google Storage, Dropbox, or MS Azure. If your dataset is not terribly large, creating an inline source by directly typing in the data may appeal to you. You can also create a source from a remote URL, or by uploading a local file (of format .csv, .tsv, .txt, .json, .arff, .data, .gz, or .bz2).


Step 2: Create Your Dataset

Once a file is uploaded as a source, it can be turned into a dataset. From your Source view, use the 1-click Dataset option to create a dataset, a structured version of your data ready to be used by a Machine Learning algorithm.


In the dataset view, you will be able to see a summary of your field values, some basic statistics and the field histograms to analyze your data distributions. This view is really useful to see any errors or irregularities in your data. You can filter the dataset by several criteria and even create new fields from your existing data.


Once your data is free of errors you will need to split your dataset into two different subsets: one for training your Boosted Trees, and the other for testing. It is crucial to train and evaluate supervised learning models with different data to get a true evaluation and not be tricked by overfitting. You can easily split your dataset using the BigML 1-click option or the configure option menu, which randomly splits 80% of the data for training and sets aside 20% for testing.


Step 3: Create Your Boosted Trees

To create Boosted Trees, make sure you are viewing the training split of your dataset, then click on Configure Ensemble under the configure option menu. By default, the last field of your dataset is chosen as the objective field, but you can easily change this with the dropdown on the left. To enable boosting, under Type choose Boosted Trees. This will open up the Boosting tab under Advanced Configuration.


You can, of course, now use the default settings and click on Create Ensemble. But Machine Learning is at its most powerful when you, the user, bring your own domain-specific knowledge to the problem. You will get the best results if you ‘turn some knobs’ and alter the default settings to suit your dataset and problem (in a later blog post we’ll discuss automatically finding good parameters).


BigML offers many different parameters to tune. One of the most important is the number of iterations. This controls how many individual trees will be built; one tree per iteration for regression and one tree per class per iteration for classification.

Other parameters that can be found under Boosting include:

  • Two forms of early stopping: These will keep the ensemble from performing all the iterations, saving running time and perhaps improving performance. Early Holdout tries to find the optimal stopping time by completely reserving a portion of the data to test at each iteration for improvement, while Early Out of Bag simply tests against the out-of-bag data (data not used in the tree sampling).
  • The Learning Rate: The default is 10%; the learning rate controls how far to step in the gradient direction. In general, a smaller step size will lead to more accurate results, but will take longer to get there.

Another useful parameter to change is found under Tree Sampling:

  • The Ensemble Rate option ensures that each tree is only created with a subset of your training data, and generally helps prevent overfitting.
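To build intuition for how the number of iterations and the learning rate interact, here is a self-contained toy sketch of boosting for regression with depth-1 “stumps”. It is not BigML’s implementation, and the data and settings are made up:

```python
# Toy gradient boosting for regression: each iteration fits a stump to
# the residuals of the ensemble so far, and the learning rate shrinks
# every correction. Illustrative only.

def fit_stump(xs, residuals):
    """Return the single-threshold predictor that best fits the residuals."""
    best = None
    for threshold in xs:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - (lmean if x <= threshold else rmean)) ** 2
                  for x, r in zip(xs, residuals))
        if best is None or err < best[0]:
            best = (err, threshold, lmean, rmean)
    _, threshold, lmean, rmean = best
    return lambda x, t=threshold, l=lmean, r=rmean: l if x <= t else r

def boost(xs, ys, iterations, learning_rate):
    stumps = []
    predictions = [0.0] * len(xs)
    for _ in range(iterations):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Each stump nudges the prediction by a learning_rate-sized step.
        predictions = [p + learning_rate * stump(x)
                       for p, x in zip(predictions, xs)]
    return lambda x: sum(learning_rate * s(x) for s in stumps)

xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
model = boost(xs, ys, iterations=50, learning_rate=0.3)
```

With a smaller learning rate, the same fit would need more iterations, which is the trade-off the Learning Rate slider controls.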

Step 4: Analyze Your Boosted Trees

Once your Boosted Trees are created, the resource view will include a visualization called a partial dependence plot, or PDP. This chart ignores the influence of all but the two fields displayed on the axes. If you want other fields to influence the results, you can select them by checking the box in the input fields section or by making them an axis.


The axes are initially set to the two most important fields. You can change the fields at any time by using the dropdown menus near the X and Y. Each region of the grid is colored based on the class and probability of its prediction. To see the probability in more detail, mouse over the grid and the exact probability appears in the upper righthand area.

Step 5: Evaluate Your Boosted Trees

But how do you know if your parameters are indeed tuned correctly? You need to evaluate your Boosted Trees by comparing their predictions with the actual values in your test dataset.


To do this, in the ensemble view click on Evaluate under the 1-click action menu. You can change the dataset to evaluate it against, but the default 20% test dataset is perfect for this procedure. Click on Evaluate to execute and you will see the familiar evaluation visualization, dependent on whether your problem was a classification or regression.


Step 6: Make Your Predictions

When you have results you are happy with, it’s time to make some predictions. Create one more Boosted Trees ensemble with the parameters set the way you like, but this time build it from the entire dataset. This way, all your data is informing your predictions.

Boosted Trees differ from our other ensemble predictions because they do not return confidence (for classification) but rather the probabilities for all the classes in the objective field.

Now you can make a prediction on some new data. Just as with BigML’s previous supervised learning models, you can make a single prediction for just one instance, or a batch prediction for a whole dataset.


In the ensemble view, click on Prediction (or Batch Prediction) under the 1-click action menu. The left hand side will already have your Boosted Trees. Choose the dataset you wish to run your prediction on from the dropdown on the right. You can, of course, customize the name and prediction output settings. Scroll down to click on Predict to create your prediction.


In the next post, we will see these six steps in action when BigML takes boosting to the Oscars. Stay tuned!

Would you like to find out exactly how Boosted Trees work? Please visit the dedicated release page for more documentation on how to create Boosted Trees, interpret them, and predict with them through the BigML Dashboard and the API; as well as the six blog posts of this series, the slides of the webinar, and the webinar video. 

Introduction to Boosted Trees

We are happy to share that BigML is bringing Boosted Trees to the Dashboard and the API as part of our Winter 2017 Release. This newest addition to our ensemble-based strategies is a supervised learning technique that can help you solve your classification and regression problems even more effectively.

To best inform you about our impending launch, we have prepared a series of blog posts to help you get a good understanding of Boosted Trees prior to the official release. Today, we start with the basic concepts. Subsequent posts will gradually dive deeper to help you become a master of this new resource: demonstrating how to use this technique with the BigML Dashboard, presenting a use case that will help you discern when to apply Boosted Trees, showing how to use Boosted Trees with the API, and explaining how to properly automate it all with WhizzML. Finally, we will conclude with a detailed technical view of how BigML Boosted Trees work under the hood.

Let’s begin our Boosted Trees journey!

Why Boosted Trees?

First of all, let’s recap the single decision tree and ensemble strategies BigML offers to solve classification and regression problems, and find out which technique is most appropriate depending on the characteristics of your dataset.

  • BigML Models were the first technique BigML implemented. They use a proprietary decision tree algorithm based on the Classification and Regression Trees (CART) algorithm proposed by Leo Breiman. Single decision trees are composed of nodes and branches that form a model of decisions in a tree graph. The nodes represent the predictors or labels that have an influence on the predictive path, and the branches represent the rules the algorithm follows to make a given prediction. Single decision trees are a good choice when you value the human interpretability of your model: unlike many Machine Learning techniques, individual decision trees are easy for a human to inspect and understand.

  • Bagging (or Bootstrap Aggregating), the second prediction technique brought to the BigML Dashboard and API, uses a collection of trees (rather than a single one), each tree built with a different random sample of the original dataset. Specifically, BigML defaults to a sampling rate of 100% (with replacement) for each model, which means some of the original instances will be repeated and others left out. Bagging performs well when a dataset has many noisy features and only one or two are relevant; in those cases it is often the best option.

  • Random Decision Forests extend the Bagging technique by only considering a random subset of the input fields at each split of the tree. By adding randomness in this process, Random Decision Forests help avoid overfitting. When there are many useful fields in your dataset, Random Decision Forests are a strong choice.
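To make the sampling behind Bagging concrete, here is a minimal pure-Python sketch of drawing a 100% sample with replacement. This is an illustration, not BigML's actual implementation; the function name and signature are our own.

```python
import random

def bootstrap_sample(instances, rate=1.0, seed=None):
    """Draw a sample of the dataset with replacement (the default rate is 100%)."""
    rng = random.Random(seed)
    n = int(len(instances) * rate)
    return [rng.choice(instances) for _ in range(n)]

data = list(range(10))
sample = bootstrap_sample(data, seed=42)
# Because we sample with replacement, some instances typically repeat
# and others are left out, even though the sample is as large as the dataset.
left_out = set(data) - set(sample)
```

Each tree in a Bagging ensemble is trained on a different such sample, which is what decorrelates the trees' mistakes.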

In Bagging or Random Decision Forests, the ensemble is a collection of models, each of which tries to predict the same field: the problem’s objective field. So, depending on whether we are solving a classification or a regression problem, our models will have a categorical or a numeric objective field. Each model is built on a different sample of the data (and, for Random Decision Forests, a different random subset of fields at each split), so their predictions will vary somewhat. Finally, the ensemble issues a prediction by using the component models' predictions as votes, aggregating them through different strategies (plurality, confidence-weighted, or probability-weighted).
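Two of these aggregation strategies can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not BigML's internal code:

```python
from collections import Counter

def plurality_vote(predictions):
    """Plurality: each model casts one vote; the most common class wins."""
    return Counter(predictions).most_common(1)[0][0]

def confidence_weighted_vote(predictions):
    """Confidence-weighted: each model votes with its confidence.

    `predictions` is a list of (class, confidence) pairs; the class with
    the largest summed confidence wins."""
    totals = {}
    for cls, conf in predictions:
        totals[cls] = totals.get(cls, 0.0) + conf
    return max(totals, key=totals.get)

plurality_vote(["churn", "stay", "churn"])  # -> "churn"
# "stay" wins despite fewer raw votes: 0.4 + 0.4 > 0.5
confidence_weighted_vote([("churn", 0.5), ("stay", 0.4), ("stay", 0.4)])  # -> "stay"
```

Probability-weighted voting works the same way as the confidence-weighted version, just with per-class probabilities in place of confidences.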

The boosting ensemble technique is significantly different. To begin with, the ensemble is a collection of models that do not predict the real objective field of the ensemble, but rather the improvements needed for the function that computes this objective. As shown in the image above, the modeling process starts by assigning some initial values to this function, and then creates a model to predict which gradient will improve the function's results. The next iteration takes both the initial values and these corrections as its starting state, and looks for the next gradient to improve the prediction function's results even further. The process stops when the prediction function's results match the real values or when the number of iterations reaches a limit. As a consequence, all the models in the ensemble have a numeric objective field: the gradient for this function. The real objective field of the problem is then computed by adding up the contributions of each model, weighted by some coefficients. If the problem is a classification, each category (or class) in the objective field has its own subset of models in the ensemble whose goal is to adjust the function to predict that category.
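The iterative process above can be sketched for regression with squared loss, where the gradient each weak model fits is simply the residual (actual minus current prediction). This is a toy sketch with a one-split "stump" as the weak learner, under our own simplifying assumptions; BigML's actual trees and coefficients are more sophisticated:

```python
def fit_stump(xs, residuals):
    """Find the single threshold split on x that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        err = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, threshold, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean

def boost(xs, ys, iterations=5, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(ys)
    for _ in range(iterations):
        residuals = [y - p for y, p in zip(ys, preds)]  # negative gradient
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    # The final prediction adds up every weighted correction.
    return lambda x: base + sum(learning_rate * s(x) for s in stumps)

xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 3.2]
predict = boost(xs, ys, iterations=20)
```

Note how no individual stump predicts `y` itself; each one only predicts a correction, and the ensemble's output is the sum of all of them on top of the initial value.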

In Bagging and Random Decision Forests, each tree is independent of the others, which makes them easy to construct in parallel. Since each boosted tree depends on the previous trees, a Boosted Trees ensemble is inherently sequential. Nonetheless, BigML parallelizes the construction of each individual tree, so even though boosting is a computation-heavy method, we can train Boosted Trees relatively quickly.

How do Boosted Trees work in BigML?

Let’s illustrate how Boosted Trees work with a dataset that predicts the unemployment rate in the US. Before creating the Boosted Trees ensemble, we split the dataset into two parts: one for training and one for testing. Then we train a Boosted Trees ensemble with 5 iterations using the training subset of our data. Thanks to the Partial Dependence Plot (PDP) visualization, we can observe that the higher the “Civilian Employment Population Ratio”, the lower the unemployment rate, and vice versa.

But can we trust this prediction? To get an objective measure of how well our ensemble is predicting, we evaluate it in BigML like any other supervised model. The performance of our first attempt is not as good as it could be. A simple way to try to improve it is to increase the number of iterations the boosting algorithm performs. Remember that, in our first attempt, we set that number to the very conservative value of 5. Let’s create another Boosted Trees ensemble, this time with 400 iterations. Voilà! If we compare the evaluations of both models, as shown in the image below, we observe a very significant performance boost: both lower absolute and squared errors and a higher R squared value for the ensemble with 400 iterations.
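The regression metrics used in these evaluations are straightforward to compute by hand. Here is a small sketch with made-up numbers (not the actual unemployment figures) showing why more iterations can lower the errors and raise R squared:

```python
def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def r_squared(actual, predicted):
    """1 minus the ratio of residual variance to total variance."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual = [4.7, 4.9, 5.0, 5.3]
weak   = [5.0, 5.0, 5.0, 5.0]  # e.g. an under-trained model stuck near the mean
strong = [4.8, 4.9, 5.1, 5.2]  # e.g. a model trained with many more iterations
```

A model that always predicts something near the mean gets an R squared near (or below) zero, while the closer predictions earn both a lower absolute error and a higher R squared, which is exactly the pattern we saw when moving from 5 to 400 iterations.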

Stay tuned for the upcoming posts to find out how to create, interpret, and evaluate Boosted Trees in BigML in order to make better predictions.

In Summary

To wrap up this blog post we can say that Boosted Trees:

  • Are a variation of tree ensembles, where the tree outputs are additive rather than averaged (or majority voted).
  • Do not try to predict the objective field directly. Instead, they try to fit a gradient by correcting mistakes made in previous iterations.
  • Are very useful when you have a lot of data and you expect the decision function to be very complex. The effect of additional trees is basically an expansion of the hypothesis space beyond other ensemble strategies like Bagging or Random Decision Forests.
  • Help solve both classification and regression problems, such as: churn analysis, risk analysis, loan analysis, fraud analysis, sentiment analysis, predictive maintenance, content prioritization, next best offer, lifetime value, predictive advertising, price modeling, sales estimation, patient diagnoses, or targeted recruitment, among others.

For more examples on how Boosted Trees work we recommend that you read this blog post as well as this alternative explanation, which contains a visual example.

Want to know more about Boosted Trees?

Please visit the dedicated release page for more documentation on how to create Boosted Trees, interpret them, and predict with them through the BigML Dashboard and the API; as well as the six blog posts of this series, the slides of the webinar, and the webinar video.

BigML Winter 2017 Release and Webinar: Boosted Trees!

BigML’s Winter 2017 Release is here! Join us on Tuesday, March 21, at 10:00 AM PDT (Portland, Oregon. GMT -07:00) / 06:00 PM CET (Valencia, Spain. GMT +01:00) for a FREE live webinar to discover the enhanced version of BigML! We’ll be announcing BigML’s Boosted Trees, the third ensemble-based strategy that BigML provides to help you easily solve your classification and regression problems.


Together with Bagging and Random Decision Forests, Boosted Trees make a powerful addition to both the BigML Dashboard and our REST API. With Boosted Trees, tree outputs are additive rather than averaged (or decided by majority vote). Individual trees in a Boosted Trees ensemble differ from trees in Bagging or Random Decision Forest ensembles in that they do not try to predict the objective field directly. Instead, they try to fit a ‘gradient’ to correct the mistakes made in previous iterations. This unique technique, where each tree improves on the imperfect predictions of the previously grown trees, lets you predict both categorical and numeric fields.


This latest addition to BigML’s toolset is visualized with a Partial Dependence Plot (PDP), a common method for visualizing and interpreting the marginal impact of a set of input fields on the ensemble predictions, irrespective of the rest of the input variables, as well as their interactions with the other input fields. BigML’s Boosted Trees also contain an importance attribute that lists each field’s importance in the same format used by the rest of our models and ensemble types. This lets you inspect and analyze the features that matter most when predicting the objective field.


Just like the other BigML supervised learning models, Boosted Trees offer Single Predictions to predict a given single instance and Batch Predictions to predict multiple instances simultaneously. And now all our classification ensembles, from single trees to Boosted Trees, will return not just a single class along with its confidence, but also a set of probabilities for the rest of the classes in the objective field. What is more, each class probability will be shown in the predictions histogram.

Would you like to find out exactly how Boosted Trees work? Join us on Tuesday, March 21, at 10:00 AM PDT (Portland, Oregon. GMT -07:00) / 06:00 PM CET (Valencia, Spain. GMT +01:00). Be sure to reserve your FREE spot today as space is limited! Following our tradition, we will also be giving away BigML t-shirts to those who submit questions during the webinar. Don’t forget to request yours!

PreSeries’ Algorithm Chooses Pixoneye as the Startup Most Likely to Succeed

PreSeries, the joint venture between Telefónica Open Future_ and BigML, staged the fourth edition of the Artificial Intelligence Startup Battle yesterday (Tuesday, February 28). The event took place on the main stage of 4 Years From Now (4YFN), the startup-focused platform of the Mobile World Congress that enables investors and corporations to connect with successful entrepreneurs to launch new ventures together. More than 500 attendees witnessed this unique battle, where no humans were involved in assessing the contestants. Instead, the Machine Learning algorithm of PreSeries chose the winner.

This fourth edition followed in the footsteps of the previous AI Startup Battles: the first held in Valencia on March 15, 2016, the second in Boston on October 12, 2016, and the third in Sao Paulo on December 9, 2016. PreSeries’ algorithm asked a number of dynamically selected questions of each contender in order to produce a score between 0 and 100. The startup with the highest score won the contest, as the system deemed it the one with the highest likelihood of future success. The predictions are based on historical data from more than 350,000 companies from around the world.


Pixoneye, the winner of the Artificial Intelligence Startup Battle at the 4 Years From Now conference. Ana Segurado, the Global Manager of Telefónica Open Future_, presents the award (left) to Pixoneye’s Erin Bronstein (right).

With a score of 96.63, the winner was announced as Pixoneye. Pixoneye is based in London and Tel Aviv, and offers its clients the ability to analyze the untapped power of mobile users’ photo galleries. The second-place finisher, with a score of 94.00, was an English company that develops chatbots that revolutionise customer interactions for businesses. The third position, with 67.82 points, went to a London company that gives people ownership of their data to enable the next phase in the evolution of human connectivity. Finally, the fourth-placed contestant (with a score of 61.23) was Descifra of Mexico, which helps businesses understand the characteristics of the markets around them through easy-to-understand charts, tables, and maps.

As in previous battles, the audience enthusiastically warmed up to the idea of an AI system judging the contestants after the pitch sessions and was excited to witness Pixoneye being crowned the latest AI Startup Battle winner. 

2017 Oscar Predictions Post-mortem

Over the weekend, the eventful 89th Academy Awards ceremony wrapped up. Despite the underwhelming viewership of the televised event, it will likely be remembered for a long time for the remarkable mishap at the end.

Oscars 2017 Confusion

As for our predictions, we got 5 out of 8 right. While not phenomenal, it wasn’t such a bad performance given that this was our first stab at this domain. With that said, it is important to see where and why we failed so we can improve next year.

Best Movie: Moonlight

The Best Movie prediction is hard to explain within the confines of our dataset. We surely were not the only ones who did not see this one coming, given that even PricewaterhouseCoopers fumbled.

We still feel that La La Land had all the ingredients that have historically given a film the win in this category. Perhaps (though not a guarantee) if we had included the Independent Film Awards data, we may have predicted Moonlight as the winner by a slight edge. But the most plausible reason we did not predict Moonlight is that we had no variables accounting for the socio-psychological aspects of the awards. The changes to the voting body of The Academy in response to campaigns like #OscarsSoWhite and the sustained criticism of the Academy’s conservatism may have indeed made just enough of a difference in the final decision between two well-deserving candidates.

Best Actor: Casey Affleck, Manchester by the Sea

Our Best Actor prediction miss can be explained by the fact that Casey Affleck and Denzel Washington split the prizes that historically have the highest predictive power. Denzel Washington won the Screen Actors Guild Award and was nominated for all the other awards considered by the model, while Affleck won the Golden Globe, BAFTA, Critics’ Choice, and Online Film & TV Association awards, among four others. Although Casey Affleck won many more awards, the Screen Actors Guild Award had a particularly high importance in all models. Two of our models, with different weights for each of those prizes, gave us different predictions. Both models had exactly the same evaluation performance (100% accuracy). Denzel Washington’s prediction had a higher confidence, so we went with Denzel Washington, but we could have just as easily chosen Casey Affleck.

Best Adapted Screenplay: Moonlight

This was the most difficult category to predict, as it was hard to infer a general pattern that applied consistently over history. Arrival was nominated for the BAFTA (which Lion won), which shows up as the most important variable, and it also won the Critics’ Choice and Writers Guild awards. However, the eventual winner, Moonlight, was nominated for Best Original Screenplay at the BAFTAs, not this category. Go figure!

Finally, Moonlight sneaked in and pulled this one off from our pick of Arrival, perhaps as a result of the halo effect of the overall popular support for this year’s low-budget wonder. We don’t have any variables that account for such correlations among the various categories, but extreme fragmentation of the awards in a given year is less common: if one thinks a given movie is the best of the season, one is more likely to attribute that to multiple related factors that together make for a good movie.

We’ll chew on these lessons for next year’s predictions. In the meanwhile, happy movie watching and Machine Learning modeling endeavors to all of you!
