
2ML: Discover the Applications of Machine Learning for your Business

Machine Learning is rapidly transforming all business sectors. The right application of Machine Learning can turn business data into actionable insights, helping enterprises grow as decision makers more consistently make the right decisions at the right time.

The continuous evolution of predictive applications allows companies to foresee what’s about to happen next. Yet this is only half the challenge, as businesses also need to decide what to do with the distilled insights. In some cases, highly skilled humans are replaced by machines that can perform certain complicated tasks better and more efficiently. In other cases, combining Machine Learning with human judgment leads to the best outcomes.

To raise awareness about the optimal ways to incorporate Machine Learning in your business processes, the innovative consultancy Barrabés and BigML are co-organizing 2ML, the Machine Learning event for decision makers, technology professionals, and other industry practitioners who are interested in boosting their work productivity by applying Machine Learning techniques.

2ML will bring together 400 attendees to hear from some of the brightest minds in the Machine Learning field, such as Professor Thomas Dietterich and Susanna Pirttikangas, from the University of Oulu, among others.

2ML Agenda

The conference will start with a global view of the origin, present and future of Machine Learning in the corporate environment, explaining how and why Machine Learning will have an impact on your business. In the afternoon, after the lunch break, we will continue with four parallel sessions on the application of Machine Learning in various industries such as Finance, Telecom, Technology, Marketing, Sales, Sports, and Industry. Each vertical will have two different presentations and a panel, where the speakers will discuss the impact of Machine Learning in their sector and will answer questions from the audience.

For more information on the talks, speakers and panels from the morning sessions and the four verticals, please visit the dedicated event page.

Want to know more about #2ML17?

Discover the impact that Machine Learning is going to have on your business and find out how you can take advantage of it to take your business to the next level. Join us at #2ML17 on May 11 in Madrid, Spain. Please note that purchasing your ticket before April 24 will get you a 30% discount!

PAPIs Connect 2017 – Call for Proposals

PAPIs Connect, Latin America’s 1st conference on real-world Machine Learning applications, comes to São Paulo, Brazil, on June 21-22, 2017. We are now calling for proposals to select the best talks, ideas and applications to be showcased at the Telefonica Auditorium.

PAPIs Connect is a series of more localized events that run in between the annual editions of PAPIs, the International Conference on Predictive Applications and APIs. PAPIs Connect attracts decision makers and developers who are interested in building real-world intelligent applications and want to find out about the latest technology.

Are you passionate about technology and predictive applications? Would you like the world to know of your contribution to the practice of Artificial Intelligence? Then, this is your place to be!

We are always excited to see practical presentations about:

  • Innovative Machine Learning use cases
  • Challenges and lessons learnt in integrating Machine Learning into various applications / processes / businesses and new areas; this can include technical and domain-specific challenges, as well as those related to fairness, accountability, transparency, privacy, etc.
  • Techniques, architectures, infrastructures, pipelines, frameworks, API design to create better predictive / intelligent applications (from embedded to web-scale)
  • Tools to democratize Machine Learning and make it easier to build into products
  • Needs, trends, opportunities in this space
  • Tutorials that teach a specific and valuable skill

If you think you would be a good candidate, please submit your proposal before April 23, 12:00 AM (São Paulo BRT / GMT -3) and share your story with a great audience that will appreciate innovation as much as you do.

AI Startup Battle

The attendees at PAPIs Connect will also enjoy the 5th edition of our Artificial Intelligence Startup Battle, where there will not be any human intervention. The jury is an algorithm that will predict the probability of success of the early-stage startups competing on stage. To learn more about the format of the battle, please check our previous AI Startup Battles: the world premiere took place in Valencia, Spain, in March 2016 at PAPIs Connect; the second one in Boston, US, in October 2016 at PAPIs; the third one in São Paulo, Brazil, in December 2016; and the fourth one in Barcelona, Spain at 4YFN.

These competitions have been powered by PreSeries, the automated platform to discover, evaluate, and monitor early stage investments, a joint venture between BigML and Telefónica Open Future_. We will announce all the details for the fifth edition soon. Stay tuned!

Previous PAPIs and PAPIs Connect Conferences

The PAPIs conference series started in November 2014 in Barcelona. Since then, PAPIs and PAPIs Connect have traveled around the world, providing interesting talks to a distinguished audience in Boston, Sydney, Barcelona, Paris and Valencia. The next stops will be São Paulo in June 2017 and Boston in October 2017. Please visit the PAPIs website for further details.

SAIC Motor takes a strategic stake in BigML

Today, we are happy to share BigML has secured a strategic investment from SAIC Capital, the corporate venture of SAIC Motor Corporation Limited (SAIC Motor) – the $110B company that leads the automotive design and manufacturing industry in China. As part of the investment, Tao Wang, Director of Investment, is joining BigML’s board of directors.


This is an important milestone in BigML’s journey that started back in 2011 in Corvallis, Oregon, the home of Oregon State University (OSU). Since our inception, we have been making Machine Learning easy and beautiful for everyone by steadily advancing our platform. BigML now reaches over 40,000 analysts, developers and scientists in 120 countries around the world, who are discovering the hidden insights in their data and building intelligent applications.

SAIC Motor’s investment in BigML further proves that leading global enterprises see Machine Learning as a key enabler of their future competitive performance. The future belongs to the predictive businesses with operations that are increasingly run by automated processes that are powered by Machine Learning. It is now clearer than ever that this is not a matter of choice, but an imperative across a broad swath of industries.

In 2016, SAIC Motor sold 6.4 million vehicles, continuing to lead the Chinese market. SAIC Motor is a Fortune Global 500 company, rising to 46th place in 2015 and making the list of the world’s largest companies for the 12th time. Today, SAIC Motor has also set its sights on an automotive future shaped by electrification, autonomous driving, and intelligent human interfaces, with big data analytics capabilities increasingly defined by Machine Learning.

About SAIC Motor

SAIC Motor Corporation Limited (SAIC Motor) is the largest auto company on China’s A-share market (Stock Code: 600104). SAIC Motor’s business covers the research, production and sales of both passenger cars and commercial vehicles. It also covers components, including engines, gearboxes, powertrains, chassis, interior and exterior parts and miscellaneous electronic components, as well as logistics, vehicle telematics, second-hand vehicle transactions and auto finance services. Major vehicle companies under the SAIC Motor umbrella include SAIC GM, SAIC VW, SAIC Motor Passenger Vehicle Company, SAIC Motor Commercial Vehicle Company, SAIC-GM Wuling Automobile Co and others. SAIC Motor’s North America business includes a division based in Michigan providing logistics and supply chain services, an investment division in Menlo Park, CA, and an advanced technology research and development center based in San Jose, CA.

BigML Winter 2017 Release Webinar Video is Here!

As announced in our latest blog posts, Boosted Trees is the new supervised learning technique that BigML offers to help you solve your classification and regression problems. It is now up and running as part of our set of ensemble-based strategies available through the BigML Dashboard and our REST API.

If you missed the webinar broadcast yesterday, here is another chance to catch our latest addition. In fact, you can play it anytime you wish, since it’s available on the BigML Youtube channel.

Please visit our dedicated Winter 2017 Release page for more learning resources, including:

  • The Boosted Trees documentation to learn how to create, interpret and make predictions with this algorithm, from both the BigML Dashboard and the BigML API.
  • The series of six blog posts that guide you through the Boosted Trees journey step by step: starting with the basic concepts of the algorithm and how it differs from the other ensembles we offer; continuing with a use case and several examples of how to use Boosted Trees through the Dashboard and API, as well as how to automate the workflows with WhizzML and the Python Bindings; and finally wrapping up with the more technical side of how Boosted Trees work behind the scenes.

Many thanks for your attention, your questions, and the positive feedback after the webinar. We cannot wait to announce the next release!

The Down Low on Boosting

If you’ve been following our blog posts recently, you know that we’re about to release another variety of ensemble learner, Boosted Trees. Specifically, we’ve implemented a variation called gradient boosted trees.

Let’s quickly review our existing ensemble methods. Decision Forests take a dataset with either a categorical or numeric objective field and build multiple independent tree models using samples of the dataset (and/or fields). At prediction time, each model gets to vote on the outcome. The hope is that the mistakes of each tree will be independent from one another. Therefore, in aggregate, their predictions will come to the correct answer. In ML parlance, this is a way to reduce the variance.
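The variance-reduction idea can be shown with a toy simulation (plain Python, not BigML’s implementation): when each tree errs independently, majority voting makes the ensemble wrong far less often than any single tree.

```python
import random
from collections import Counter

random.seed(0)

def tree_predict(truth, error_rate=0.3):
    """A stand-in for one tree: right most of the time, wrong independently."""
    return truth if random.random() > error_rate else 1 - truth

def forest_predict(truth, n_trees=101):
    """Majority vote over many independent trees."""
    votes = Counter(tree_predict(truth) for _ in range(n_trees))
    return votes.most_common(1)[0][0]

trials = 1000
single_errors = sum(tree_predict(1) != 1 for _ in range(trials))
forest_errors = sum(forest_predict(1) != 1 for _ in range(trials))
print(single_errors > forest_errors)
```

With a 30% per-tree error rate, the single tree is wrong roughly 300 times out of 1000, while the voting forest is almost never wrong: the independent mistakes cancel out in aggregate.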


With Boosted Trees the process is significantly different. The trees are built in serial and each tree tries to correct for the mistakes of the previous. When we make a prediction for a regression problem, the individual Boosted Trees are summed to find the final prediction. For classification, we sum up pseudo-probabilities for each class and run those results through Softmax to create final class probabilities.
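The aggregation step can be sketched with illustrative numbers (plain Python; BigML’s actual implementation differs in detail): regression predictions are a plain sum of the per-tree outputs, while classification sums per-class scores and then applies Softmax.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Regression: each boosted tree contributes a partial value; the final
# prediction is simply the sum of the per-iteration outputs.
tree_outputs = [12.0, 3.5, -1.2, 0.4]      # hypothetical per-iteration outputs
regression_prediction = sum(tree_outputs)

# Classification: per-class pseudo-probabilities are summed across
# iterations, then pushed through softmax for final class probabilities.
class_scores = {"win": 1.8, "lose": -0.4}  # hypothetical summed scores
probs = softmax(list(class_scores.values()))
print(round(regression_prediction, 1), round(probs[0], 3))
```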


Each iteration makes our boosted meta-model more complex. That additional complexity can really pay off for datasets with nuanced interactions between the input fields and the objective. It’s more powerful, but with that power comes the danger of overfitting, as boosting can be quite sensitive to noise in the data.

Many of the parameters for boosting are tools for balancing power against the risk of overfitting. Sampling at each iteration (BigML’s ‘Ensemble Sample’), the learning rate, and the early holdout parameters all help find that balance. Boosting therefore has a lot of parameters, and the need to tune them is one of the downsides of the technique. Luckily, we have a solution on the way: we’ll be connecting Boosted Trees to our Bayesian parameter optimization library (a variant of SMAC), and we’ll describe how to automatically pick boosting parameters in a future blog post.

Another downside to Boosted Trees is that they’re a black box. It’s pretty easy to inspect a decision tree in one of our classic ensembles and understand how it splits up the training data. With boosting, each tree fits a residual of the previous trees, making them nearly impossible to interpret individually in a meaningful way. However, just like our other tree methods, you can get a feel for what the Boosted Trees are doing by inspecting the field importance measurements. As part of BigML’s prediction service, not only do we build global field importance measures, we also report which fields were most important on a per-prediction basis.


On the advantageous side, BigML’s Boosted Trees support the missing data strategies available with our other tree techniques. If you have data that contains missing values and if those have inherent meaning (e.g. someone decided to leave ‘age’ unanswered in a personals ad), then you may explicitly model the missing values regardless of the field’s type (numeric, categorical, etc.). But if missing values don’t have any meaning, and just mean ‘unknown’, you can use our proportional prediction technique to ignore the impact of the missing fields. This technique is what we use when building our Partial Dependence Plots (or PDPs), which evaluate the Boosted Trees right in your browser to help visualize the impact of the various input fields on your predictions.


We think our Boosted Trees are already a strong addition to the BigML toolkit, but we’ll continue expanding the service to make it even more interpretable via fancier PDPs, easy to use with parameter optimization, and more powerful thanks to customized objective functions.

Want to know more about Boosted Trees?

We recommend that you visit the dedicated release page for more documentation on how to create Boosted Trees, interpret them, and predict with them through the BigML Dashboard and the API; as well as the six blog posts of this series, the slides of the webinar, and the webinar video.

Boosted Trees with WhizzML and Python Bindings


In this fifth post about Boosted Trees, we adopt the point of view of a user who feels comfortable using a programming language. If you follow this blog, you probably know about WhizzML and our bindings, which allow for programmatic usage of all the BigML platform’s resources.


In order to easily automate the use of BigML’s Machine Learning resources, we maintain a set of bindings which allow users to work with the platform in their favorite language. Currently, there are 9 bindings for popular languages like Java, C#, Objective-C, PHP or Swift. In addition, last year we released WhizzML to help developers create sophisticated Machine Learning workflows and execute them entirely in the cloud, thus avoiding network problems, memory issues or lack of computing capacity, while taking full advantage of WhizzML’s built-in parallelization. In the past, we wrote about using WhizzML to perform Gradient Boosting, and now we are making it even easier with our Winter 2017 release.

In this post, we will show how to use Boosted Trees through both the bindings and WhizzML. For the bindings example, we will use our popular Python binding, but the operations described here are available in all the bindings. With the preamble out of the way, let’s see how to create Boosted Trees without specifying any particular option, that is, with all default settings. We need to start from an existing Dataset to create any kind of model in BigML, so our call to the API will need to include the ID of the dataset we want to use. In addition, we’ll need to provide the boosting-related parameters. For now, let’s just use the default ones, which is achieved by setting the boosting attribute to an empty map in JSON. We would do that in WhizzML as below,

where ds1 should be a dataset ID. This ID should be provided as input to execute the script.

It’s the same way that you should create a decision tree ensemble, with the difference being the addition of the “boosting” parameter.

In Python bindings the equivalent code is:
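A minimal sketch of what that call looks like, assuming the standard `bigml` Python package and credentials in the environment (the dataset ID is a placeholder):

```python
# Sketch, assuming the `bigml` Python bindings are installed and
# BIGML_USERNAME / BIGML_API_KEY are set in the environment:
#
#   from bigml.api import BigML
#   api = BigML()
#   ensemble = api.create_ensemble("dataset/58c051f6983efc2710001302",
#                                  {"boosting": {}})
#   api.ok(ensemble)  # wait until the ensemble is finished
#
# The only difference from a plain decision forest is the "boosting" key;
# an empty map requests all defaults:
default_boosting_args = {"boosting": {}}
print("boosting" in default_boosting_args)
```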


Let’s see now how to customize the options of Boosted Trees. To have a list of all properties that BigML offers to customize gradient boosting algorithm, please visit the ensembles page in the API documentation section. In a WhizzML script, the code should include the settings we want to use in a map format. For instance, if we want to adjust all available properties the code should be:


The equivalent code in the Python bindings would read:
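As a hedged sketch, the creation arguments might look like the map below; the parameter names follow the BigML ensembles API documentation, but the values are illustrative only:

```python
# Hypothetical, illustrative values for the main boosting options:
boosting_args = {
    "boosting": {
        "iterations": 300,
        "learning_rate": 0.1,
        "early_holdout": 0.2,       # fraction held out to test each iteration
        "early_out_of_bag": False,  # disabled here in favor of the holdout
    },
    "ensemble_sample": {"rate": 0.65, "replacement": False, "seed": "foo"},
    "objective_field": "000013",
}
# With the bindings, this map is just the second argument:
#   ensemble = api.create_ensemble("dataset/<id>", boosting_args)
print(boosting_args["boosting"]["iterations"])
```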


Creation arguments in Python bindings are structured as a dictionary. They are consistent with the natural dictionary representation of JSON objects in the language.

When we were talking about creating Boosted Trees, we explained some applicable parameters that can help you improve your results by proper tuning. It’s very easy to evaluate your Boosted Trees either through WhizzML or the Python bindings: you just need to set the ensemble to evaluate and the test dataset to be used for the evaluation.


Similarly, we can use the Python syntax as follows:
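A minimal sketch of the evaluation call, assuming the `bigml` bindings’ `create_evaluation` method (IDs are placeholders, and the mock dictionary only illustrates where the metrics end up):

```python
# Sketch, assuming the `bigml` Python bindings:
#
#   from bigml.api import BigML
#   api = BigML()
#   evaluation = api.create_evaluation(
#       "ensemble/<id>",   # the Boosted Trees ensemble
#       "dataset/<id>")    # the held-out test dataset
#   api.ok(evaluation)
#
# Once finished, the metrics live (roughly) under this structure:
evaluation = {"object": {"result": {"model": {"accuracy": 0.83}}}}
print(evaluation["object"]["result"]["model"]["accuracy"])
```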


Next up, let’s see how to obtain single predictions from our Boosted Trees once we are past the evaluation stage. For this, we need the ensemble ID and some input data that should be provided with “input_data” parameter. Here’s an example:

The equivalent code in the Python bindings would be:
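A minimal sketch, assuming the `bigml` bindings; the ensemble ID, input field and values below are made up for illustration:

```python
# Sketch, assuming the `bigml` Python bindings:
#
#   from bigml.api import BigML
#   api = BigML()
#   prediction = api.create_prediction(
#       "ensemble/<id>",
#       input_data={"Best Actor Nomination": "Yes"})
#   api.ok(prediction)
#
# The predicted value is stored (roughly) under this structure:
prediction = {"object": {"output": "Win"}}   # mock of the returned shape
print(prediction["object"]["output"])
```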


In addition to this prediction, calculated and stored in BigML servers, the Python bindings allow you to instantly create single local predictions on your computer. The ensemble information is downloaded to your computer the first time it is used, and since predictions are then computed on your machine, there are no additional costs or latency involved. Here is the straightforward code snippet for that:
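A minimal sketch, assuming the bindings’ local `Ensemble` class; the ID and input fields are placeholders:

```python
# Sketch, assuming the `bigml` Python bindings:
#
#   from bigml.ensemble import Ensemble
#
#   # The first instantiation downloads and caches the ensemble JSON;
#   # every predict() call after that runs entirely on your machine:
#   local_ensemble = Ensemble("ensemble/<id>")
#   prediction = local_ensemble.predict({"Best Actor Nomination": "Yes"})
#
# Since predict() is just a local function call, batching is a plain loop:
def predict(row):
    return "stand-in prediction"   # placeholder for local_ensemble.predict

new_rows = [{"field": v} for v in ("a", "b", "c")]
local_predictions = [predict(row) for row in new_rows]
print(len(local_predictions))
```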


You can create batches of local predictions by using the predict method in a loop. Alternatively, you can upload the new dataset you want predictions for to BigML; in this case, the results will be stored in the platform when the batch prediction process finishes. Let’s see how to realize this latter option first in Python:
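A minimal sketch of the remote batch prediction, assuming the bindings’ `create_batch_prediction` call (IDs are placeholders, and the option values are illustrative):

```python
# Sketch, assuming the `bigml` Python bindings:
#
#   from bigml.api import BigML
#   api = BigML()
#   batch = api.create_batch_prediction(
#       "ensemble/<id>",           # the Boosted Trees ensemble
#       "dataset/<id>",            # the new data, uploaded as a dataset
#       {"output_dataset": True})  # also store the results as a dataset
#   api.ok(batch)
#   api.download_batch_prediction(batch, filename="predictions.csv")
#
# Configuration is passed as a plain map of options, e.g.:
args = {"output_dataset": True, "header": True}
print(sorted(args))
```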


The equivalent code to complete this batch prediction by using WhizzML can be seen below:


A batch prediction comes with configuration options related to the input format, such as fields_map, which can be used to map the dataset fields to the ensemble fields if they are not identical. Other options affect the output format, like header or separator. You can provide any of these arguments at creation time following the syntax described in the API documents. We recommend that our readers check out all batch prediction options in the corresponding API documents section.

We hope this post has further encouraged you to start using WhizzML or some of our bindings to more effectively analyze and take action with your data in BigML. We are always open to community contributions to our existing bindings or to any new ones that you think we may not yet support.

Don’t miss our next post if you would like to find out what’s happening behind the scenes of BigML’s Boosted Trees.

To learn more about Boosted Trees, or to send us your questions about WhizzML or the bindings, please visit the dedicated release page for more documentation on how to create Boosted Trees, interpret them, and predict with them through the BigML Dashboard and the API; as well as the six blog posts of this series, the slides of the webinar, and the webinar video.

Programming Boosted Trees

In this, the fourth of our blog posts for the Winter 2017 release, we will explore how to use Boosted Trees from the API. Boosted Trees are the latest supervised learning technique in BigML’s toolbox. As we have seen, they differ from more traditional ensembles in that no tree tries to make a correct prediction on its own, but rather is designed to nudge the overall ensemble towards the correct answer.

This post will be very similar to our second post about using Boosted Trees in the BigML Dashboard. Anything that can be done from the Dashboard can be done with our API. Resources created using the BigML API can all be seen in the Dashboard view as well so you can take full advantage of our visualizations.


If you have never used the API before, you will need to go through a quick setup. Simply set the environment variables BIGML_USERNAME, BIGML_API_KEY and BIGML_AUTH. BIGML_USERNAME is just your username. Your BIGML_API_KEY can be found in the Dashboard by clicking on your username to pull up the Account page, and then clicking on API Key. BIGML_AUTH is set as a combination of the two:

export BIGML_AUTH="username=$BIGML_USERNAME;api_key=$BIGML_API_KEY"

1. Upload Your Data

Just as with the Dashboard, your first step is uploading some data to be processed. You can point to a remote source, or upload directly from your computer in a variety of popular file formats.

To do this, you can use the terminal with curl, or any other command-line tool that speaks HTTPS. In this example, we are uploading the local file used in our last blog post, ‘oscars.csv’.

curl "https://bigml.io/source?$BIGML_AUTH" \
       -F file=@oscars.csv

2. Create a Dataset

A BigML dataset resource is a serialized form of your data, with some simple statistics already calculated and ready to be processed by Machine Learning algorithms. To create a dataset from your uploaded data, use:

curl "https://bigml.io/dataset?$BIGML_AUTH" \
       -X POST \
       -H 'content-type: application/json' \
       -d '{"source": "source/58c05080983efc27100012fd"}'

To be sure we are creating a meaningful Boosted Trees model, we need to split this dataset into two parts: a training dataset to create the model, and a test dataset to evaluate how the model is doing. We will need two more commands to do just that:

curl "https://bigml.io/dataset?$BIGML_AUTH" \
       -X POST \
       -H 'content-type: application/json' \
       -d '{"origin_dataset": "dataset/58c051f6983efc2710001302",
            "sample_rate": 0.8, "seed": "foo"}'
curl "https://bigml.io/dataset?$BIGML_AUTH" \
       -X POST \
       -H 'content-type: application/json' \
       -d '{"origin_dataset": "dataset/58c051f6983efc2710001302",
            "sample_rate": 0.8, "out_of_bag": true, "seed": "foo"}'

This is pretty similar to how we created our dataset, with some key differences. First, since we are creating these datasets from another dataset, we need to use “origin_dataset”. We are sampling at a rate of 80% for the first training dataset, and then setting “out_of_bag” to true to get the other 20% for the second test dataset. The seed is arbitrary, but we need to use the same one for each dataset.
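The complementary sampling trick can be illustrated in miniature with plain Python (this is not BigML’s actual sampler): with the same seed, the 80% sample and its out-of-bag complement partition the rows exactly.

```python
import random

def sample_rows(rows, rate, seed, out_of_bag=False):
    """Deterministic sample: the same seed yields the same in/out split."""
    rng = random.Random(seed)
    keep = [rng.random() < rate for _ in rows]
    return [r for r, k in zip(rows, keep) if k != out_of_bag]

rows = list(range(100))
training = sample_rows(rows, 0.8, seed="foo")
test = sample_rows(rows, 0.8, seed="foo", out_of_bag=True)

# Same seed: the two samples are disjoint and together cover every row.
print(len(training) + len(test) == 100, set(training) & set(test) == set())
```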

3. Create Your Boosted Trees

Using the training dataset, we will now make an ensemble. A BigML ensemble will construct Boosted Trees if it is passed a “boosting” parameter, a map of boosting options. In the example below, we request ten iterations with a learning rate of 10%. BigML automatically picks the last field of your dataset as the objective field. If this is incorrect, you will want to explicitly pass the objective field ID.

curl "https://bigml.io/ensemble?$BIGML_AUTH" \
       -X POST \
       -H 'content-type: application/json' \
       -d '{"dataset": "dataset/58c053ac983efc2708000bbf",
            "objective_field": "000013",
            "boosting": {"iterations": 10, "learning_rate": 0.10}}'

Some other parameters for Boosting include:

  • early_holdout: The portion of the dataset that will be held out for testing at the end of every iteration. If no significant improvement is made on the holdout, Boosting will stop early. The default is zero.
  • early_out_of_bag: Whether Out of Bag samples are tested after every iteration and may result in an early stop if no significant improvement is made. To use this option, an “ensemble_sample” must also be requested. The default is true.
  • ensemble_sample: The portion of the input dataset to be sampled for each iteration in the ensemble. The default rate is 1, with replacement true.

For example, we will try setting “early_out_of_bag” to true. To do this, we will also have to set an “ensemble_sample”, say to 65%. This looks like:

curl "https://bigml.io/ensemble?$BIGML_AUTH" \
       -X POST \
       -H 'content-type: application/json' \
       -d '{"dataset": "dataset/58c053ac983efc2708000bbf",
            "objective_field": "000013",
            "boosting": {"iterations": 10, "learning_rate": 0.10, "early_out_of_bag": true},
            "ensemble_sample": {"rate": 0.65, "replacement": false, "seed": "foo"}}'

4. Evaluate your Boosted Trees

In order to see how well your model is performing, you will need to evaluate it against some test data. This will return an evaluation resource with a result object. For classification models, this will include accuracy, average_f_measure, average_phi, average_precision, average_recall, and a confusion_matrix for the model. So that we can be sure the model is making useful predictions, we include these same statistics for two simplistic alternative predictors: one that picks random classes and one that always picks the most common class. For regression models, we include the average_error, mean_squared_error, and r_squared. Similarly we compare regression models to a random predictor and a predictor which always chooses the mean.

curl "https://bigml.io/evaluation?$BIGML_AUTH" \
       -X POST \
       -H 'content-type: application/json' \
       -d '{"dataset": "dataset/58c0543e983efc2702000c51",
            "ensemble": "ensemble/58c05480983efc2710001306"}'

5. Make Predictions

Once you are satisfied with your evaluation, you can create one last Boosted Trees model with your entire dataset. Now it is ready to make predictions on some new data. This is done in similar fashion to other BigML models.

curl "https://bigml.io/batchprediction?$BIGML_AUTH" \
       -X POST \
       -H 'content-type: application/json' \
       -d '{"ensemble": "ensemble/58c05480983efc2710001306",
            "dataset": "dataset/58c0543e983efc2702000c51"}'

In our next post of the series, we will see how to automate these steps with WhizzML, BigML’s domain-specific scripting language, and the Python Bindings.

To find out exactly how Boosted Trees work, please visit the dedicated release page for more documentation on how to create Boosted Trees, interpret them, and predict with them through the BigML Dashboard and the API; as well as the six blog posts of this series, the slides of the webinar, and the webinar video. 

Boosting the Oscars

In this blog post, the third one in our six post series on Boosted Trees, we will bring the power of Boosted Trees to a specific example. As we have seen in our previous post, Boosted Trees are a form of supervised learning that combine every tree in the ensemble additively to answer classification and regression problems. With BigML’s simple and beautiful dashboard visualizations, we’ll revisit our answer to who will win the Oscar for Best Actor.

The Data

We reuse the data engineered for our recent Oscar predictions post, drawn from many sources (particularly related awards), to see if we can answer one of the biggest questions in Hollywood: who will win at the Oscars this year? We generally did well with our Random Decision Forests: of the eight categories we attempted, we got five correct, and another two were knife’s-edge calls between the winner and our picks. But can we do even better with Boosted Trees?

The Chart

One major way Boosted Trees differ from Random Decision Forests is that there are more parameters that can be changed. This is both powerful, as we can tune the ensemble to exactly what we want, and intimidating, as there are so many knobs to turn! In a future blog post, we will show how to automatically choose those parameters. In this example, however, we will be working with the iterations slider.

As we have seen, Boosted Trees use each iteration to improve on the previous one. It may seem like more iterations are always better, but this is not always the case. Sometimes we are slowly stepping toward an optimal answer, yet the improvement in each iteration is so slight that it isn’t worth the time invested. So how do we know when to stop? That’s what early stopping does for us. BigML has two forms of early stopping, Holdout and Out of Bag. Holdout reserves some subset of the training data to evaluate how far we have come with each iteration; if the improvement is minimal, the ensemble stops building, then reruns using all of the data for the chosen number of iterations. Out of Bag uses some of the training data not currently being used to build this iteration to gauge the improvement. It is generally faster than Holdout early stopping, but because it reuses data that was used for training in earlier iterations, it is not as clean a test.
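The early-stopping idea can be sketched in miniature (plain Python with made-up error numbers; BigML’s actual criterion is more sophisticated): keep iterating while the holdout error drops meaningfully, and stop once the improvement falls below a threshold.

```python
# Hypothetical per-iteration errors on the held-out data:
holdout_errors = [0.40, 0.31, 0.26, 0.24, 0.235, 0.234, 0.2339]

def iterations_until_stop(errors, min_improvement=0.01):
    """Return the iteration at which improvement becomes negligible."""
    for i in range(1, len(errors)):
        if errors[i - 1] - errors[i] < min_improvement:
            return i  # improvement too small: stop here
    return len(errors)

print(iterations_until_stop(holdout_errors))
```

With these numbers, the loop stops at iteration 4, since the error only improves by 0.005 there, well under the 1% threshold.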

In this example, we chose just 10 iterations with a learning rate of 30%. In general, lower learning rates can help find the best solutions, but need more iterations to get there. Our example also uses the Out of Bag early stopping option.


With the Ensemble Summary Report we can see that the two most important fields to this decision are the number of Oscar Categories Nominated and whether it had a Best Actor Nomination.

With the field importance chart, we can also see what other categories are important: Reviews, BAFTA winner, Screen Actors Guild winner, and LA Film Critics Association nominee. We can already see an aberration with this model; clearly an actor must be nominated for best actor to win the award. So we’d expect that to be the most important field, not the second.

Looking at the PDP, we see it is broken into four main sections. The two bluish sections are where the probability is greatest that the movie doesn’t win a Best Actor award, while the red sections are where the probability is that it does. Again, something strange is going on here. The upper right quadrant is coded red which means the model believes an actor could win the award even without a nomination!

Let’s create a different Boosted Tree, this time with 500 iterations and a 10% learning rate. As before, we will employ tree sampling at 65%, building each iteration on a subset of the total training data. For classification problems, there is one tree per class per iteration, for regression problems, just one tree per class.


Already we see an improvement. Whether the film is nominated for a Best Actor Oscar is now the most important field. The other top fields include whether it won a Screen Actors Guild award for Best Actor, User Reviews, and its overall rating. This is very different from our first example, which relied heavily on other awards. We also see, as we expect, that movies that didn’t get nominated will not get a best actor award.


But what exactly do our Boosted Trees predict? Looking just at the more promising second model, we can create a Batch Prediction with the movies data just from 2016.


In order to get the probabilities of each row, we will go under Configure, and then Output Settings to select the percent sign icon. This will add two columns to our output dataset, one for each class in our objective field: the probability that the movie wins a Best Actor Oscar and the probability that it does not. This way, we can see not only whether the model predicts a win, but also by how much.


Our Boosted Trees predict… drumroll please… four different actors might win the Oscar! That is, four different actors have a very good chance of winning. Let’s see who we have: Ryan Gosling in La La Land, Denzel Washington in Fences, Andrew Garfield in Hacksaw Ridge, and finally Casey Affleck in Manchester by the Sea.

Here are the normalized probabilities. All four of these candidates are within a few percent of each other, with Mr. Affleck perhaps the furthest behind. No wonder our model picked four winners! And no wonder we had such a hard time predicting the win with our Random Decision Forest. The race was simply too close to call until the big night.

In the next post, we will see how to create Boosted Trees from the BigML API.

Would you like to know more about Boosted Trees? Please visit the dedicated release page for more documentation on how to create Boosted Trees, interpret them, and predict with them through the BigML Dashboard and the API; as well as the six blog posts of this series, the slides of the webinar, and the webinar video. 

The Six Steps to Boosted Trees

BigML is bringing Boosted Trees to our ever-growing suite of supervised learning techniques. Boosting is a variation on ensembles that aims to reduce bias, potentially leading to better performance than Bagging or Random Decision Forests.

In the first post of this six-part series on Boosted Trees, we gave a gentle introduction to get some context about what this new resource is and how it can help you solve your classification and regression problems. This post will take us further, into the detailed steps of how to use boosting with BigML.


Step 1: Import Your Data

To learn from our data, we must first upload it. There are several ways to upload your data to BigML. The easiest is to navigate to the Dashboard and click on the sources tab on the far left. From there you can create a source by importing from Google Drive, Google Storage, Dropbox, or MS Azure. If your dataset is not terribly large, creating an inline source by directly typing in the data may appeal to you. You can also create a source from a remote URL, or by uploading a local file (in .csv, .tsv, .txt, .json, .arff, .data, .gz, or .bz2 format).


Step 2: Create Your Dataset

Once a file is uploaded as a source, it can be turned into a dataset. From your Source view, use the 1-click Dataset option to create a dataset, a structured version of your data ready to be used by a Machine Learning algorithm.


In the dataset view, you will be able to see a summary of your field values, some basic statistics and the field histograms to analyze your data distributions. This view is really useful to see any errors or irregularities in your data. You can filter the dataset by several criteria and even create new fields from your existing data.


Once your data is free of errors you will need to split your dataset into two different subsets: one for training your Boosted Trees, and the other for testing. It is crucial to train and evaluate supervised learning models with different data to get a true evaluation and not be tricked by overfitting. You can easily split your dataset using the BigML 1-click option or the configure option menu, which randomly splits 80% of the data for training and sets aside 20% for testing.
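Conceptually, the split is just a seeded shuffle followed by a cut. BigML handles this server-side, but a minimal sketch of the idea looks like this:

```python
import random

def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle and split rows, mirroring the 80/20 default split."""
    shuffled = rows[:]                      # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # seeded, hence reproducible
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(100))
train, test = train_test_split(rows)
```

Because the two subsets are disjoint, the evaluation in Step 5 measures performance on data the ensemble has never seen.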


Step 3: Create Your Boosted Trees

To create Boosted Trees, make sure you are viewing the training split of your dataset, then click on Configure Ensemble under the configure option menu. By default, the last field of your dataset is chosen as the objective field, but you can easily change this with the dropdown on the left. To enable boosting, choose Boosted Trees under Type. This will open up the Boosting tab under Advanced Configuration.


You can, of course, use the default settings now and click on Create Ensemble. But Machine Learning is at its most powerful when you, the user, bring your own domain-specific knowledge to the problem. You will get the best results if you ‘turn some knobs’ and alter the default settings to suit your dataset and problem (in a later blog post we’ll discuss automatically finding good parameters).


BigML offers many different parameters to tune. One of the most important is the number of iterations. This controls how many individual trees will be built; one tree per iteration for regression and one tree per class per iteration for classification.

Other parameters that can be found under Boosting include:

  • Two forms of early stopping: These will keep the ensemble from performing all the iterations, saving running time and perhaps improving performance. Early Holdout tries to find the optimal stopping point by completely reserving a portion of the data to test for improvement at each iteration, while Early Out of Bag simply tests against the out-of-bag data (the data not used in the tree sampling).
  • The Learning Rate: The default is 10%. The learning rate controls how far to step in the gradient direction; in general, a smaller step size will lead to more accurate results, but will take longer to get there.

Another useful parameter to change is found under Tree Sampling:

  • The Ensemble Rate option ensures that each tree is only created with a subset of your training data, and generally helps prevent overfitting.
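To see how the number of iterations and the learning rate interact, here is a toy regression boosting loop built on one-split stumps. This is our own illustration of the general mechanics, not BigML's implementation: each iteration fits a stump to the current residuals, then steps toward them by the learning rate.

```python
def fit_stump(xs, residuals):
    """Least-squares single-split regression stump on one numeric feature."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, lmean, rmean)
    _, threshold, lmean, rmean = best
    return lambda x: lmean if x <= threshold else rmean

def boost(xs, ys, iterations=50, learning_rate=0.1):
    """Toy gradient boosting for squared error."""
    f0 = sum(ys) / len(ys)                  # initial constant prediction
    stumps = []
    predictions = [f0] * len(ys)
    for _ in range(iterations):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)    # fit the current mistakes
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x)
                       for p, x in zip(predictions, xs)]
    return lambda x: f0 + learning_rate * sum(s(x) for s in stumps)

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.1, 0.9, 1.0, 3.0, 3.1, 2.9, 3.0]
model = boost(xs, ys)
```

A smaller learning rate shrinks each correction, so more iterations are needed to reach the same fit, which is exactly the trade-off described above.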

Step 4: Analyze Your Boosted Trees

Once your Boosted Trees are created, the resource view will include a visualization called a partial dependence plot, or PDP. This chart ignores the influence of all but the two fields displayed on the axes. If you want other fields to influence the results, you can select them by checking the box in the input fields section or by making them an axis.


The axes are initially set to the two most important fields. You can change the fields at any time by using the dropdown menus near the X and Y. Each region of the grid is colored based on the class and probability of its prediction. To see the probability in more detail, mouse over the grid and the exact probability appears in the upper righthand area.
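Conceptually, the grid behind a PDP is produced by sweeping the two chosen fields while holding every other input fixed. A sketch of that idea (the field names and the toy model here are hypothetical; BigML's actual rendering is more sophisticated):

```python
def partial_dependence_grid(predict, x_values, y_values, fixed_inputs):
    """Evaluate a model over a 2-D grid, holding all other fields fixed."""
    grid = {}
    for x in x_values:
        for y in y_values:
            inputs = dict(fixed_inputs, field_x=x, field_y=y)
            grid[(x, y)] = predict(inputs)
    return grid

# A stand-in model: any callable taking an input dict works here.
def toy_predict(inputs):
    return 1 if inputs["field_x"] + inputs["field_y"] > 10 else 0

grid = partial_dependence_grid(toy_predict, range(0, 11, 2), range(0, 11, 2), {})
```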

Step 5: Evaluate Your Boosted Trees

But how do you know if your parameters are indeed tuned correctly? You need to evaluate your Boosted Trees by comparing their predictions with the actual values in your test dataset.


To do this, in the ensemble view click on Evaluate under the 1-click action menu. You can change the dataset to evaluate against, but the default 20% test dataset is perfect for this procedure. Click on Evaluate to execute, and you will see the familiar evaluation visualization, depending on whether your problem is a classification or a regression.


Step 6: Make Your Predictions

When you have results you are happy with, it’s time to make some predictions. Create more Boosted Trees with the parameters set the way you like, but this time train them on the entire dataset. This way, all your data is informing your decisions.

Boosted Trees differ from our other ensemble predictions in that they do not return a confidence (for classification) but rather the probabilities for all the classes in the objective field.

Now you can make a prediction on some new data. Just as with BigML’s previous supervised learning models, you can make a single prediction for just one instance, or a batch prediction for a whole dataset.
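A common way for boosted classifiers to turn their additive per-class scores into the probabilities described above is a softmax. The sketch below shows that mechanic as an illustration of the general technique, not as BigML's documented internals:

```python
import math

def class_probabilities(scores):
    """Convert per-class additive scores into probabilities via softmax."""
    m = max(scores.values())                       # shift for numerical stability
    exp = {c: math.exp(s - m) for c, s in scores.items()}
    total = sum(exp.values())
    return {c: e / total for c, e in exp.items()}

# Hypothetical summed scores for a binary objective field.
probs = class_probabilities({"Yes": 1.2, "No": -0.4})
```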


In the ensemble view, click on Prediction (or Batch Prediction) under the 1-click action menu. The left hand side will already have your Boosted Trees. Choose the dataset you wish to run your prediction on from the dropdown on the right. You can, of course, customize the name and prediction output settings. Scroll down to click on Predict to create your prediction.


In the next post, we will see these six steps in action when BigML takes boosting to the Oscars. Stay tuned!

Would you like to find out exactly how Boosted Trees work? Please visit the dedicated release page for more documentation on how to create Boosted Trees, interpret them, and predict with them through the BigML Dashboard and the API; as well as the six blog posts of this series, the slides of the webinar, and the webinar video. 

Introduction to Boosted Trees

We are happy to share that BigML is bringing Boosted Trees to the Dashboard and the API as part of our Winter 2017 Release. This newest addition to our ensemble-based strategies is a supervised learning technique that can help you solve your classification and regression problems even more effectively.

To best inform you about our impending launch, we have prepared a series of blog posts that will help you get a good understanding prior to the official launch. Today, we start with the basic concepts of Boosted Trees. Subsequent posts will gradually dive deeper to help you become a master of this new resource: demonstrating how to use this technique with the BigML Dashboard, presenting a use case that will help you discern when to apply Boosted Trees, showing how to use Boosted Trees with the API, and explaining how to properly automate them with WhizzML. Finally, we will conclude with a detailed technical view of how BigML Boosted Trees work under the hood.

Let’s begin our Boosted Trees journey!

Why Boosted Trees?

First of all, let’s recap the different strategies based on single decision trees and ensembles BigML offers to solve classification and regression problems, and find out which technique is the most appropriate to achieve the best results depending on the characteristics of your dataset.

  • BigML Models were the first technique BigML implemented; they use a proprietary decision tree algorithm based on the Classification and Regression Trees (CART) algorithm proposed by Leo Breiman. Single Decision Trees are composed of nodes and branches that create a model of decisions with a tree graph. The nodes represent the predictors or labels that have an influence on the predictive path, and the branches represent the rules followed by the algorithm to make a given prediction. Single decision trees are a good choice when you value the human-interpretability of your model: unlike many Machine Learning techniques, individual decision trees are easy for a human to inspect and understand.

  • Bagging (or Bootstrap Aggregating), the second prediction technique brought to the BigML Dashboard and API, uses a collection of trees rather than a single one, each tree built with a different random subset of the original dataset. Specifically, BigML defaults to a sampling rate of 100% (with replacement) for each model, which means some of the original instances will be repeated and others will be left out. Bagging performs well when a dataset has many noisy features and only one or two are relevant; in those cases, it will be the best option.

  • Random Decision Forests extend the Bagging technique by only considering a random subset of the input fields at each split of the tree. By adding randomness in this process, Random Decision Forests help avoid overfitting. When there are many useful fields in your dataset, Random Decision Forests are a strong choice.

In Bagging or Random Decision Forests, the ensemble is a collection of models, each of which tries to predict the same field: the problem’s objective field. So depending on whether we are solving a classification or a regression problem, our models will have a categorical or a numeric objective field. Each model is built on a different sample of the data (and, in Random Decision Forests, with a different sample of fields at each split), so their predictions will have some variation. Finally, the ensemble issues a prediction by using the component models’ predictions as votes and aggregating them through one of several strategies (plurality, confidence weighted, or probability weighted).
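The aggregation strategies mentioned here are straightforward to sketch. The following is our illustration of the general idea, with hypothetical votes, not BigML's exact code:

```python
from collections import Counter

def plurality_vote(predictions):
    """Each model casts one vote; the most common class wins."""
    return Counter(predictions).most_common(1)[0][0]

def confidence_weighted_vote(predictions):
    """Each vote counts in proportion to the model's confidence in it."""
    totals = Counter()
    for label, confidence in predictions:
        totals[label] += confidence
    return totals.most_common(1)[0][0]

# Two high-confidence minority votes can be outweighed by one strong vote.
winner = plurality_vote(["churn", "stay", "churn"])
weighted = confidence_weighted_vote([("churn", 0.55), ("stay", 0.9), ("churn", 0.3)])
```

Probability-weighted voting works the same way as the confidence-weighted variant, with per-class probabilities in place of confidences.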

The boosting ensemble technique is significantly different. To begin with, the ensemble is a collection of models that do not predict the real objective field of the ensemble, but rather the improvements needed for the function that computes this objective. As shown in the image above, the modeling process starts by assigning some initial values to this function, and then creates a model to predict which gradient will improve the function results. The next iteration considers both the initial values and these corrections as its original state, and looks for the next gradient to improve the prediction function results even further. The process stops when the prediction function results match the real values or the number of iterations reaches a limit. As a consequence, all the models in the ensemble will always have a numeric objective field: the gradient for this function. The real objective field of the problem is then computed by adding up the contributions of each model, weighted by some coefficients. If the problem is a classification, each category (or class) in the objective field has its own subset of models in the ensemble whose goal is to adjust the function to predict this category.

With Bagging and Random Decision Forests, each tree is independent of the others, making them easy to construct in parallel. Since each boosted tree depends on the previous trees, a Boosted Tree ensemble is inherently sequential. Nonetheless, BigML parallelizes the construction of the individual trees, which means that even though boosting is a computation-heavy technique, we can train Boosted Trees relatively quickly.

How do Boosted Trees work in BigML?

Let’s illustrate how Boosted Trees work with a dataset that predicts the unemployment rate in the US. Before creating the Boosted Tree ensemble, we split the dataset into two parts: one for training and one for testing. Then we train a Boosted Tree ensemble with 5 iterations on the training subset of our data. Thanks to the Partial Dependence Plot (PDP) visualization, we observe that the higher the “Civilian Employment Population Ratio”, the lower the unemployment rate, and vice versa.

But can we trust this prediction? To get an objective measure of how well our ensemble is predicting, we evaluate it in BigML just as we would any other kind of supervised model. The performance of our first attempt is not as good as it could be. A simple way to try to improve it is to increase the number of iterations the boosting algorithm performs. Remember that, in our first attempt, we set that number to the very conservative value of 5. Let’s create another Boosted Tree, but this time with 400 iterations. Voila! If we compare the evaluations of both models, as shown in the image below, we observe a very significant performance boost: lower absolute and squared errors, and a higher R squared value for the ensemble with 400 iterations.
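The metrics used in this comparison are standard. Here is a minimal sketch of mean absolute error and R squared, with made-up numbers for illustration:

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def r_squared(actual, predicted):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Hypothetical unemployment rates vs. one model's predictions.
actual = [4.0, 5.0, 6.0, 5.5]
predicted = [4.1, 4.9, 6.2, 5.4]
```

Lower errors and an R squared closer to 1 both indicate the better-fitting ensemble.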

Stay tuned for the upcoming posts to find out how to create Boosted Trees in BigML, and how to interpret and evaluate them in order to make better predictions.

In Summary

To wrap up this blog post we can say that Boosted Trees:

  • Are a variation of tree ensembles, where the tree outputs are additive rather than averaged (or majority voted).
  • Do not try to predict the objective field directly. Instead, they try to fit a gradient by correcting mistakes made in previous iterations.
  • Are very useful when you have a lot of data and you expect the decision function to be very complex. The effect of additional trees is basically an expansion of the hypothesis space beyond other ensemble strategies like Bagging or Random Decision Forests.
  • Help solve both classification and regression problems, such as: churn analysis, risk analysis, loan analysis, fraud analysis, sentiment analysis, predictive maintenance, content prioritization, next best offer, lifetime value, predictive advertising, price modeling, sales estimation, patient diagnoses, or targeted recruitment, among others.

For more examples on how Boosted Trees work we recommend that you read this blog post as well as this alternative explanation, which contains a visual example.

Want to know more about Boosted Trees?

Please visit the dedicated release page for more documentation on how to create Boosted Trees, interpret them, and predict with them through the BigML Dashboard and the API; as well as the six blog posts of this series, the slides of the webinar, and the webinar video.
