In this, the fourth of our blog posts for the Winter 2017 release, we will explore how to use Boosted Trees from the API. Boosted Trees are the latest supervised learning technique in BigML’s toolbox. As we have seen, they differ from more traditional ensembles in that no single tree tries to make a correct prediction on its own; instead, each tree is designed to nudge the overall ensemble towards the correct answer.
This post will be very similar to our second post, about using Boosted Trees in the BigML Dashboard. Anything that can be done from the Dashboard can also be done with our API, and resources created through the API all appear in the Dashboard as well, so you can take full advantage of our visualizations.
If you have never used the API before, you will need to go through a quick setup. Simply set the environment variables BIGML_USERNAME, BIGML_API_KEY and BIGML_AUTH. BIGML_USERNAME is just your username. Your BIGML_API_KEY can be found in the Dashboard by clicking on your username to pull up the Account page, and then clicking on API Key. BIGML_AUTH is set as a combination of the two:
"username=$BIGML_USERNAME;api_key=$BIGML_API_KEY;"
1. Upload Your Data
Just as with the Dashboard, your first step is uploading some data to be processed. You can point to a remote source, or upload directly from your computer in a variety of popular file formats.
To do this, you can use curl in a terminal, or any other command-line tool that can make HTTPS requests. In this example, we upload ‘oscars.csv’, the local file we used in our last blog post.
curl "https://bigml.io/source?$BIGML_AUTH"
-F file=@oscars.csv
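Like every BigML resource, the source is created asynchronously: the call returns immediately with the new resource’s ID, while processing continues in the background. As a quick sketch (using the source ID that appears in the next step), you can poll the resource with a GET and wait until its status code reaches 5, BigML’s code for a finished resource:

curl "https://bigml.io/source/58c05080983efc27100012fd?$BIGML_AUTH"

The JSON response contains a "status" object; once it shows "code": 5, the source is ready to use.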
2. Create a Dataset
A BigML dataset resource is a serialized form of your data, with some simple statistics already calculated and ready to be processed by Machine Learning algorithms. To create a dataset from your uploaded data, use:
curl "https://bigml.io/dataset?$BIGML_AUTH" -X POST \ -H 'content-type: application/json' \ -d '{"source": "source/58c05080983efc27100012fd"}'
To know whether we are creating a meaningful Boosted Trees ensemble, we need to split this dataset into two parts: a training dataset to create the model, and a test dataset to evaluate how the model is doing. We will need two more commands to do just that:
curl "https://bigml.io/dataset?$BIGML_AUTH" -X POST \ -H 'content-type: application/json' \ -d '{"origin_dataset" : "dataset/58c051f6983efc2710001302", \ "sample_rate" : 0.8, "seed":"foo"}'
curl "https://bigml.io/dataset?$BIGML_AUTH" -X POST \ -H 'content-type: application/json' \ -d '{"origin_dataset" : "dataset/58c051f6983efc2710001302", \ "sample_rate" : 0.8, “out_of_bag” : true, "seed":"foo"}'
This is pretty similar to how we created our original dataset, with some key differences. First, since we are creating these datasets from another dataset, we need to use “origin_dataset”. We sample at a rate of 80% for the training dataset, and then set “out_of_bag” to true to get the remaining 20% for the test dataset. The seed is arbitrary, but the same seed must be used in both requests so that the two samples are complementary.
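If you want to double-check the split, you can fetch each new dataset and compare its "rows" field; a minimal sketch, using the training dataset ID that appears in the next step:

curl "https://bigml.io/dataset/58c053ac983efc2708000bbf?$BIGML_AUTH"

With an 80% sample rate, this dataset should contain roughly 80% of the rows of the original.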
3. Create Your Boosted Trees
Using the training dataset, we will now make an ensemble. A BigML ensemble will construct Boosted Trees if it is passed a “boosting” parameter, a map of Boosting-specific options. In the example below, “boosting” requests ten iterations with a learning rate of 10%. BigML automatically picks the last field of your dataset as the objective field; if that is not the field you want to predict, pass the objective field ID explicitly.
curl "https://bigml.io/ensemble?$BIGML_AUTH" -X POST \ -H 'content-type: application/json' \ -d '{"dataset": "dataset/58c053ac983efc2708000bbf", \ "objective_field":"000013", \ "boosting": {"iterations":10, "learning_rate":0.10}}'
Some other parameters for Boosting include:
- early_holdout: The portion of the dataset that will be held out for testing at the end of every iteration. If no significant improvement is made on the holdout, Boosting will stop early. The default is zero.
- early_out_of_bag: Whether out-of-bag samples are tested after every iteration, which may result in an early stop if no significant improvement is made. To use this option, an “ensemble_sample” must also be requested. The default is true.
- ensemble_sample: The portion of the input dataset to be sampled for each iteration in the ensemble. The default rate is 1, with replacement true.
For example, we will try setting “early_out_of_bag” to true. To do this, we will also have to set an “ensemble_sample”, say at a rate of 65%. This looks like:
curl "https://bigml.io/ensemble?$BIGML_AUTH" -X POST \ -H 'content-type: application/json' \ -d '{"dataset": "dataset/58c053ac983efc2708000bbf", \ "objective_field":"000013", \ "boosting": {"iterations":10, "learning_rate":0.10, "early_out_of_bag":true} \ "ensemble_sample": {"rate": 0.65, "replacement": false, "seed": "foo"}}'
4. Evaluate your Boosted Trees
In order to see how well your model is performing, you will need to evaluate it against some test data. This returns an evaluation resource with a result object. For classification models, this includes accuracy, average_f_measure, average_phi, average_precision, average_recall, and a confusion_matrix for the model. So that we can be sure the model is making useful predictions, we include these same statistics for two simplistic alternative predictors: one that picks random classes and one that always picks the most common class. For regression models, we include the average_error, mean_squared_error, and r_squared. Similarly, we compare regression models to a random predictor and to a predictor that always chooses the mean.
curl "https://bigml.io/evaluation?$BIGML_AUTH" -X POST \ -H 'content-type: application/json' \ -d '{"dataset": "dataset/58c0543e983efc2702000c51", \ "ensemble": "ensemble/58c05480983efc2710001306"}'
5. Make Predictions
Once you are satisfied with your evaluation, you can create one last Boosted Trees model from your entire dataset. It is then ready to make predictions on new data, which works in a similar fashion to other BigML models.
curl "https://bigml.io/batchprediction?$BIGML_AUTH" -X POST \ -H 'content-type: application/json' \ -d '{"ensemble": "ensemble/58c05480983efc2710001306", \ "dataset": "dataset/58c0543e983efc2702000c51"}'
In our next post of the series, we will see how to automate these steps with WhizzML, BigML’s domain-specific scripting language, and the Python Bindings.
To find out exactly how Boosted Trees work, please visit the dedicated release page for more documentation on how to create Boosted Trees, interpret them, and predict with them through the BigML Dashboard and the API; as well as the six blog posts of this series, the slides of the webinar, and the webinar video.