Automating Linear Regressions with WhizzML & Python Bindings

Posted by

This blog post, the fifth of our series of six posts about Linear regressions, focuses on those users that want to automate their Machine Learning workflows using programming languages. If you follow the BigML blog, you may already be familiar with WhizzML, BigML’s domain-specific language for automating Machine Learning workflows, implementing high-level Machine Learning algorithms, and easily sharing them with others. WhizzML helps developers create Machine Learning workflows and execute them entirely in the cloud. This avoids network problems, memory issues and lack of computing capacity while taking full advantage of WhizzML’s built-in parallelization. If you aren’t familiar with WhizzML yet, we recommend that you read the series of posts we published this summer about how to create WhizzML scripts: Part 1, Part 2 and Part 3 to quickly discover the benefits.

Screen Shot 2017-03-15 at 01.51.13To help automate the manipulation of BigML’s Machine Learning resources, we also maintain a set of bindingswhich allow users to work in their favorite language (Java, C#, PHP, Swift, and others) with the BigML platform.

Let’s see how to use Linear Regressions through both the popular BigML Python Bindings and WhizzML. Note that the operations described in this post are also available in this list of bindings.

The first step is creating Linear Regressions with the default settings. We start from an existing Dataset to train the model in BigML so our call to the API will need to include the Dataset ID we want to use for training as shown below:

;; Creates a linearregression with default parameters
(define my_linearregression
  (create-linearregression {"dataset" training_dataset}))

The BigML API is mostly asynchronous, that is, the above creation function will return a response before the Linear Regression creation is completed, usually the response informs that creation has started and the resource is in progress. This implies that the Linear Regression is not ready to predict with it right after the code snippet is executed, so you must wait for its completion before you can start with the predictions. A way to get it once it’s finished is to use the directive “create-and-wait-linearregression” for that:

;; Creates a linearregression with default settings. Once it's
;; completed the ID is stored in my_linearregression variable
(define my_linearregression
  (create-and-wait-linearregression {"dataset" training_dataset}))

If you prefer to use the Python Bindings, the equivalent code is this:

from bigml.api import BigML
api = BigML()

my_linearregression = \
    api.create_linearregression("dataset/59b0f8c7b95b392f12000000")

Next up, we will configure some properties of a Linear Regression with WhizzML. All the configuration properties can be easily added using property pairs such as <property_name> and <property_value> as in the example below. For instance, to create an optimized Linear Regression from a dataset, BigML sets the number of model candidates to 128. If you prefer a lower number of steps, you should add the property “number_of_model_candidates and set it to 10. Additionally, you might want to set the value used by the Linear Regression when numeric fields are missing. Then, you need to set thedefault_numeric_valueproperty to the right value. In the example below, it’s replaced by the mean value.

;; Creates a linearregression with some settings. Once it's
;; completed the ID is stored in my_linearregression variable
(define my_linearregression
  (create-and-wait-linearregression {"dataset" training_dataset
                            "number_of_model_candidates" 10
                            "default_numeric_value" "mean"}))

NOTE: Property names always need to be between quotes and the value should be expressed in the appropriate type, a string or a number in the previous example. The equivalent code for the BigML Python Bindings becomes:

from bigml.api import BigML
api = BigML()
args = {"max_iterations": 100000, "default_numeric_value": "mean"}
training_dataset ="dataset/59b0f8c7b95b392f12000000"
my_linearregression = api.create_prediction(training_dataset, args)

For the complete list of properties that BigML offers, please check the dedicated API documentation.

Once the Linear Regression has been created,  as usual for supervised resources, we can evaluate how good its performance is. Now, we will use a different dataset with non-overlapping data to check the Linear Regression performance.  The “test_dataset” parameter in the code shown below represents the second dataset. Following the motto of “less is more”, the WhizzML code that performs an evaluation has only two mandatory parameters: a Linear Regression to be evaluated and a Dataset to use as test data.

;; Creates an evaluation of a linear regression
(define my_linearregression_ev
 (create-evaluation {"linearregression" my_linearregression "dataset" test_dataset}))

Handy, right? Similarly, using Python bindings, the evaluation is done with the following snippet:

from bigml.api import BigML
api = BigML()
my_linearregression = "linearregression/59b0f8c7b95b392f12000000"
test_dataset = "dataset/59b0f8c7b95b392f12000002"
evaluation = api.create_evaluation(my_linearregression, test_dataset)

Following the steps of a typical workflow, after a good evaluation of your Linear Regression, you can make predictions for new sets of observations. In the following code, we demonstrate the simplest setting, where the prediction is made only for some fields in the dataset.

;; Creates a prediction using a linearregression with specific input data
(define my_prediction
 (create-prediction {"linearregression" my_linearregression
                     "input_data" {"sepal length" 2 "sepal width" 3}}))

The equivalent code for the BigML Python bindings is:

from bigml.api import BigML
api = BigML()
input_data = {"sepal length": 2, "sepal width": 3}
my_linearregression = "linearregression/59b0f8c7b95b392f12000000"
prediction = api.create_prediction(my_linearregression, input_data)

In both cases, WhizzML or Python bindings, in the input data you can use either the field names or the field IDs. In other words, “000002”: 3 or “sepal width”: 3 are equivalent expressions.

As opposed to this prediction, which is calculated and stored in BigML servers, the Python Bindings (and other available bindings) also allow you to instantly create single local predictions on your computer or device. The Linear Regression information will be downloaded to your computer the first time you use it (connectivity is needed only the first time you access the model), and the predictions will be computed locally on your machine, without any incremental costs or latency:

from bigml.linearregression import Linearregression
local_linearregression = Linearregression("linearregression/59b0f8c7b95b392f12000000")
input_data = {"sepal length": 2, "sepal width": 3}
local_linearregression.predict(input_data)

It is similarly pretty straightforward to create a Batch Prediction in the cloud from an existing Linear Regression, where the dataset named “my_dataset” contains a new set of instances to predict by the model:

;; Creates a batch prediction using a linearregression 'my_linearregression'
;; and the dataset 'my_dataset' as data to predict for
(define my_batchprediction
 (create-batchprediction {"linearregression" my_linearregression
                          "dataset" my_dataset}))

The code in Python Bindings that performs the same task is:

from bigml.api import BigML
api = BigML()
my_linearregression = "linearregression/59d1f57ab95b39750c000000"
my_dataset = "dataset/59b0f8c7b95b392f12000000"
my_batchprediction = api.create_batch_prediction(my_linearregression, my_dataset)

Want to know more about Linear Regressions?

Our next blog post, the last one of this series, will cover how Linear regressions work behind the scenes, diving into the technical implementation aspects of BigML’s latest resource. If you have any questions or you’d like to learn more about how Linear Regressions work, please visit the dedicated release page. It includes links to this series of six blog posts, in addition to the BigML Dashboard and API documentation.

One comment

Leave a comment