Programming Linear Regressions

Posted by

In this fourth post of our series, we want to provide a brief summary of all the necessary steps to create a Linear Regression using the BigML API. As mentioned in our earlier posts, Linear Regression is a supervised learning method to solve regression problems, i.e., the objective field must be numeric.

The API workflow to create a Linear Regression and use it to make predictions is very similar to the one we explained for the Dashboard in our previous post. It’s worth mentioning that any resource created with the API will automatically be created in your Dashboard too so you can take advantage of BigML’s intuitive visualizations at any time.

linear_regression_workflow

In case you never used the BigML API before, all requests to manage your resources must use HTTPS and be authenticated using your username and API key to verify your identity. Find below a base URL example to manage Linear Regressions.

https://bigml.io/linearregression?username=$BIGML_USERNAME;api_key=$BIGML_API_KEY

You can find your authentication details in your Dashboard account by clicking in the API Key icon in the top menu.

Screen Shot 2019-02-28 at 6.22.34 PM

The first step in any BigML workflow using the API is setting up authentication. Once authentication is successfully set up, you can begin executing the rest of this workflow.

export BIGML_USERNAME=nickwilson
export BIGML_API_KEY=98ftd66e7f089af7201db795f46d8956b714268a
export BIGML_AUTH="username=$BIGML_USERNAME;api_key=$BIGML_API_KEY;"

1. Upload Your Data

You can upload your data in your preferred format, from a local file, a remote file (using a URL) or from your cloud repository e.g., AWS, Azure etc. This will automatically create a source in your BigML account.

First, you need to open up a terminal with curl or any other command-line tool that implements standard HTTPS methods. In the example below, we are creating a source from a local CSV file containing some house data listed in Airbnb, each row representing one house’s information.

curl "https://bigml.io/source?$BIGML_AUTH" -F file=@airbnb.csv

2. Create a Dataset

After the source is created, you need to build a dataset, which serializes your data and transforms it into a suitable input for the Machine Learning algorithm.

curl "https://bigml.io/dataset?$BIGML_AUTH"
       -X POST
       -H 'content-type: application/json'
       -d '{"source":"source/5c7631694e17272d410007aa"}'

Then, split your recently created dataset into two subsets: one for training the model and another for testing it. It is essential to evaluate your model with data that the model hasn’t seen before. You need to do this in two separate API calls that create two different datasets.

  • To create the training dataset, you need the original dataset ID and the sample_rate  (the proportion of instances to include in the sample) as arguments. In the example below, we are including 80% of the instances in our training dataset. We also set a particular seed argument to ensure that the sampling will be deterministic. This will ensure that the instances selected in the training dataset will never be part of the test dataset created with the same sampling hold out.

curl "https://bigml.io/dataset?$BIGML_AUTH"
       -X POST
       -H 'content-type: application/json'
       -d '{"origin_dataset":"dataset/5c762fcd4e17272d4100072d", 
            "sample_rate":0.8, "seed":"myairbnb"}'
  • For the testing dataset, you also need the original dataset ID and the sample_rate, but this time we combine it with the out_of_bag argument. The out of bag takes the (1- sample_rate) instances, in this case, 1-0.8=0.2. Using those two arguments along with the same seed used to create the training dataset, we ensure that the training and testing datasets are mutually exclusive.

curl "https://bigml.io/dataset?$BIGML_AUTH"
       -X POST
       -H 'content-type: application/json'
       -d '{"origin_dataset":"dataset/5c762fcd4e17272d4100072d", 
            "sample_rate":0.8, "out_of_bag":true, "seed":"myairbnb"}'

3. Create a Linear Regression

Next, use your training dataset to create a Linear Regression. Remember that the field you want to predict must be numeric. BigML takes the last numerical field in your dataset as the objective field by default unless it is specified. In the example below, we are creating a Linear Regression including an argument to indicate the objective field. To specify the objective field you can either use the field name or the field ID:

curl "https://bigml.io/linearregression?$BIGML_AUTH"
       -X POST
       -H 'content-type: application/json'
       -d '{"dataset":"dataset/68b5627b3c1920186f000325", 
            "objective_field":"price"}'

You can also configure a wide range of the Linear Regression parameters at creation time. Read about all of them in the API documentation.

Usually, Linear Regressions can only handle numeric fields as inputs, but BigML automatically performs a set of transformations such that it can also support categorical, text and items input fields. Keep in mind that BigML uses dummy encoding by default, but you can configure other types of transformations using the different encoding options provided.

4. Evaluate the Linear Regression

Evaluating your Linear Regression is key to measure its predictive performance against unseen data.

You need the linear regression ID and the testing dataset ID as arguments to create an evaluation using the API:

curl "https://bigml.io/evaluation?$BIGML_AUTH"
       -X POST
       -H 'content-type: application/json'
       -d '{"linearregression":"linearregression/5c762c6b4e17272d42000617",
            "dataset":"dataset/5c762f3a4e17272d41000724"}'

5. Make Predictions

Finally, once you are satisfied with your model’s performance, use your Logistic Regression to make predictions by feeding it new data. Linear Regression in BigML can gracefully handle missing values for your categorical, text or items fields.

In BigML you can make predictions for a single instance or multiple instances (in batch). See below an example for each case.

To predict one new data point, just input the values for the fields used by the Linear Regression to make your prediction. In turn, you get a prediction result for your objective field along with confidence and probability intervals.

curl "https://bigml.io/prediction?$BIGML_AUTH"
       -X POST
       -H 'content-type: application/json'
       -d '{"linearregression":"linearregression/5c762c6b4e17272d42000617",
            "input_data":{"room":4, "bathroom":2, ...}}'

To make predictions for multiple instances simultaneously, use the Linear Regression ID and the new dataset ID containing the observations you want to predict.

curl "https://bigml.io/batchprediction?$BIGML_AUTH"
       -X POST
       -H 'content-type: application/json'
       -d '{"linearregression":"linearregression/5c762c6b4e17272d42000617",
            "dataset":"dataset/5c128e694e1727920d00000c",
            "output_dataset": true}'

If you want to learn more about Linear Regression please visit our release page for documentation on how to use Linear Regression with the BigML Dashboard and the BigML API. In case you still have some questions be sure to reach us at support@bigml.com anytime!

 

One comment

Leave a comment