Programming Logistic Regressions

In this post, the fourth in our Logistic Regression series, we summarize the steps needed to create a Logistic Regression using the BigML API. As we mentioned in our previous posts, Logistic Regression is a supervised learning method for solving classification problems, i.e., the objective field must be categorical, and it can consist of two or more classes.

The API workflow to create a Logistic Regression and use it to make predictions is very similar to the one we explained for the Dashboard in our previous post. It’s worth mentioning that any resource created with the API will automatically be created in your Dashboard too, so you can take advantage of BigML’s intuitive visualizations at any time.

If you have never used the BigML API before, note that all requests to manage your resources must use HTTPS and be authenticated with your username and API key to verify your identity. Below is an example base URL for managing Logistic Regressions.

https://bigml.io/logisticregression?username=$BIGML_USERNAME;api_key=$BIGML_API_KEY

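You can find your authentication details in your Dashboard account by clicking on the API Key icon in the top menu. The curl examples in the rest of this post assume that a BIGML_AUTH environment variable holds the authentication string username=$BIGML_USERNAME;api_key=$BIGML_API_KEY.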
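As a minimal sketch of how the authenticated URLs in this post are put together, the Python snippet below joins a resource name with the authentication string. The helper names `bigml_auth` and `resource_url` are hypothetical (they are not part of any BigML library), and the sketch assumes the `username=...;api_key=...` format shown in the base URL above.

```python
import os

BASE_URL = "https://bigml.io"


def bigml_auth(username, api_key):
    """Return the authentication query string (hypothetical helper)."""
    return f"username={username};api_key={api_key}"


def resource_url(resource, auth):
    """Return the endpoint URL for a resource type such as 'logisticregression'."""
    return f"{BASE_URL}/{resource}?{auth}"


if __name__ == "__main__":
    # Credentials are read from the environment; the fallbacks are placeholders.
    auth = bigml_auth(
        os.environ.get("BIGML_USERNAME", "<your-username>"),
        os.environ.get("BIGML_API_KEY", "<your-api-key>"),
    )
    print(resource_url("logisticregression", auth))
```

Reading the credentials from environment variables keeps the API key out of scripts and shell history.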

Ok, time to create a Logistic Regression from scratch!

1. Upload your Data

You can upload your data, in your preferred format, from a local file, a remote file (using a URL), or your cloud repository (e.g., AWS or Azure). This will automatically create a source in your BigML account.

First, open a terminal with curl or any other command-line tool that implements standard HTTPS methods. In the example below, we create a source from a remote CSV file containing patient data, where each row represents one patient’s information.

curl "https://bigml.io/source?$BIGML_AUTH" \
       -X POST \
       -H 'content-type: application/json' \
       -d '{"remote": "https://static.bigml.com/csv/diabetes.csv"}'
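The same request can be sketched in Python with only the standard library. This is an illustration under assumptions, not an official client: `build_post` and `source_request` are hypothetical helper names, and the request is only constructed here, not sent.

```python
import json
import urllib.request


def build_post(url, payload):
    """Build (but do not send) a JSON POST request, mirroring the curl flags."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def source_request(auth, remote_url):
    """Request that asks BigML to create a source from a remote file."""
    return build_post(f"https://bigml.io/source?{auth}", {"remote": remote_url})


if __name__ == "__main__":
    req = source_request(
        "username=<your-username>;api_key=<your-api-key>",
        "https://static.bigml.com/csv/diabetes.csv",
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To actually send it, pass req to urllib.request.urlopen().
```

The later steps (dataset, Logistic Regression, evaluation, prediction) follow the same pattern with a different resource name and JSON body.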

2. Create a Dataset

After the source is created, you need to build a dataset, which serializes your data and transforms it into a suitable input for the Machine Learning algorithm.

curl "https://bigml.io/dataset?$BIGML_AUTH" \
       -X POST \
       -H 'content-type: application/json' \
       -d '{"source":"source/68b5627b3c1920186f000325"}'

Then, split your recently created dataset into two subsets: one for training the model and another for testing it. It is essential to evaluate your model with data that the model hasn’t seen before. You need to do this in two separate API calls that create two different datasets.

  • To create the training dataset, you need the original dataset ID and the sample_rate (the proportion of instances to include in the sample) as arguments. In the example below, we include 80% of the instances in our training dataset. We also set a seed argument to make the sampling deterministic. This ensures that the instances selected for the training dataset will never be part of the test dataset created from the same sampling holdout.

curl "https://bigml.io/dataset?$BIGML_AUTH" \
       -X POST \
       -H 'content-type: application/json' \
       -d '{"origin_dataset":"dataset/68b5627b3c1920186f000325", 
            "sample_rate":0.8, "seed":"foo"}'

  • For the testing dataset, you also need the original dataset ID and the sample_rate, but this time we combine them with the out_of_bag argument. Setting out_of_bag to true selects the instances that were not sampled, i.e., a proportion of (1 - sample_rate), in this case 1 - 0.8 = 0.2. Using those two arguments along with the same seed used to create the training dataset ensures that the training and testing datasets are mutually exclusive.

curl "https://bigml.io/dataset?$BIGML_AUTH" \
       -X POST \
       -H 'content-type: application/json' \
       -d '{"origin_dataset":"dataset/68b5627b3c1920186f000325", 
            "sample_rate":0.8, "out_of_bag":true, "seed":"foo"}'
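The two requests above differ only in the out_of_bag flag. The sketch below builds both bodies from one set of arguments and then uses a toy seeded sampler to show why sharing a seed yields complementary subsets. Both function names are hypothetical, and the toy sampler is an assumption for illustration only; it is not BigML's actual sampling algorithm.

```python
import random


def split_payloads(dataset_id, sample_rate=0.8, seed="foo"):
    """Return the (training, testing) request bodies for one holdout split."""
    common = {"origin_dataset": dataset_id, "sample_rate": sample_rate, "seed": seed}
    training = dict(common)
    testing = dict(common, out_of_bag=True)  # same sample, complementary rows
    return training, testing


def illustrate_split(rows, sample_rate, seed):
    """Toy seeded sampler (NOT BigML's algorithm): each row gets one
    reproducible random draw that decides whether it is in the sample."""
    rng = random.Random(seed)
    in_bag = [rng.random() < sample_rate for _ in rows]
    training = [row for row, keep in zip(rows, in_bag) if keep]
    testing = [row for row, keep in zip(rows, in_bag) if not keep]
    return training, testing


if __name__ == "__main__":
    train_body, test_body = split_payloads("dataset/68b5627b3c1920186f000325")
    print(train_body)
    print(test_body)
    train_rows, test_rows = illustrate_split(list(range(20)), 0.8, "foo")
    print(len(train_rows), "training rows,", len(test_rows), "testing rows")
```

Because the same seed reproduces the same draws, every row lands in exactly one of the two subsets.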

3. Create a Logistic Regression

Next, use your training dataset to create a Logistic Regression. Remember that the field you want to predict must be categorical. By default, BigML takes the last valid field in your dataset as the objective field. If that field is not categorical and you didn’t specify another objective field, the Logistic Regression creation will fail with an error. In the example below, we create a Logistic Regression with an argument that indicates the objective field. To specify the objective field, you can use either the field name or the field ID:

curl "https://bigml.io/logisticregression?$BIGML_AUTH" \
       -X POST \
       -H 'content-type: application/json' \
       -d '{"dataset":"dataset/98b5527c3c1920386a000467", 
            "objective_field":"diabetes"}'

You can also configure a wide range of Logistic Regression parameters at creation time. Read about all of them in the API documentation.

Logistic Regression ordinarily accepts only numeric fields as inputs, but BigML automatically applies a set of transformations so that it also supports categorical, text and items fields. BigML uses one-hot encoding by default, but you can configure other types of transformations using the different encoding options provided.
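To make the idea of one-hot encoding concrete, here is a generic sketch in plain Python. It shows the general technique only; the function name `one_hot` and the sorted column order are assumptions for illustration and do not describe BigML's internal implementation.

```python
def one_hot(values):
    """Encode categorical values as 0/1 indicator vectors.

    Returns the sorted list of categories (one output column each) and one
    row per input value with a single 1 in the column of its category.
    """
    categories = sorted(set(values))
    column = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[column[value]] = 1
        rows.append(row)
    return categories, rows


if __name__ == "__main__":
    categories, rows = one_hot(["red", "green", "red"])
    print(categories)
    for row in rows:
        print(row)
```

Each category becomes its own numeric column, which is what lets a model that expects numeric inputs work with categorical fields.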

4. Evaluate the Logistic Regression

Evaluating your Logistic Regression is key to measuring its predictive performance against unseen data. Logistic Regression evaluations yield the same confusion matrix and metrics as any other classification model: precision, recall, accuracy, phi-measure and f-measure. You can read more about these metrics here.

You need the logistic regression ID and the testing dataset ID as arguments to create an evaluation using the API:

curl "https://bigml.io/evaluation?$BIGML_AUTH" \
       -X POST \
       -H 'content-type: application/json' \
       -d '{"logisticregression":"logisticregression/50650bea3c19201b64000024",
            "dataset":"dataset/98b5527c3c1920386a000467"}'

Check the evaluation results and rerun the process, trying other parameter configurations and new features that may improve performance. There is no general rule of thumb as to when a model is “good enough”; it depends on your context, e.g., domain, data limitations, and current solution. For example, if you are predicting churn and you can currently identify only 30% of the churn, raising that to 80% with your Logistic Regression would be a huge improvement. However, if you are trying to diagnose cancer, 80% recall may not be enough to get the necessary approvals.
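For reference, the metrics named above can be computed from a binary confusion matrix with their standard textbook definitions. The sketch below is an illustration with made-up counts; `binary_metrics` is a hypothetical helper, not a BigML function, and it returns 0.0 whenever a denominator is zero.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Standard classification metrics for the positive class."""

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    accuracy = ratio(tp + tn, tp + fp + fn + tn)
    f_measure = ratio(2 * precision * recall, precision + recall)
    # Phi coefficient (also known as the Matthews correlation coefficient).
    phi = ratio(
        tp * tn - fp * fn,
        math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    )
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f_measure": f_measure,
        "phi": phi,
    }


if __name__ == "__main__":
    # Made-up counts: 40 true positives, 10 false positives,
    # 10 false negatives, 40 true negatives.
    for name, value in binary_metrics(40, 10, 10, 40).items():
        print(f"{name}: {value:.2f}")
```

With these counts the recall is 0.8, which matches the 80% figure discussed above: acceptable for some problems, insufficient for others.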

5. Make Predictions

Finally, once you are satisfied with your model’s performance, use your Logistic Regression to make predictions by feeding it new data. Logistic Regression in BigML can gracefully handle missing values for your categorical, text or items fields. This also holds true for numeric fields, as long as you have trained the model with missing_numerics=true (which is the default). Otherwise, instances with missing values for numeric fields will be dropped.

In BigML you can make predictions for a single instance or multiple instances (in batch). See below an example for each case.

To predict one new data point, just input the values for the fields used by the Logistic Regression. In return, you get a probability for each of your objective field classes. The class with the highest probability is the one predicted, and the class probabilities sum to 100%.

curl "https://bigml.io/prediction?$BIGML_AUTH" \
       -X POST \
       -H 'content-type: application/json' \
       -d '{"logisticregression":"logisticregression/50650bea3c19201b64000024",
            "input_data":{"age":58, "bmi":36, "plasma glucose":180}}'
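The rule for turning class probabilities into a prediction can be sketched as follows. The dictionary of probabilities is a made-up example rather than the actual structure of a BigML prediction response, and `predicted_class` is a hypothetical helper.

```python
import math


def predicted_class(probabilities):
    """Return the class with the highest probability.

    `probabilities` maps each class name to a probability; the values are
    expected to sum to 1 (i.e., 100%).
    """
    total = sum(probabilities.values())
    if not math.isclose(total, 1.0, abs_tol=1e-6):
        raise ValueError(f"class probabilities sum to {total}, expected 1.0")
    return max(probabilities, key=probabilities.get)


if __name__ == "__main__":
    # Made-up probabilities for a two-class objective field.
    print(predicted_class({"true": 0.73, "false": 0.27}))
```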

To make predictions for multiple instances simultaneously, use the logistic regression ID and the ID of the new dataset containing the observations you want to predict. You can configure your Batch Prediction so that the final output file contains the probabilities for all your classes in addition to the predicted class.

curl "https://bigml.io/batchprediction?$BIGML_AUTH" \
       -X POST \
       -H 'content-type: application/json' \
       -d '{"logisticregression":"logisticregression/50650bea3c19201b64000024",
            "dataset":"dataset/79c5834d6k2920386a000357",
            "probabilities": true}'

In the next post, we will explain how to use Logistic Regression with WhizzML, which will complete our series. One more to go…

If you want to learn more about Logistic Regression, please visit our release page for documentation on how to use Logistic Regression with the BigML Dashboard and the BigML API. You can also watch the webinar, see the slideshow, and read the other blog posts in this series about Logistic Regression.
