
Logistic Regression versus Decision Trees

The question of which model type to apply to a Machine Learning task can be a daunting one given the immense number of algorithms available in the literature. It can be difficult to compare the relative merits of two methods, as one can outperform the other in a certain class of problems while consistently coming in behind for another class. In this post, the last in our series about Logistic Regression, we’ll explore the differences between Decision Trees and Logistic Regression for classification problems, and try to highlight scenarios where one might be recommended over the other.

Decision Boundaries

Logistic Regression and trees differ in the way that they generate decision boundaries, i.e., the lines that are drawn to separate different classes. To illustrate this difference, let’s look at the results of the two model types on the following 2-class problem:

Decision Trees recursively split the space into smaller and smaller regions, whereas Logistic Regression fits a single line to divide the space exactly in two. Of course, for higher-dimensional data these lines would generalize to planes and hyperplanes. A single linear boundary can sometimes be limiting for Logistic Regression: in this example, where the two classes are separated by a decidedly non-linear boundary, we see that trees can better capture the division, leading to superior classification performance. However, when classes are not well separated, trees are susceptible to overfitting the training data, so that Logistic Regression’s simple linear boundary generalizes better.

Lastly, the background color of these plots represents the prediction confidence. Each node of a Decision Tree assigns a constant confidence value to the entire region that it spans, leading to a rather patchwork appearance of confidence values across the space. On the other hand, prediction confidence for Logistic Regression can be computed in closed form for any arbitrary input coordinates, so we get an infinitely more fine-grained result, and confidence values we can place more trust in.

Interpretability

Although the last example was designed to give Logistic Regression a performance advantage, its resulting f-measure did not beat the Decision Tree’s by a huge margin. So what else is there to recommend Logistic Regression? Let’s look at the tree model view in the BigML web interface:


When a tree consists of a large number of nodes, it can require a significant amount of mental effort to comprehend all the splits that lead up to a particular prediction. In contrast, a Logistic Regression model is simply a list of coefficients:


At a glance, we are able to see that an instance’s y-coordinate is just over three times as important as its x-coordinate for determining its class, which is corroborated by the slope of the decision boundary from the previous section. An important caveat concerns scale: if, for example, x and y were given in units of meters and kilometers respectively, we should expect their coefficients to differ by a factor of 1000 in order to represent equal importance in a real-world, physical sense. Because Logistic Regression models are fully described by their coefficients, they are attractive to users who have some familiarity with their data, and are interested in knowing the influence of particular input fields on the objective.
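As a quick sanity check on that reading, here is a hedged sketch of the algebra, writing beta_0 for the intercept and beta_x, beta_y for the two coefficients. The decision boundary is the set of points where the model is indifferent between the two classes:

\[ \beta_0 + \beta_x x + \beta_y y = 0 \quad\Longrightarrow\quad \frac{dy}{dx} = -\frac{\beta_x}{\beta_y} \]

With beta_y roughly three times beta_x, the boundary’s slope has magnitude about one third, matching the corroboration mentioned above.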

Source Code

The code for this blog post consists of a WhizzML script to train and evaluate both Decision Tree and Logistic Regression models, plus a Python script which executes the WhizzML and draws the plots. You can view it on GitHub.

Learn more about Logistic Regression on our release page. You will find documentation on how to use Logistic Regression with the BigML Dashboard and the BigML API. You can also see the webinar slideshow and the other blog posts of this series about Logistic Regression.

Automating Logistic Regression Workflows


Continuing with our series of posts about Logistic Regression, in this fifth post we will focus on the point of view of a WhizzML user. WhizzML is BigML’s popular domain-specific language for Machine Learning, which provides programmatic support for all the resources you work with in BigML. You can use WhizzML scripts to create a Logistic Regression, or to create a prediction or batch prediction based on one.

Let’s begin with the easiest one: If you want to create a Logistic Regression with all the default values you just need to create a script with the following source code:

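A minimal sketch of what such a script can look like in WhizzML (the dataset ID here is hypothetical):

(define my-logistic
  ;; create a Logistic Regression from an existing dataset,
  ;; leaving every configuration option at its default value
  (create-logisticregression {"dataset" "dataset/57f007d21f386f5199000001"}))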

As BigML’s API is asynchronous, the create call will probably return a response before the Logistic Regression is fully built. Thus, if you want to use the Logistic Regression to make predictions, you should wait until the creation process has completed. If you want to block until the Logistic Regression is finished, you can use the “create-and-wait-logisticregression” directive.

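A sketch of the blocking variant, with the same hypothetical dataset ID:

(define my-logistic
  ;; same call as above, but returns only once the resource
  ;; has reached its finished state
  (create-and-wait-logisticregression {"dataset" "dataset/57f007d21f386f5199000001"}))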

To modify the default value of a Logistic Regression property, you simply add it to the properties map as a pair: “<property_name>” <property_value>. For instance, when a dataset contains missing values, BigML’s default behavior is to replace them by the mean. If you want to replace them by zero instead, you should add default_numeric_value and set it to “zero”. The source code will be as follows:

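A sketch of that call, again with a hypothetical dataset ID:

(define my-logistic
  ;; override one default: missing numeric values are replaced
  ;; by zero rather than by the field mean
  (create-logisticregression {"dataset" "dataset/57f007d21f386f5199000001"
                              "default_numeric_value" "zero"}))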

You can modify any configuration option in similar fashion. The BigML API documentation contains detailed information about those properties.

What if you have an existing Logistic Regression and you want to get the code needed to recreate it with WhizzML? No problem: programmer or not, BigML has a solution for you. Say you have already tuned a Logistic Regression in BigML and you want to repeat the process on a new source that you just uploaded to the service. You can easily use the scriptify utility, which generates a script that runs the exact steps needed to reproduce the Logistic Regression. Just navigate to the Logistic Regression you want to replicate and click on the “SCRIPTIFY LOGISTIC REGRESSION” link.


If you want to create a prediction from your Logistic Regression with WhizzML, the code is also short and easy. You just need the ID of the Logistic Regression you want to use and the values of the new instance that you want a prediction for, i.e., your input data. In the input_data map, field IDs are used as keys. Here’s an example:

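A sketch of such a call; the resource ID and the input field ID used as key are hypothetical:

(define my-prediction
  ;; predict for one new instance; input_data maps field IDs
  ;; (or field names) to that instance's values
  (create-prediction {"logisticregression" "logisticregression/57b0a3cd7e0a8d72e2000837"
                      "input_data" {"000002" 3.18}}))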

In case you need to predict not for a single instance but for a set of new instances, you will need to create a batch prediction from your Logistic Regression by using WhizzML.

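A sketch with hypothetical resource IDs, where the dataset holds the new instances to predict for:

(define my-batchprediction
  ;; score every instance of a dataset in a single call
  (create-batchprediction {"logisticregression" "logisticregression/57b0a3cd7e0a8d72e2000837"
                           "dataset" "dataset/57f007d21f386f5199000002"}))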

Once your source code is in place, how do you execute your script?

  • Using the BigML Dashboard, look for the script you just created. Opening the script view will reveal the available inputs, and you will be able to select their new values, after which you can start the execution. For instance, the first script in this post looks as follows, and it expects you to select the dataset you want to create the Logistic Regression from.
  • If you want to execute the script through the API, you need to know the ID of the script you previously created. Following the same example, the dataset you want to create the Logistic Regression from (input “ds1”) should be included in the list of inputs. The corresponding request to the BigML API should be as below:

    curl "https://bigml.io/execution?$BIGML_AUTH"
           -X POST
           -H 'content-type: application/json'
           -d '{"script": "script/55f007d21f386f5199000003",
                "inputs": [["ds1", "dataset/55f007d21f386f5199000000"]]}'

These Logistic Regressions should execute swiftly, while you reach out for your coffee.

If you have any doubts or want to learn more about Logistic Regression, please check out our release page for documentation on how to use Logistic Regression with the BigML Dashboard and the BigML API. You can also see the webinar slideshow and the other blog posts of this series about Logistic Regression.

Programming Logistic Regressions

In this post, the fourth one of our Logistic Regression series, we want to provide a brief summary of all the necessary steps to create a Logistic Regression using the BigML API. As we mentioned in our previous posts, Logistic Regression is a supervised learning method to solve classification problems, i.e., the objective field must be categorical and it can consist of two or more different classes.

The API workflow to create a Logistic Regression and use it to make predictions is very similar to the one we explained for the Dashboard in our previous post. It’s worth mentioning that any resource created with the API will automatically be created in your Dashboard too, so you can take advantage of BigML’s intuitive visualizations at any time.


If you have never used the BigML API before, note that all requests to manage your resources must use HTTPS and be authenticated with your username and API key to verify your identity. Find below a base URL example to manage Logistic Regressions.

https://bigml.io/logisticregression?username=$BIGML_USERNAME;api_key=$BIGML_API_KEY

You can find your authentication details in your Dashboard account by clicking on the API Key icon in the top menu.


Ok, time to create a Logistic Regression from scratch!

1. Upload your Data

You can upload your data, in your preferred format, from a local file, a remote file (using a URL) or from your cloud repository, e.g., AWS, Azure, etc. This will automatically create a source in your BigML account.

First, you need to open up a terminal with curl or any other command-line tool that implements standard HTTPS methods. In the example below we are creating a source from a remote CSV file containing patient data, each row representing one patient’s information.

curl "https://bigml.io/source?$BIGML_AUTH"
       -X POST
       -H 'content-type: application/json'
       -d '{"remote": "https://static.bigml.com/csv/diabetes.csv"}'

2. Create a Dataset

After the source is created, you need to build a dataset, which serializes your data and transforms it into a suitable input for the Machine Learning algorithm.

curl "https://bigml.io/dataset?$BIGML_AUTH"
       -X POST
       -H 'content-type: application/json'
       -d '{"source":"source/68b5627b3c1920186f000325"}'

Then, split your recently created dataset into two subsets: one for training the model and another for testing it. It is essential to evaluate your model with data that the model hasn’t seen before. You need to do this in two separate API calls that create two different datasets.

  • To create the training dataset, you need the original dataset ID and the sample_rate (the proportion of instances to include in the sample) as arguments. In the example below we are including 80% of the instances in our training dataset. We also set a particular seed argument to ensure that the sampling is deterministic. This ensures that the instances selected for the training dataset will never be part of a test dataset created with the same sampling hold-out.

curl "https://bigml.io/dataset?$BIGML_AUTH"
       -X POST
       -H 'content-type: application/json'
       -d '{"dataset":"dataset/68b5627b3c1920186f000325", 
            "sample_rate":0.8, "seed":"foo"}'
  • For the testing dataset, you also need the original dataset ID and the sample_rate, but this time combined with the out_of_bag argument. The out of bag takes the remaining (1 - sample_rate) fraction of instances, in this case 1 - 0.8 = 0.2. Using those two arguments along with the same seed used to create the training dataset, we ensure that the training and testing datasets are mutually exclusive.

curl "https://bigml.io/dataset?$BIGML_AUTH"
       -X POST
       -H 'content-type: application/json'
       -d '{"dataset":"dataset/68b5627b3c1920186f000325", 
            "sample_rate":0.8, "out_of_bag":true, "seed":"foo"}'

3. Create a Logistic Regression

Next, use your training dataset to create a Logistic Regression. Remember that the field you want to predict must be categorical. BigML takes the last valid field in your dataset as the objective field by default; if that field is not categorical and you didn’t specify another objective field, the Logistic Regression creation will throw an error. In the example below, we are creating a Logistic Regression with an argument that indicates the objective field. To specify the objective field you can use either the field name or the field ID:

curl "https://bigml.io/logisticregression?$BIGML_AUTH"
       -X POST
       -H 'content-type: application/json'
       -d '{"dataset":"dataset/98b5527c3c1920386a000467", 
            "objective_field":"diabetes"}'

You can also configure a wide range of the Logistic Regression parameters at creation time. Read about all of them in the API documentation.

Strictly speaking, Logistic Regression can only handle numeric fields as inputs, but BigML automatically performs a set of transformations so that it can also support categorical, text and items fields. BigML uses one-hot encoding by default, but you can configure other types of transformations using the different encoding options provided.

4. Evaluate the Logistic Regression

Evaluating your Logistic Regression is key to measuring its predictive performance against unseen data. Logistic Regression evaluations yield the same confusion matrix and metrics as any other classification model: precision, recall, accuracy, phi-measure and f-measure. You can read more about these metrics here.

You need the logistic regression ID and the testing dataset ID as arguments to create an evaluation using the API:

curl "https://bigml.io/evaluation?$BIGML_AUTH"
       -X POST
       -H 'content-type: application/json'
       -d '{"logisticregression":"logisticregression/50650bea3c19201b64000024",
            "dataset":"dataset/98b5527c3c1920386a000467"}'

Check the evaluation results and rerun the process by trying other parameter configurations and new features that may improve the performance. There is no general rule of thumb as to when a model is “good enough”. It depends on your context, e.g., domain, data limitations, current solution. For example, if you are predicting churn and you can currently predict only 30% of the churn, elevating that to 80% with your Logistic Regression would be a huge enhancement. However, if you are trying to diagnose cancer, 80% recall may not be enough to get the necessary approvals.

5. Make Predictions

Finally, once you are satisfied with your model’s performance, use your Logistic Regression to make predictions by feeding it new data. Logistic Regression in BigML can gracefully handle missing values for your categorical, text or items fields. This also holds true for numeric fields, as long as you have trained the model with missing_numerics=true (which is the default), otherwise instances with missing values for numeric fields will be dropped.

In BigML you can make predictions for a single instance or multiple instances (in batch). See below an example for each case.

To predict one new data point, just input the values for the fields used by the Logistic Regression to make your prediction. In turn, you get a probability for each of your objective field classes; the class with the highest probability is the one predicted. All class probabilities sum up to 100%.

curl "https://bigml.io/prediction?$BIGML_AUTH"
       -X POST
       -H 'content-type: application/json'
       -d '{"logisticregression":"logisticregression/50650bea3c19201b64000024",
            "input_data":{"age":58, "bmi":36, "plasma glucose":180}}'

To make predictions for multiple instances simultaneously, use the logistic regression ID and the ID of the dataset containing the new observations you want predictions for. You can configure your batch prediction so the final output file also contains the probabilities for all your classes besides the predicted one.

curl "https://bigml.io/batchprediction?$BIGML_AUTH"
       -X POST
       -H 'content-type: application/json'
       -d '{"logisticregression":"logisticregression/50650bea3c19201b64000024",
            "dataset":"dataset/79c5834d6k2920386a000357",
            "probabilities": true}'

In the next post we will explain how to use Logistic Regression with WhizzML, which will complete our series. One more to go…

If you want to learn more about Logistic Regression please visit our release page for documentation on how to use Logistic Regression with the BigML Dashboard and the BigML API. You can also see the webinar slideshow and the other blog posts of this series about Logistic Regression.

Predicting Airbnb Prices with Logistic Regression

This is the third post in the series that covers BigML’s Logistic Regression implementation, which gives you another method to solve classification problems, i.e., predicting a categorical value such as “churn / not churn”, “fraud / not fraud”, “high/medium/low” risk, etc. As usual, BigML brings this new algorithm with powerful visualizations to effectively analyze the key insights from your model results. This post demonstrates this popular classification technique via a use case that predicts housing rental prices based on a simplified version of this Airbnb public dataset.

The Data

The dataset contains information about more than 13,000 different accommodations in Amsterdam and includes variables like room type, description, neighborhood, latitude, longitude, minimum stays, number of reviews, availability, and price.


By definition, Logistic Regression only accepts numeric fields as inputs, but BigML applies a set of automatic transformations to support all field types so you don’t have to waste precious time encoding your categorical and text data yourself.

Since the price is a numeric field, and Logistic Regression only works for classification problems, we discretize the target variable into two main categories: cheap prices (< €100 per night) and expensive prices (>= €100 per night).

Finally, we perform some feature engineering, like calculating the distance from downtown using the latitude and longitude data, in 1-click thanks to a WhizzML script that will soon be published in the BigML Gallery. Incredibly easy!

Let’s dive in!

The Logistic Regression

Creating a Logistic Regression is very easy, especially when using the 1-click Logistic Regression option. (Alternatively, you may prefer the configuration option to tune various model parameters.) After a short wait… voilà! The model has been created, and now you can visually inspect the results with both a two-fold chart (1D and 2D) and the coefficients table.

The Chart

The Logistic Regression chart allows you to visually interpret the influence of one or more fields on your predictions. Let’s see some examples.

In the image below, we selected the distance (in meters) from downtown for the x-axis. As you might expect, the probability of an accommodation being cheap (blue line) increases as the distance increases, while the probability of it being expensive (orange line) decreases. At some point (around 8 kilometers) the slope softens and the probabilities tend to become constant.

Following the same example, you can also see the combined influence of other field values by using the input fields form to the right. See in the images below the impact of the room type on the relationship between distance and price. When “Shared room” is selected and the accommodation is 3 kilometers from downtown, there is a 75% probability for the cheap class. However, if we select “Entire home/apt”, given the same distance, there is an 83% probability of finding an expensive rental.

The combined impact of two fields on predictions can be better visualized in the 2D chart by clicking on the green switch at the top. A heat map containing the class probabilities appears, and you can select the input fields for both axes. The image below shows the great difference in cheap and expensive price probabilities depending on the neighborhood, while the minimum nights feature on the x-axis seems to have less influence on the price.


You can also enter text and item values into the corresponding form fields on the right. Keeping the same input fields on the axes, see below the increase in the expensive class probability across all neighborhoods due to the presence of the word “houseboat” in the accommodation description.


The Coefficients Table

For more advanced users, BigML also displays a table where you can inspect all the coefficients for each of the input fields (rows) and each of the objective field classes (columns).

The coefficients can be interpreted in two ways:

  • Correlation direction: given an objective field class, a positive coefficient for a field indicates that higher values for that field will increase the probability of the class. By contrast, negative coefficients indicate a negative correlation between the field and the class probability.
  • Field importance: you can interpret the absolute values of coefficients as feature importances only in the case that all fields have the same magnitude and they are not multicollinear, i.e., no two or more fields are correlated with each other.
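Both readings follow from the log-odds form of the model. As a simplified two-class sketch, writing beta_j for the coefficient of input x_j:

\[ \log \frac{P(\text{expensive} \mid \mathbf{x})}{1 - P(\text{expensive} \mid \mathbf{x})} = \beta_0 + \sum_j \beta_j x_j \]

Increasing a field with a positive coefficient raises the log-odds, and hence the probability, of the class, while comparing coefficient magnitudes across fields is only fair when the fields share a common scale.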

In the example below, you can see the coefficient for the room type “Entire home/apt” is positive for the expensive class and negative for the cheap class, indicating the same behavior that we saw at the beginning of this post in the 1D chart.


Predictions

After evaluating your model, when you are finally satisfied with it, you can go ahead and start making predictions. BigML offers predictions for a single instance or for multiple instances (in batch).

In the example below, we are making a prediction for a new single instance: a private room located in Westerpark, with the word “studio” in the description and a minimum stay of 2 nights. The class predicted is cheap with a probability of 95.22% while the probability of being an expensive rental is just 4.78%.


We encourage you to check out the other posts in this series: the first post was about the basic concepts of Logistic Regression, the second post covered the six necessary steps to get started with Logistic Regression, this third post explains how to predict with BigML’s Logistic Regressions, the fourth and fifth posts will cover how to create a Logistic Regression with the BigML API and with WhizzML respectively, and finally, the sixth post will dive into the differences between Logistic Regression and Decision Trees. You can find all these posts in our release page, as well as more documentation on how to use Logistic Regression with the BigML Dashboard, the BigML API, and the complete webinar slideshow about Logistic Regression. 

Logistic Regressions: the 6 Steps to Predictions

BigML is bringing Logistic Regression to the Dashboard so you can solve complex classification problems with the help of powerful visualizations to inspect and analyze your results. Logistic Regression is one of the best-known supervised learning algorithms to predict binary or multi-class categorical values such as “True/False”, “Spam/ Not Spam”, “Offer A / Offer B / Offer C”, etc.

In this post we aim to take you through the 6 necessary steps to get started with Logistic Regression:


1. Uploading your Data

As usual, start by uploading your data to your BigML account. BigML offers several ways to do it: you can drag and drop a local file, connect BigML to your cloud repository (e.g., S3 buckets), or copy and paste a URL. BigML automatically identifies the field types. Field types and other source parameters can be configured by clicking on the source configuration option.

2. Create a Dataset

From your source view, use the 1-click dataset option to create a dataset, a structured version of your data ready to be used by a Machine Learning algorithm.


In the dataset view you will be able to see a summary of your field values, some basic statistics, and the field histograms to analyze your data distributions. This view is really useful for spotting any errors or irregularities in your data. You can filter the dataset by several criteria and create new fields using different pre-defined operations.


Once your data is clean and free of errors, you can split your dataset into two different subsets: one for training your model, and the other for testing. It is crucial to train and evaluate your model with different data to ensure it generalizes well against unseen data. You can easily split your dataset using the BigML 1-click option, which randomly sets aside 80% of the instances for training and 20% for testing.


3. Create a Logistic Regression

Now you are ready to create the Logistic Regression using your training dataset. You can use the 1-click Logistic Regression option, which will create the model using the default parameter values. If you are a more advanced user and you feel comfortable tuning the Logistic Regression parameters, you can do so by using the configure Logistic Regression option.

 

Find below a brief summary of each of the configuration parameters. If you want to learn more about them, please check the Logistic Regression documentation.

  • Objective field: select the field you want to predict. By default BigML will take the last valid field in your dataset. Remember it must be categorical!

  • Default numeric value: if your numeric fields contain missing values, you can easily replace them by the field mean, median, maximum, minimum or zero using this option. It is inactive by default.

  • Missing numerics: if your numeric fields contain missing values but you think they carry meaning for predicting the objective field, you can use this option to include them in the model. Otherwise, instances with missing numerics will be ignored. It is active by default.

  • Eps: set the value of the stopping criterion for the solver. Higher values can make the model faster, but they may result in poorer predictive performance. You can set a float value between 0 and 1. It is set to 0.0001 by default.

  • Bias: include or exclude the intercept in the Logistic Regression formula. Including it yields better results in most cases. It is active by default.

  • Auto-scaled fields: automatically scale your fields so they all have the same magnitudes. This will also allow you to compare the field coefficients learned by the model afterwards. It is active by default.

  • Regularization: prevent the model from overfitting by using a regularization factor. You can choose between L1 and L2 regularization. The former usually gives better results. You can also tweak the inverse of the regularization strength.

  • Field codings: select the encoding option that works best for your categorical fields. BigML will automatically transform your categorical values into 0-1 variables to support non-numeric fields as inputs, a method known as one-hot encoding. Alternatively, you can choose among three other types of coding: dummy coding, contrast coding and other coding. You can find a detailed explanation of each one in the documentation.

  • Sampling options: if you have a very large dataset, you may not need all the instances to create the model. BigML allows you to easily sample your dataset at model creation time. (A programmatic sketch of these options follows right after this list.)
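For reference, these same options map onto API property names when you create the model programmatically. Below is a hedged WhizzML sketch with a hypothetical dataset ID; the property names match the ones discussed in this series, except "regularization", which is an assumed name you should check against the API documentation:

(define my-logistic
  ;; a configured Logistic Regression; "regularization" is an
  ;; assumed property name, the others are documented in this series
  (create-logisticregression {"dataset" "dataset/57506fa2b95b3f23650001b0"
                              "objective_field" "diabetes"
                              "default_numeric_value" "zero"
                              "missing_numerics" true
                              "bias" true
                              "eps" 0.0001
                              "regularization" "l1"}))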

At this point you may be wondering… ok, so which parameter values should I use?

Unfortunately, there is no universal answer to that. It depends on the data, the domain, and the use case you are trying to solve. Our recommendation is that you try to understand the strengths and weaknesses of your model and iterate, trying different features and configurations. To do this, the model visualizations explained in the next section play an essential role.

4. Analyze your Results

When your Logistic Regression has been created, you can use BigML’s insightful visualizations to dive into the model results and see the impact of your features on its predictions. Take into account that, most of the time, the greatest gains in performance come from feature selection and feature engineering, which can be the most time-consuming part of the Machine Learning process. Analyzing the results carefully and inspecting your model to understand the reasons behind its predictions is key to validating the findings against expert opinion.

BigML provides 1D and 2D charts and the coefficients table to analyze your results.

1D and 2D Chart

The Logistic Regression chart provides a visual way to analyze the impact of one or more fields on predictions.

For the 1D chart, you can select one input field for the x-axis. In the prediction legend to the right, you will see the objective class predictions change as you mouse over the chart area.


For the 2D chart, you can select two input fields, one per axis, and the objective class predictions will be plotted as a color heat map.


By setting the values for the rest of the input fields using the form below the prediction legend, you will be able to inspect the combined interaction of multiple fields on predictions.

Coefficients table

BigML also provides a table displaying the coefficients learned by the Logistic Regression. Each coefficient has an associated field (e.g., checking_status) and objective field class (e.g., bad, good, etc.). A positive coefficient indicates a positive correlation between the input field and the objective field class, while a negative coefficient indicates a negative relationship.


To find out more about interpreting the Logistic Regression chart and coefficients table, see the next blog post of this series: Predicting Airbnb Prices with BigML Logistic Regression.

5. Evaluate the Logistic Regression

Like any supervised learning method, Logistic Regression needs to be evaluated. Just click on the evaluate option in the 1-click menu and BigML will automatically select the remaining 20% of the dataset that you set aside for testing.


The resulting performance metrics to be analyzed are the same ones as for any other classifier predicting a categorical value.

You will get the confusion matrix containing the true positives, false positives, true negatives and false negatives along with the classification metrics: precision, recall, accuracy, f-measure and phi-measure. For a full description of the confusion matrix and classification measures see the corresponding documentation.
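For reference, the headline metrics follow directly from the confusion matrix counts of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN):

\[ \text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}, \qquad \text{f-measure} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \]

Accuracy is simply the fraction of all instances classified correctly, (TP + TN) / (TP + FP + TN + FN).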


Rinse and repeat! As we mentioned at the end of step 3, repeat steps 3 to 5, trying out different configurations, different features, etc., until you have a good enough model.

6. Make Predictions

When you finally reach a satisfying model performance, you can start making predictions with it. In BigML, you can make predictions for a new single instance or for multiple instances in batch. Let’s take a quick look at both of them!

Single predictions

Click on the Predict option and set the values for your input fields.


A form containing all your input fields will be displayed and you will be able to set the values for a new instance. At the top of the view you will see the objective class probabilities changing as you change your input field values.


Batch predictions

Use the Batch Prediction option in the 1-click menu and select the dataset containing the instances for which you want to know the objective field value.


You can configure several parameters of your batch prediction, such as including all class probabilities in the batch prediction output dataset and file. When your batch prediction finishes, you will be able to download the CSV file and see the output dataset.


In the next post we will cover a real use case, using Logistic Regression to predict Airbnb prices, to delve deeper into the interpretation of Logistic Regression results.

If you want to learn more about Logistic Regression please visit our release page for documentation on how to use Logistic Regression with the BigML Dashboard and the BigML API. You can also see the webinar slideshow and the other blog posts of this series about Logistic Regression.


BigML for Alexa: The First Voice Controlled Predictive Assistant



 

The promise of voice recognition has been around for a long time, but the reality has always been quite miserable. In fact, just back in 2012 my daughter and I were helping my mother purchase a new car. I paired my phone to the in-car audio and tried to dial home. After several attempts, we were nearly in tears from laughing at how impossibly bad it was at recognizing the spoken phone number.

However, in the last few years, advances in Machine Learning have improved the capability of voice recognition dramatically; see for example the section about the history of Siri here. Even more importantly, the availability of voice recognition APIs like Amazon’s Alexa Voice Service have made it possible for the rapid adoption of voice controlled applications.

But what about that moment in Star Trek IV: The Voyage Home, when Scotty not only expects to be able to speak to the computer, but to have the computer reply intelligently? To get there, we need not only to rely on Machine Learning for voice recognition, but to bring voice recognition to Machine Learning applications!

As of today, we are one step further along that path with the introduction of the BigML for Alexa skill.

The BigML for Alexa skill combines the predictive power of BigML with the voice processing capabilities of the Alexa Voice Service. Using an Alexa enabled device like an Amazon Echo or Dot, this integration makes it possible to use spoken questions and answers to generate predictions using your own models trained in BigML.

For example, if you have data regarding wine sales with features like the sale month and grape variety, you could build a model which predicts the sales for a given month, variety, etc. With this model loaded into the BigML for Alexa skill, you could generate a sales prediction by answering questions vocally.

 

If you already have an AVS device like the Amazon Echo, you can quickly get a feel for the capabilities of the BigML Alexa skill in two steps:

First, enable the skill with:

“Alexa, enable the Big M. L. skill”

Then you can run a demo with:

“Alexa, ask Big M. L. to give me a demo”

This will load a model which asks questions about a patient’s diagnostic measurements, like the 4-hour plasma glucose and BMI, and uses your answers to make a prediction about the likelihood of that individual having diabetes. Of course, keep in mind that this is only a demo and not medical advice!

If you want to try the BigML Alexa skill with your own BigML models, you just need to link the skill to your BigML account:


And then ask to load the latest model with

“Alexa, ask Big M. L. to load the latest model”.

This will load your most recently created model and launch a prediction.

As you start to play with your own models, you may run into some quirks with how field names are spoken, especially if they have punctuation or abbreviations. No worries – you can control how the fields are spoken using the labels and descriptions in your BigML dataset.

How to do this, and lots of other tips and tricks, can be found in the BigML for Alexa documentation.

Now we just need the formula for transparent aluminum!

Introduction to Logistic Regression

With this Summer 2016 Release, BigML is bringing Logistic Regression to the Dashboard, a very popular supervised Machine Learning method for solving classification problems. This upcoming release is the perfect occasion to guide you through Logistic Regression step by step, which is why we are presenting several blog posts to introduce you to this Machine Learning method.

In this first post you will get a general overview of what Logistic Regression is. In the coming days, we will complement it with five more entries: a second post that will take you through the six necessary steps to get started with Logistic Regression, a third post about how to make predictions with BigML’s Logistic Regression, a fourth post about how to create a Logistic Regression using the BigML API, a fifth one that will explain the same process using WhizzML instead, and finally, a sixth post that will analyze the differences between Logistic Regression and Decision Trees.

Let’s get started with Logistic Regression!


Why Logistic Regression?

Before Machine Learning hit the scene, the go-to tool for statistical modelling was regression analysis. Regressions aim to model the behavior of an objective variable as a combination of effects from a number of predictor variables. Among these tried-and-true techniques is Logistic Regression, originally developed by statistician David Cox in 1958. Logistic Regression is used to solve classification problems, where the objective is a categorical variable. Let’s see a simple example.


The dataset we’ll work with contains health statistics from 768 people belonging to the Pima Native American ethnic group. Our objective is to model the effect of an individual’s plasma glucose level on whether that individual contracts diabetes. The above scatterplot shows the plasma glucose levels of individuals with and without diabetes. At a glance, we can see that some relationship exists, where higher levels of plasma glucose are indicative of having diabetes. Note that while the x-axis is numeric in the scatterplot, the y-axis is categorical. This means we’ll need to apply a transformation before we can encode this relationship numerically.

Rather than relating glucose levels directly to true/false values, we model the probability of diabetes as a function of plasma glucose. Speaking in terms of probability is appropriate because, as we can see in the above graph, there is no clear-cut threshold on plasma glucose beyond which we can say a person will have diabetes. Most individuals with plasma glucose below 80 mg/dL do not have diabetes, while most with levels above 180 mg/dL have diabetes. Within that range, however, there is a significant amount of overlap. We need a way to express this interval of fuzziness flanked by two zones of certainty. For this purpose, we’ll use a function called the logistic function.


In this single-predictor example, our regression function is characterized by only two parameters: the slope of the transition, and where the transition point is located along the x-axis. Fitting a logistic regression simply means learning the values of these parameters. The ability to encapsulate the model in only two numbers is one of the main selling points of Logistic Regression. Check out these slides from the Valencian Summer School in Machine Learning 2016 for more details and examples.
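Concretely, a minimal sketch of this single-predictor model, writing g for an individual’s plasma glucose level:

\[ P(\text{diabetes} \mid g) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 g)}} \]

Here beta_1 governs the slope of the transition and the transition point sits at g = -beta_0 / beta_1; the output always lies between 0 and 1, so it reads directly as a probability.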

Having said that, in this first post we won’t go too deep into Logistic Regression. Sometimes simplicity is the key to understanding the basic concepts:

  • The aim of a Logistic Regression is to model the probability that an event occurs depending on the values of the independent variables.
  • A Logistic Regression estimates the probability that an event occurs for a randomly selected observation versus the probability of this event not occurring at all.
  • A Logistic Regression classifies observations by estimating the probability that an observation is in a particular category.

These videos from Brandon Foltz offer a deeper dive into the essence of Logistic Regression:

 

Want to know more about Logistic Regression?

Check out our release page for documentation on how to use Logistic Regression with the BigML Dashboard and the BigML API. You can also see the webinar slideshow and the other blog posts of this series about Logistic Regression. 

BigML Summer 2016 Release and Webinar: Logistic Regression and more!

BigML’s Summer 2016 Release is here! Join us on Wednesday, September 28, 2016 at 10:00 AM US Pacific Time (Portland, Oregon. GMT -07:00) / 7:00 PM CET (Valencia, Spain. GMT +02:00) for a FREE live webinar to learn about the newest version of BigML. We’ll be diving into Logistic Regression, one of the most popular supervised Machine Learning methods for solving classification problems.


Last Fall we launched Logistic Regressions in the BigML API to let you easily create and download models to your environment for fast, local predictions. With this Summer Release, we go a step further by bringing Logistic Regression to the BigML Dashboard. This new and intuitive Dashboard visualization includes a chart and a coefficients table. The former lets you analyze the impact of an input field on the objective field predictions, whereas the table shows all the coefficients learned for each of the logistic function variables, ideal for inspecting model results and debugging tasks.


You can plot the impact of the input fields in either one or two dimensions; simply select the desired option with the green slider. In the right-hand side legend you will see the class probabilities change according to the input fields selected as the axes. You can also set the values for the rest of the input fields using the form below the legend.


The ultimate goal of creating a Logistic Regression is to make predictions with it. You can easily predict single instances using the BigML prediction form: just input the values for the fields used by the Logistic Regression and you will get an immediate response with the predicted class and its probability. BigML also provides the probabilities for the rest of the classes in the objective field in a visual histogram that changes in real time as you configure the input field values.


In addition to its commercial activities, BigML plays an active role in promoting Machine Learning in education. With special offers, our education program is rapidly expanding thanks to the participation of top universities around the world. We would like to spread the word to more students, professors and academic researchers with your help, so please feel free to refer your fellow educators.

Are you ready to discover all you can do with Logistic Regressions? Join us on Wednesday, September 28, 2016 at 10:00AM PDT (Portland, Oregon. GMT -07:00) / 7:00 PM CET (Valencia, Spain. GMT +02:00). Be sure to reserve your free spot today as space is limited! We will also be giving away BigML t-shirts to those who submit questions during the webinar. Don’t forget to request yours!

2nd Valencian Summer School in Machine Learning: Done and Dusted!

This week over 140 attendees, representing 53 companies and 21 academic organizations from 19 countries, gathered in Valencia to get their hands dirty with a curriculum jam-packed with practical Machine Learning techniques and case studies that they can put to good use where they work or teach.


The regularly scheduled sessions were capped with an additional surprise talk from BigML’s Strategic Advisor, Professor Geoff Webb, about Multiple Test Correction for Streams and Cascades of Statistical Hypothesis Tests, which he has been developing as part of his recent academic research.

The diverse backgrounds of the attendees and their active participation and willingness to absorb Machine Learning knowledge have jointly served to prove that the proverbial Machine Learning genie is out of the bottle, never again to be solely confined to small academic and scientific circles. The writing is already on the wall. In today’s knowledge economy, driven increasingly by smart applications, Machine Learning is no longer an elective. Rather, it’s one of the main courses to be mastered by developers, engineers, information technology professionals, analysts, and even hands-on functional specialists from areas as varied as marketing, sales, supply chain, operations, finance or human resources. We thank all of our graduates for their enthusiasm as well as their valuable feedback, which taught us a few new things in the process.


As the BigML family, we wish to stay connected for new editions of our training events to be held in larger and larger venues! THANK YOU VERY MUCH!

Introducing the Artificial Intelligence Startup Battle in Boston on October 12 at PAPIs ‘16

Telefónica Open Future_, Telefónica’s startup accelerator that helps the best entrepreneurs grow and build successful businesses, and PAPIs.io invite you to participate in the Artificial Intelligence Startup Battle of PAPIs ‘16, the 3rd International Conference on Predictive Applications and APIs, to be held in Boston on October 12 at the Microsoft New England Research and Development Center.


Artificial Intelligence (AI) has a track record of improving the way we make decisions. So why not use it to decide which startups to invest in, and take advantage of all the startup data that is available? The AI Startup Battle, powered by PreSeries (a joint venture between BigML and Telefónica Open Future_), is a unique experience you don’t want to miss, where you’ll witness real-world, high-stakes AI.

As an early stage startup, you will enjoy a great opportunity to secure seed investment and get press coverage in one of the technology capitals of the world. Attendees, on the other hand, will discover disruptive innovation from the most promising startups in AI, as the winner will be chosen by an impartial algorithm that evaluates startups’ chances of success based on signals derived from decades of entrepreneurial undertakings.

Want to compete in the AI Startup Battle?

If you are a startup with applied AI and Machine Learning as a core component of your offering, then we’ll be happy to meet you! Submit your application and if you are selected, you’ll be able to pitch on stage, make connections at the conference, and get unique exposure among a highly distinguished audience.

Five Artificial Intelligence startups will be selected to present their projects on stage on October 12 at the closing of the PAPIs ‘16 conference. They will be automatically judged by an application that uses a Machine Learning algorithm to predict the probability of success of a startup, without human intervention.

The five startups selected to present will get a free exhibitor package at PAPIs worth $4,000 each.

The winner of the battle will be invited to Telefónica Open Future_’s acceleration program and will receive funding of up to $50,000. The winner will not only enjoy an incredible place to work but also access to mentors, business partners, a global network of talent as well as the opportunity to reach millions of Telefónica customers.

NOTE: During the acceleration program, part of the startup team should be working from one of the countries where Telefónica operates (i.e., Argentina, Brazil, Chile, Colombia, Germany, Mexico, Peru, Spain, United Kingdom, or Venezuela).

Disrupting early stage technology investments

It only seems like yesterday that Telefónica Open Future_ sponsored the world’s first AI Startup Battle last March in Valencia, where we first introduced PreSeries to the world. That event was the warm-up round for the upcoming one in Boston. The Madrid-based winner, Novelti, which scored a competition-high 86 points, will now take on a new set of competitors in Boston, as we find out which are the most promising early stage startups in Artificial Intelligence and Machine Learning in North America, Europe and the rest of the world.

As a bonus, the minds behind PreSeries will take center stage to speak about their technology architecture, e.g., the supporting data, the training of the model, and its evaluation framework. It is a rare opportunity to find out what goes on behind the scenes in delivering this innovative real-life predictive application.

Startup Battle highlights:

WHAT:

  • AI Startup Battle.

WHEN:

  • Wednesday, October 12, 2016 from 5:00 PM to 6:30 PM (EDT).

WHERE:

  • Microsoft New England Research and Development Center – 1 Memorial Dr #1 1st floor, Cambridge, Massachusetts, 02142. USA.

APPLICATION PROCESS:

  • To apply to present at the battle, please fill out the application form before September 29th. Spots are limited and will be awarded on a first-come, first-served basis.

ATTENDANCE:

CO-ORGANIZED BY:

  • Telefónica Open Future_ and PAPIs.io.

PAPIs ’16

The setting of the AI Startup Battle could not be more innovative. The battle is part of PAPIs ’16, the 3rd International Conference on Predictive Applications and APIs. It is a community conference dedicated to real-world Machine Learning and related intelligent applications. Subject-matter experts and leading practitioners from around the world will fly to Boston to discuss new developments, opportunities and challenges in this rapidly evolving space. The conference features tutorials and talks for all levels of experience, and networking events to help you connect with speakers, exhibitors, and other attendees.

As the PAPIs conference series makes its debut in the United States, there are also some changes, including a pre-conference training day on October 10, 2016. The curriculum includes Operational Machine Learning with Open Source & Cloud Platforms, where participants will learn about the possibilities of Machine Learning and how to create predictive models from data, then operationalize and evaluate them.

Registration to PAPIs ’16 is separate. You can find out more details here.
