
Programming Logistic Regressions

In this post, the fourth one of our Logistic Regression series, we want to provide a brief summary of all the necessary steps to create a Logistic Regression using the BigML API. As we mentioned in our previous posts, Logistic Regression is a supervised learning method to solve classification problems, i.e., the objective field must be categorical and it can consist of two or more different classes.

The API workflow to create a Logistic Regression and use it to make predictions is very similar to the one we explained for the Dashboard in our previous post. It’s worth mentioning that any resource created with the API will automatically be created in your Dashboard too, so you can take advantage of BigML’s intuitive visualizations at any time.


If you have never used the BigML API before, note that all requests to manage your resources must use HTTPS and be authenticated with your username and API key to verify your identity. Find below a base URL example to manage Logistic Regressions: $BIGML_USERNAME;api_key=$BIGML_API_KEY

You can find your authentication details in your Dashboard account by clicking on the API Key icon in the top menu.


Ok, time to create a Logistic Regression from scratch!

1. Upload your Data

You can upload your data, in your preferred format, from a local file, a remote file (using a URL) or from your cloud repository e.g., AWS, Azure etc. This will automatically create a source in your BigML account.

First, you need to open up a terminal with curl or any other command-line tool that implements standard HTTPS methods. In the example below we are creating a source from a remote CSV file containing patient data, with each row representing one patient’s information.

curl "$BIGML_AUTH" \
       -X POST \
       -H 'content-type: application/json' \
       -d '{"remote": ""}'

2. Create a Dataset

After the source is created, you need to build a dataset, which serializes your data and transforms it into a suitable input for the Machine Learning algorithm.

curl "$BIGML_AUTH" \
       -X POST \
       -H 'content-type: application/json' \
       -d '{"source":"source/68b5627b3c1920186f000325"}'

Then, split your recently created dataset into two subsets: one for training the model and another for testing it. It is essential to evaluate your model with data that the model hasn’t seen before. You need to do this in two separate API calls that create two different datasets.

  • To create the training dataset, you need the original dataset ID and the sample_rate (the proportion of instances to include in the sample) as arguments. In the example below we are including 80% of the instances in our training dataset. We also set a particular seed argument to ensure that the sampling will be deterministic. This guarantees that the instances selected for the training dataset will never be part of the test dataset created as the hold-out of the same sample.

curl "$BIGML_AUTH" \
       -X POST \
       -H 'content-type: application/json' \
       -d '{"dataset":"dataset/68b5627b3c1920186f000325", 
            "sample_rate":0.8, "seed":"foo"}'
  • For the testing dataset, you also need the original dataset ID, and the sample_rate, but this time we combine it with the out_of_bag argument. The out of bag takes the (1- sample_rate) instances, in this case, 1-0.8=0.2. Using those two arguments along with the same seed used to create the training dataset, we ensure that the training and testing datasets are mutually exclusive.

curl "$BIGML_AUTH" \
       -X POST \
       -H 'content-type: application/json' \
       -d '{"dataset":"dataset/68b5627b3c1920186f000325", 
            "sample_rate":0.8, "out_of_bag":true, "seed":"foo"}'

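To see why reusing the same seed keeps the two datasets mutually exclusive, here is a minimal Python sketch. It only illustrates the idea and is not BigML's actual sampling implementation: the `in_sample` and `split` helpers and the hashing scheme are assumptions made for this example, and the resulting split is only approximately 80/20.

```python
import hashlib


def in_sample(row_index: int, seed: str, sample_rate: float) -> bool:
    """Deterministically decide whether a row falls inside the sample.

    Hypothetical helper: hashes the seed and row index to a number in [0, 1).
    """
    digest = hashlib.sha256(f"{seed}:{row_index}".encode()).hexdigest()
    return int(digest[:8], 16) / 16**8 < sample_rate


def split(n_rows: int, seed: str, sample_rate: float):
    """Return (training, testing) row indices; testing is the out-of-bag set."""
    training = [i for i in range(n_rows) if in_sample(i, seed, sample_rate)]
    testing = [i for i in range(n_rows) if not in_sample(i, seed, sample_rate)]
    return training, testing


training, testing = split(1000, "foo", 0.8)
print(len(training), len(testing))
```

Because the decision for each row depends only on the seed and the row, the out-of-bag rows are exactly the ones the training sample left out, and repeating the call with the same seed reproduces the same split.
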
3. Create a Logistic Regression

Next, use your training dataset to create a Logistic Regression. Remember that the field you want to predict must be categorical. BigML takes the last valid field in your dataset as the objective field by default. If that field is not categorical and you didn’t specify another objective field, the Logistic Regression creation will throw an error. In the example below, we are creating a Logistic Regression including an argument to indicate the objective field. To specify the objective field you can use either the field name or the field ID:

curl "$BIGML_AUTH" \
       -X POST \
       -H 'content-type: application/json' \
       -d '{"dataset":"dataset/98b5527c3c1920386a000467", 

You can also configure a wide range of the Logistic Regression parameters at creation time. Read about all of them in the API documentation.

Usually, Logistic Regressions can only handle numeric fields as inputs, but BigML automatically performs a set of transformations such that it can also support categorical, text and items fields. BigML uses one-hot encoding by default, but you can configure other types of transformations using the different encoding options provided.
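
As a rough illustration of what one-hot encoding does to a categorical field, consider the following Python sketch. The `one_hot` helper and the list of categories are hypothetical and only show the general technique, not BigML's internal transformation.

```python
def one_hot(value, categories):
    """Encode a categorical value as a list of 0/1 indicators, one per category."""
    return [1 if value == category else 0 for category in categories]


# Illustrative category list for a hypothetical "room type" field.
room_types = ["Entire home/apt", "Private room", "Shared room"]
encoded = one_hot("Private room", room_types)
print(encoded)  # [0, 1, 0]
```

Each category becomes its own numeric input, so the model learns one coefficient per category rather than treating the labels as ordered numbers.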

4. Evaluate the Logistic Regression

Evaluating your Logistic Regression is key to measuring its predictive performance against unseen data. Logistic Regression evaluations yield the same confusion matrix and metrics as any other classification model: precision, recall, accuracy, phi-measure and f-measure. You can read more about these metrics here.

You need the logistic regression ID and the testing dataset ID as arguments to create an evaluation using the API:

curl "$BIGML_AUTH" \
       -X POST \
       -H 'content-type: application/json' \
       -d '{"logisticregression":"logisticregression/50650bea3c19201b64000024",

Check the evaluation results and rerun the process by trying other parameter configurations and new features that may improve the performance. There is no general rule of thumb as to when a model is “good enough”. It depends on your context, e.g., domain, data limitations, current solution. For example, if you are predicting churn and you can currently predict only 30% of the churn, elevating that to 80% with your Logistic Regression would be a huge enhancement. However, if you are trying to diagnose cancer, 80% recall may not be enough to get the necessary approvals.

5. Make Predictions

Finally, once you are satisfied with your model’s performance, use your Logistic Regression to make predictions by feeding it new data. Logistic Regression in BigML can gracefully handle missing values for your categorical, text or items fields. This also holds true for numeric fields, as long as you have trained the model with missing_numerics=true (which is the default), otherwise instances with missing values for numeric fields will be dropped.

In BigML you can make predictions for a single instance or multiple instances (in batch). See below an example for each case.

To predict one new data point, just input the values for the fields used by the Logistic Regression to make your prediction. In turn, you get a probability for each of your objective field classes. The class with highest probability is the one predicted. All class probabilities must sum up to 100%.

curl "$BIGML_AUTH" \
       -X POST \
       -H 'content-type: application/json' \
       -d '{"logisticregression":"logisticregression/50650bea3c19201b64000024",
            "input_data":{"age":58, "bmi":36, "plasma glucose":180}}'

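The rule "the class with the highest probability is the one predicted" can be sketched in a few lines of Python. The response format is not shown in this post, so the `probabilities` dictionary below is a hypothetical stand-in, and `predicted_class` is a helper written for this example.

```python
def predicted_class(probabilities):
    """Return the (class, probability) pair with the highest probability."""
    total = sum(probabilities.values())
    if abs(total - 1.0) > 1e-6:
        raise ValueError(f"class probabilities sum to {total}, expected 1.0")
    label = max(probabilities, key=probabilities.get)
    return label, probabilities[label]


# Hypothetical probabilities for a two-class objective field.
probabilities = {"true": 0.27, "false": 0.73}
print(predicted_class(probabilities))  # ('false', 0.73)
```
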
To make predictions for multiple instances simultaneously, use the logistic regression ID and the new dataset ID containing the observations you want to predict. You can configure your batchprediction so the final output file also contains the probabilities for all your classes besides the class predicted.

curl "$BIGML_AUTH" \
       -X POST \
       -H 'content-type: application/json' \
       -d '{"logisticregression":"logisticregression/50650bea3c19201b64000024",
            "probabilities": true}'

In the next post we will explain how to use Logistic Regression with WhizzML, which will complete our series.  One more to go…

If you want to learn more about Logistic Regression, join the free live webinar next Wednesday, September 28, 2016 at 10:00AM US Pacific Time (Portland, Oregon / GMT -07:00) / 7:00 PM CET (Valencia, Spain. GMT +02:00). Register now, space is limited!

Predicting Airbnb Prices with Logistic Regression

This is the third post in the series that covers BigML’s Logistic Regression implementation, which gives you another method to solve classification problems, i.e., predicting a categorical value such as “churn / not churn”, “fraud / not fraud”, “high/medium/low” risk, etc. As usual, BigML brings this new algorithm with powerful visualizations to effectively analyze the key insights from your model results. This post demonstrates this popular classification technique via a use case that predicts the housing rental prices based on a simplified version of this Airbnb public dataset.

The Data

The dataset contains information about more than 13,000 different accommodations in Amsterdam and includes variables like room type, description, neighborhood, latitude, longitude, minimum stays, number of reviews, availability, and price.


By definition, Logistic Regression only accepts numeric fields as inputs, but BigML applies a set of automatic transformations to support all field types so you don’t have to waste precious time encoding your categorical and text data yourself.

Since the price is a numeric field, and Logistic Regression only works for classification problems, we discretize the target variable into two main categories: cheap prices (< €100 per night) and expensive prices (>= €100 per night).
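
The discretization described above amounts to a single threshold rule. A minimal Python sketch, using the €100 cut-off from the text (the `price_category` helper and the sample prices are made up for illustration):

```python
def price_category(price_per_night: float, threshold: float = 100.0) -> str:
    """Map a nightly price in euros to the 'cheap' or 'expensive' class."""
    return "cheap" if price_per_night < threshold else "expensive"


prices = [45, 99.99, 100, 250]
print([price_category(p) for p in prices])
# ['cheap', 'cheap', 'expensive', 'expensive']
```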

Finally, we perform some feature engineering like calculating the distance from downtown using the latitude and longitude data in 1-click thanks to a WhizzML script that will be published soon in the BigML Gallery. Incredibly easy!
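
The WhizzML script itself is not shown here, but the underlying calculation can be sketched in Python with the haversine formula. Note the assumptions: the downtown reference point below (approximately Dam Square) is our own choice for illustration, and the function names are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters

# Assumed reference point for "downtown" (approximately Dam Square, Amsterdam).
DOWNTOWN = (52.3731, 4.8926)


def distance_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points (haversine formula)."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def distance_from_downtown(lat, lon):
    """Distance in meters from the assumed downtown reference point."""
    return distance_m(lat, lon, *DOWNTOWN)


# A listing 0.01 degrees of latitude north of the reference point.
print(round(distance_from_downtown(52.3831, 4.8926)))
```

The result is a single numeric field in meters that can be added to the dataset alongside the original latitude and longitude.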

Let’s dive in!

The Logistic Regression

Creating a Logistic Regression is very easy, especially when using the 1-click Logistic Regression option. (Alternatively, you may prefer the configuration option to tune various model parameters.) After a short wait… voilà! The model has been created, and now you can visually inspect the results with both a two-fold chart (1D and 2D) and the coefficients table.

The Chart

The Logistic Regression chart allows you to visually interpret the influence of one or more fields on your predictions. Let’s see some examples.

In the image below, we selected the distance (in meters) from downtown for the x-axis. As you may expect, the probability of an accommodation being cheap (blue line) increases as the distance increases, while the probability of being expensive (orange line) decreases. At some point (around 8 kilometers) the slope softens and the probabilities tend to be constant.

Following the same example, you can also see the combined influence of other field values by using the input fields form to the right. The images below show the impact of the room type on the relationship between distance and price. When “Shared room” is selected and the accommodation is 3 kilometers from downtown, there is a 75% probability for the cheap class. However, if we select “Entire home/apt”, given the same distance, there is an 83% probability of finding an expensive rental.

The combined impact of two fields on predictions can be better visualized in the 2D chart, which you can open by clicking on the green switch at the top. A heat map chart containing the class probabilities appears, and you can select the input fields for both axes. The image below shows the great difference in cheap and expensive price probabilities depending on the neighborhood, while the minimum nights feature on the x-axis seems to have less influence on the price.


You can also enter text and item values into the corresponding form fields on the right. Keeping the same input fields on the axes, see below the increase of the expensive class probability across all neighborhoods due to the presence of the word “houseboat” in the accommodation description.


The Coefficients Table

For more advanced users, BigML also displays a table where you can inspect all the coefficients for each of the input fields (rows) and each of the objective field classes (columns).

The coefficients can be interpreted in two ways:

  • Correlation direction: given an objective field class, a positive coefficient for a field indicates that higher values for that field will increase the probability of the class. By contrast, negative coefficients indicate a negative correlation between the field and the class probability.
  • Field importance: you can interpret the absolute values of the coefficients as feature importance only if all fields are on the same scale and they are not multicollinear, i.e., no two or more fields are correlated with each other.

In the example below, you can see the coefficient for the room type “Entire home/apt” is positive for the expensive class and negative for the cheap class, indicating the same behavior that we saw at the beginning of this post in the 1D chart.



After evaluating your model, when you finally are satisfied with it, you can go ahead and start making predictions. BigML offers predictions for a single instance or multiple instances (in batch).

In the example below, we are making a prediction for a new single instance: a private room located in Westerpark, with the word “studio” in the description and a minimum stay of 2 nights. The class predicted is cheap with a probability of 95.22% while the probability of being an expensive rental is just 4.78%.


We encourage you to check out the other posts in this series: the first post was about the basic concepts of Logistic Regression, the second post covered the six necessary steps to get started with Logistic Regression, this third post explains how to predict with BigML’s Logistic Regressions, the fourth and fifth posts will cover how to create a Logistic Regression with the BigML API and with WhizzML respectively, and finally, the sixth post will dive into the differences between Logistic Regression and Decision Trees. So please stay tuned until then…

If you want to learn more about Logistic Regression, don’t miss the free live webinar next Wednesday, September 28, 2016 at 10:00AM US Pacific Time (Portland, Oregon / GMT -07:00) / 7:00 PM CET (Valencia, Spain. GMT +02:00). Register now as space is limited!

Logistic Regressions: the 6 Steps to Predictions

BigML is bringing Logistic Regression to the Dashboard so you can solve complex classification problems with the help of powerful visualizations to inspect and analyze your results. Logistic Regression is one of the best-known supervised learning algorithms to predict binary or multi-class categorical values such as “True / False”, “Spam / Not Spam”, “Offer A / Offer B / Offer C”, etc.

In this post we aim to take you through the 6 necessary steps to get started with Logistic Regression:


1. Uploading your Data

As usual, start by uploading your data to your BigML account. BigML offers several ways to do it: you can drag and drop a local file, connect BigML to your cloud repository (e.g., S3 buckets) or copy and paste a URL. BigML automatically identifies the field types. Field types and other source parameters can be configured by clicking on the source configuration option.

2. Create a Dataset

From your source view, use the 1-click dataset option to create a dataset, a structured version of your data ready to be used by a Machine Learning algorithm.


In the dataset view you will be able to see a summary of your field values, some basic statistics and the field histograms to analyze your data distributions. This view is really useful to see any errors or irregularities in your data. You can filter the dataset by several criteria and create new fields using different pre-defined operations.


Once your data is clean and free of errors you can split your dataset in two different subsets: one for training your model, and the other for testing. It is crucial to train and evaluate your model with different data to ensure it generalizes well against unseen data. You can easily split your dataset using the BigML 1-click option, which randomly sets aside 80% of the instances for training and 20% for testing.


3. Create a Logistic Regression

Now you are ready to create the Logistic Regression using your training dataset. You can use the 1-click Logistic Regression option, which will create the model using the default parameter values. If you are a more advanced user and you feel comfortable tuning the Logistic Regression parameters, you can do so by using the configure Logistic Regression option.


Find below a list containing a brief summary for each of the configuration parameters. If you want to learn more about them please check the Logistic Regression documentation.

  • Objective field: select the field you want to predict. By default BigML will take the last valid field in your dataset. Remember it must be categorical!

  • Default numeric value: if your numeric fields contain missing values, you can easily replace them by the field mean, median, maximum, minimum or zero using this option. It is inactive by default.

  • Missing numerics: if your numeric fields contain missing values but you think they have a meaning to predict the objective field, you can use this option to include them in the model. Otherwise, instances with missing numerics will be ignored. It is active by default.

  • Eps: set the value of the stopping criteria for the solver. Higher values can make the model faster, but they may result in a poorer predictive performance. You can set a float value between 0 and 1. It is set to 0.0001 by default.

  • Bias: include or exclude the intercept in the Logistic Regression formula. Including it yields better results in most cases. It is active by default.

  • Auto-scaled fields: automatically scale your fields so they all have the same magnitudes. This will also allow you to compare the field coefficients learned by the model afterwards. It is active by default.

  • Regularization: prevent the model from overfitting by using a regularization factor. You can choose between L1 and L2 regularization. The former usually gives better results. You can also tweak the inverse of the regularization strength.

  • Field codings: select the encoding option that works best for your categorical fields. BigML will automatically transform your categorical values into 0-1 variables to support non-numeric fields as inputs, which is a method known as one-hot encoding. Alternatively, you can choose among three other types of codings: dummy coding, contrast coding and other coding. You can find a detailed explanation of each one in the documentation.

  • Sampling options: if you have a very large dataset, you may not need all the instances to create the model. BigML allows you to easily sample your dataset at the model creation time. 
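
To make the "Default numeric value" option more concrete, here is a small Python sketch of replacing missing numeric values with the field mean, median, maximum, minimum or zero. It is an illustration of the idea only: the `fill_missing` helper and its strategy names are hypothetical and do not mirror BigML's implementation.

```python
from statistics import mean, median


def fill_missing(values, strategy="mean"):
    """Replace None entries with the field mean, median, maximum, minimum or zero."""
    present = [v for v in values if v is not None]
    replacements = {
        "mean": mean,
        "median": median,
        "maximum": max,
        "minimum": min,
        "zero": lambda _: 0,
    }
    fill = replacements[strategy](present)
    return [fill if v is None else v for v in values]


# Hypothetical numeric field with one missing value.
bmi = [22.0, None, 30.0, 26.0]
print(fill_missing(bmi, "mean"))  # [22.0, 26.0, 30.0, 26.0]
print(fill_missing(bmi, "zero"))  # [22.0, 0, 30.0, 26.0]
```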

At this point you may be wondering… ok, so which parameter values should I use?

Unfortunately, there is not a universal response for that. It depends on the data, the domain and the use case you are trying to solve. Our recommendation is that you try to understand the strengths and weaknesses of your model and iterate trying different features and configurations. To do this, the model visualizations explained in the next point play an essential role.

4. Analyze your Results

When your Logistic Regression has been created you can use BigML’s insightful visualizations to dive into the model results and see the impact of your features on model predictions. Keep in mind that most of the time, the greatest gains in performance come from feature selection and feature engineering, which can be the most time-consuming part of the Machine Learning process. Analyzing the results carefully and inspecting your model to understand the reasons behind the predictions is key to validating the findings against expert opinion.

BigML provides a 1D and 2D chart and the coefficient table to analyze your results.

1D and 2D Chart

The Logistic Regression chart provides a visual way to analyze the impact of one or more fields on predictions.

For the 1D chart you can select one input field in the x-axis. In the prediction legend to the right, you will see the objective class predictions as you mouse over the chart area. 


For the 2D chart you can select two input fields, one per axis and the objective class predictions will be plotted in the color heat map chart.


By setting the values for the rest of input fields using the form below the prediction legend, you will be able to inspect the combined interaction of multiple fields on predictions.

Coefficients table

BigML also provides a table to display the coefficients learned by the Logistic Regression. Each coefficient has an associated field (e.g., checking_status) and an objective field class (e.g., bad, good, etc.). A positive coefficient indicates a positive correlation between the input field and the objective field class, while a negative coefficient indicates a negative relationship.

To find out more about the interpretation of the Logistic Regression chart and coefficient table results, follow our next blog post of this series: Predicting Airbnb Prices with BigML Logistic Regression.

5. Evaluate the Logistic Regression

Like any supervised learning method, Logistic Regression needs to be evaluated. Just click on the evaluate option in the 1-click menu and BigML will automatically select the remaining 20% of the dataset that you set aside for testing.

The resulting performance metrics to be analyzed are the same ones as for any other classifier predicting a categorical value.

You will get the confusion matrix containing the true positives, false positives, true negatives and false negatives along with the classification metrics: precision, recall, accuracy, f-measure and phi-measure. For a full description of the confusion matrix and classification measures see the corresponding documentation.
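
For readers who want to see how these measures follow from the four confusion matrix counts, the Python sketch below uses the standard textbook definitions for a binary problem (the phi coefficient is also known as the Matthews correlation coefficient). The counts are invented for illustration, and the helper does not guard against empty rows or columns, which would cause a division by zero.

```python
from math import sqrt


def classification_metrics(tp, fp, tn, fn):
    """Compute standard binary classification metrics from confusion matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    phi = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f_measure": f_measure,
        "phi": phi,
    }


# Hypothetical counts for a 100-instance test dataset.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```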

Rinse and repeat! As we mentioned at the end of step 3, repeat steps from 3 to 5 trying out different configurations, different features, etc. until you have a good enough model.

6. Make Predictions

When you finally reach a satisfying model performance, you can start making predictions with it. In BigML, you can make predictions for a new single instance or multiple instances in batch. Let’s take a quick look at both of them!

Single predictions

Click on the Predict option and set the values for your input fields.


A form containing all your input fields will be displayed and you will be able to set the values for a new instance. At the top of the view you will see the objective class probabilities changing as you change your input field values.

Batch predictions

Use the Batchprediction option in the 1-click menu and select the dataset containing the instances for which you want to know the objective field value.


You can configure several parameters of your batch prediction like the possibility to include all class probabilities in the batch prediction output dataset and file. When your batch prediction finishes you will be able to download the CSV file and see the output dataset.

In the next post we will cover a real use case using Logistic Regression to predict Airbnb prices to delve into the Logistic Regression results interpretation.

If you want to learn more about Logistic Regression, join the free live webinar next Wednesday, September 28, 2016 at 10:00AM US Pacific Time (Portland, Oregon / GMT -07:00) / 7:00 PM CET (Valencia, Spain. GMT +02:00). Register now as space is limited!


BigML for Alexa: The First Voice Controlled Predictive Assistant




The promise of voice recognition has been around for a long time, but it has always been quite miserable. In fact, just back in 2012 my daughter and I were helping my mother purchase a new car. I paired my phone to the in-car audio and tried to dial home. After several attempts, we were nearly in tears from laughing at how impossibly bad it was at recognizing the spoken phone number.

However, in the last few years, advances in Machine Learning have improved the capability of voice recognition dramatically; see for example the section about the history of Siri here. Even more importantly, the availability of voice recognition APIs like Amazon’s Alexa Voice Service has made possible the rapid adoption of voice controlled applications.

But what about that moment in Star Trek IV, The Voyage Home, when Scotty not only expects to be able to speak to the computer, but to have the computer reply intelligently? To get there we need to not only rely on machine learning for voice recognition, but to bring voice recognition to machine learning applications!

As of today, we are one step further along that path with the introduction of the BigML for Alexa skill.

The BigML for Alexa skill combines the predictive power of BigML with the voice processing capabilities of the Alexa Voice Service. Using an Alexa enabled device like an Amazon Echo or Dot, this integration makes it possible to use spoken questions and answers to generate predictions using your own models trained in BigML.

For example, if you have data regarding wine sales with features like the sale month and grape variety, you could build a model which predicts the sales for a given month, variety, etc. With this model loaded into the BigML for Alexa skill, you could generate a sales prediction by answering questions vocally.


If you already have an AVS device like the Amazon Echo, you can quickly get a feel for the capabilities of the BigML Alexa skill in two steps:

First, enable the skill with:

“Alexa, enable the Big M. L. skill”

Then you can run a demo with:

“Alexa, ask Big M. L. to give me a demo”

This will load a model which asks questions about a patient’s diagnostic measurements like the 4-hour plasma glucose and BMI and uses your answers to make a prediction about the likelihood of that individual having diabetes. Of course, keep in mind that this is only a demo and is not medical advice!

If you want to try the BigML Alexa skill with your own BigML models, you just need to link the skill to your BigML account:


And then ask to load the latest model with

“Alexa, ask Big M. L. to load the latest model”.

This will load your most recently created model and launch a prediction.

As you start to play with your own models, you may run into some quirks with how field names are spoken, especially if they have punctuation or abbreviations. No worries – you can control how the fields are spoken using the labels and descriptions in your BigML dataset.

How to do this and lots of other tips and tricks can be found in the BigML for Alexa documentation.

Now we just need the formula for transparent aluminum!

Introduction to Logistic Regression

With this Summer 2016 Release, BigML is bringing Logistic Regression to the Dashboard, a very popular supervised Machine Learning method for solving classification problems. This upcoming release is the perfect opportunity to guide you through Logistic Regression step by step. That is why we are presenting several blog posts to introduce you to this Machine Learning method.

Within this first post you will have a general overview of what Logistic Regression is. In the coming days, we will be complementing this post with five more entries: a second post that will take you through the six necessary steps to get started with Logistic Regression, a third post about how to make predictions with BigML’s Logistic Regression, a fourth blog post about how to create a Logistic Regression using the BigML API and a fifth one that will explain the same process using WhizzML instead, and finally, a sixth post that will analyze the difference between Logistic Regression and Decision Trees.

Let’s get started with Logistic Regression!


Why Logistic Regression?

Before machine learning hit the scene, the go-to tool for statistical modelling was regression analysis. Regressions aim to model the behavior of an objective variable as a combination of effects from a number of predictor variables. Among these tried-and-true techniques is Logistic Regression, which was originally developed by statistician David Cox in 1958. Logistic regression is used to solve classification problems, where the objective is a categorical variable. Let’s see a simple example.


The dataset we’ll work with contains health statistics from 768 people belonging to the Pima Native American ethnic group. Our objective is to model the effect of an individual’s plasma glucose level on whether that individual contracts diabetes. The above scatterplot shows the plasma glucose levels of individuals with and without diabetes. At a glance, we can see that some relationship exists, where higher levels of plasma glucose are indicative of having diabetes. Note that while the x-axis is numeric in the scatterplot, the y-axis is categorical. This means we’ll need to apply a transformation before we can encode this relationship numerically.

Rather than relating glucose levels directly to true/false values, we model the probability of diabetes as a function of plasma glucose. Speaking in terms of probability is appropriate because, as we can see in the above graph, there is no clear-cut threshold on plasma glucose beyond which we can say a person will have diabetes. Most individuals with plasma glucose below 80 mg/dL do not have diabetes, while most with levels above 180 mg/dL have diabetes. Within that range, however, there is a significant amount of overlap. We need a way to express this interval of fuzziness flanked by two zones of certainty. For this purpose, we’ll use a function called the logistic function.


In this single-predictor example, our regression function is characterized by only two parameters: the slope of the transition and where the transition point is located along the x-axis. Fitting a logistic regression is simply learning the values of these parameters. The ability to encapsulate the model in only two numbers is one of the main selling points of logistic regression. Check out these slides from the Valencian Summer School in Machine Learning 2016 for more details and examples.
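
As a rough illustration of those two parameters, the Python sketch below writes the logistic function in terms of a slope and a transition point (midpoint). The parameter values are hypothetical, chosen only to show the shape of the curve, and are not fitted to the diabetes dataset.

```python
from math import exp


def logistic(x, slope, midpoint):
    """Logistic curve: probability rises from 0 to 1 around the midpoint."""
    return 1.0 / (1.0 + exp(-slope * (x - midpoint)))


# Hypothetical parameter values for illustration only, not fitted to the data.
slope, midpoint = 0.04, 140.0
for glucose in (80, 140, 180):
    print(glucose, round(logistic(glucose, slope, midpoint), 3))
```

At the midpoint the probability is exactly 0.5; a larger slope makes the transition between the two zones of certainty sharper, while a smaller one widens the fuzzy interval.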

Having said that, in this first post we won’t go too deep into Logistic Regression. Sometimes simplicity is the key to understanding the basic concepts:

  • The aim of a Logistic Regression is to model the probability of an event that occurs depending on the values of the independent variables.
  • A Logistic Regression estimates the probability that an event occurs for a randomly selected observation versus the probability of this event not occurring at all.
  • A Logistic Regression classifies observations by estimating the probability that an observation is in a particular category.
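
The ratio mentioned in the second point, the probability of an event occurring versus not occurring, is commonly called the odds, and its logarithm is the quantity a logistic regression models as a linear combination of the inputs. A minimal Python sketch of the conversions (the helper names are ours, for illustration):

```python
from math import exp, log


def odds(p):
    """Odds of an event: probability it occurs divided by probability it does not."""
    return p / (1 - p)


def log_odds(p):
    """Natural logarithm of the odds (the logit)."""
    return log(odds(p))


def probability_from_log_odds(z):
    """Invert the logit: map any real number back to a probability."""
    return 1 / (1 + exp(-z))


p = 0.75
print(odds(p))  # 3.0
print(round(probability_from_log_odds(log_odds(p)), 6))  # 0.75
```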

These videos from Brandon Foltz offer a deeper dive into the essence of Logistic Regression:


Want to know more about Logistic Regression?

Don’t miss our next posts, and of course, join our free live webinar next Wednesday, September 28, 2016 at 10:00AM US Pacific Time (Portland, Oregon / GMT -07:00) / 7:00 PM CET (Valencia, Spain. GMT +02:00). Register now as space is limited!

BigML Summer 2016 Release and Webinar: Logistic Regression and more!

BigML’s Summer 2016 Release is here! Join us on Wednesday, September 28, 2016 at 10:00AM US Pacific Time (Portland, Oregon / GMT -07:00) / 7:00 PM CET (Valencia, Spain. GMT +02:00) for a FREE live webinar to learn about the newest version of BigML. We’ll be diving into Logistic Regression, one of the most popular supervised Machine Learning methods for solving classification problems.


Last Fall we launched Logistic Regressions in the BigML API to let you easily create and download models to your environment for fast, local predictions. With this Summer Release, we go a step further by bringing Logistic Regression to the BigML Dashboard. This new and intuitive Dashboard visualization includes a chart and a coefficients table. The former lets you analyze the impact of an input field in the objective field predictions, whereas the table shows all the coefficients learned for each of the logistic function variables, ideal for inspecting model results and debugging tasks.


You can plot the impact of the input fields in either one or two dimensions; simply select the desired option with the green slider. In the right-hand-side legend you will see the class probabilities change according to the input fields selected as the axes. You can also set the values for the rest of the input fields using the form below the legend.


The ultimate goal of creating a Logistic Regression is to make predictions with it. You can easily predict single instances using the BigML prediction form: just input the values for the fields used by the Logistic Regression and you will get an immediate response with the predicted class along with its probability. BigML also provides the probabilities for the rest of the classes in the objective field in a visual histogram that changes in real time as you configure the input field values.


In addition to commercial activities, BigML plays an active role in promoting Machine Learning for education. With special offers, our education program is rapidly expanding thanks to the participation of top universities all around the world. We would like to spread the word to more students, professors and academic researchers with your help, so please feel free to refer your fellow educators.

Are you ready to discover all you can do with Logistic Regressions? Join us on Wednesday, September 28, 2016 at 10:00 AM PDT (Portland, Oregon. GMT -07:00) / 7:00 PM CEST (Valencia, Spain. GMT +02:00). Be sure to reserve your free spot today as space is limited! We will also be giving away BigML t-shirts to those who submit questions during the webinar. Don’t forget to request yours!

2nd Valencian Summer School in Machine Learning: Done and Dusted!

This week over 140 attendees representing 53 companies and 21 academic organizations from 19 countries gathered in Valencia to get their hands dirty with a curriculum jam-packed with practical Machine Learning techniques and case studies that they can put to good use where they work or teach.


The regularly scheduled sessions were capped with an additional surprise talk from BigML’s Strategic Advisor Professor Geoff Webb about Multiple Test Correction for Streams and Cascades of Statistical Hypothesis Tests that he has been developing as part of his recent academic research.

The diverse backgrounds of the attendees, along with their active participation and willingness to absorb Machine Learning knowledge, have jointly served to prove that the proverbial Machine Learning genie is out of the bottle, never again to be solely confined to small academic and scientific circles. The writing is already on the wall. In today’s knowledge economy, driven increasingly by smart applications, Machine Learning is no longer an elective. Rather, it’s one of the main courses to be mastered by developers, engineers, information technology professionals, analysts, and even hands-on functional specialists from areas as varied as marketing, sales, supply chain, operations, finance or human resources. We thank all of our graduates for their enthusiasm as well as their valuable feedback, which taught us a few new things in the process.


As the BigML family, we wish to stay connected for new editions of our training events to be held in larger and larger venues! THANK YOU VERY MUCH!

Introducing the Artificial Intelligence Startup Battle in Boston on October 12 at PAPIs ‘16

Telefónica Open Future_, Telefónica’s startup accelerator that helps the best entrepreneurs grow and build successful businesses, invites you to participate in the Artificial Intelligence Startup Battle of PAPIs ‘16, the 3rd International Conference on Predictive Applications and APIs, to be held in Boston on October 12 at the Microsoft New England Research and Development Center.


Artificial Intelligence (AI) has a track record of improving the way we make decisions. So why not use it to decide which startups to invest in, and take advantage of all the startup data that is available? The AI Startup Battle, powered by PreSeries (a joint venture between BigML and Telefónica Open Future_), is a unique experience you don’t want to miss, where you’ll witness real-world, high-stakes AI.

As an early stage startup, you will enjoy a great opportunity to secure seed investment, and get press coverage in one of the technology capitals of the world. On the other hand, attendees will discover disruptive innovation from the most promising startups in AI, as the winner will be chosen by an impartial algorithm that evaluates startups’ chances of success based on signals derived from decades of entrepreneurial undertakings.

Want to compete in the AI Startup Battle?

If you are a startup with applied AI and Machine Learning as a core component of your offering, then we’ll be happy to meet you! Submit your application and if you are selected, you’ll be able to pitch on stage, make connections at the conference, and get unique exposure among a highly distinguished audience.

Five Artificial Intelligence startups will be selected to present their projects on stage on October 12 at the closing of the PAPIs ‘16 conference. They will be automatically judged by an application that uses a Machine Learning algorithm to predict the probability of success of a startup, without human intervention.

The five startups selected to present will get a free exhibitor package at PAPIs worth $4,000 each.

The winner of the battle will be invited to Telefónica Open Future_’s acceleration program and will receive funding of up to $50,000. The winner will enjoy not only an incredible place to work but also access to mentors, business partners, a global network of talent as well as the opportunity to reach millions of Telefónica customers.

NOTE: During the acceleration program, part of the startup team should be working from one of the countries where Telefónica operates (i.e., Argentina, Brazil, Chile, Colombia, Germany, Mexico, Peru, Spain, United Kingdom, or Venezuela).

Disrupting early stage technology investments

It seems like only yesterday when Telefónica Open Future_ sponsored the world’s first AI Startup Battle last March in Valencia, when we first introduced PreSeries to the world. The event was the warm-up round to the upcoming one in Boston. The Madrid-based winner, Novelti, which scored a competition-high 86 points, will now be taking on a new set of competitors in Boston, as we discover some of the most promising early stage startups in Artificial Intelligence and Machine Learning in North America, Europe and the rest of the world.

As a bonus, the minds behind PreSeries will take center stage to speak about their technology architecture, e.g., the supporting data, the training of the model, and its evaluation framework. It is a rare opportunity to find out what goes on behind the scenes in delivering this innovative real-life predictive application.

Startup Battle highlights:

  • AI Startup Battle.
  • Wednesday, October 12, 2016 from 5:00 PM to 6:30 PM (EDT).
  • Microsoft New England Research and Development Center – 1 Memorial Dr #1 1st floor, Cambridge, Massachusetts, 02142. USA.
  • To apply to present at the battle, please fill out the application form before September 29th. Spots are limited and will be awarded on a first-come, first-served basis.

PAPIs ’16

The setting of the AI Startup Battle could not be more innovative. The battle is part of PAPIs ’16, the 3rd International Conference on Predictive Applications and APIs. It is a community conference dedicated to real-world Machine Learning and related intelligent applications. Subject-matter experts and leading practitioners from around the world will fly to Boston to discuss new developments, opportunities and challenges in this rapidly evolving space. The conference features tutorials and talks for all levels of experience, and networking events to help you connect with speakers, exhibitors, and other attendees.

As the PAPIs conference series makes its debut in the United States, there are also some changes, including a pre-conference training day on October 10, 2016. The curriculum includes Operational Machine Learning with Open Source & Cloud Platforms, where participants will learn about the possibilities of Machine Learning, how to create predictive models from data, and how to operationalize and evaluate them.

Registration to PAPIs ’16 is separate. You can find out more details here.

Must See Machine Learning Talk by Geoff Webb in Valencia

BigML and Las Naves are getting ready to host the 2nd Machine Learning Summer School in Valencia (September 8-9), which is fully booked. Although we are not able to extend any new invitations for the Summer School, we are happy to share that BigML’s Strategic Advisor Professor Geoff Webb (Monash University, Melbourne) will be giving an open talk on September 8th at the end of the first day of the Summer School. All MLVLC meetup members are cordially invited to attend this talk, which will start promptly at 6:30 PM CEST at Las Naves. After Professor Webb’s talk, there will be time for free drinks and networking. Below are the details of this unique talk.

A multiple test correction for streams and cascades of statistical hypothesis tests

Statistical hypothesis testing is a popular and powerful tool for inferring knowledge from data. For every such test performed, there is always a non-zero probability of making a false discovery, i.e. rejecting a null hypothesis in error. Family-wise error rate (FWER) is the probability of making at least one false discovery during an inference process. The expected FWER grows exponentially with the number of hypothesis tests that are performed, almost guaranteeing that an error will be committed if the number of tests is big enough and the risk is not managed; a problem known as the multiple testing problem. State-of-the-art methods for controlling FWER in multiple comparison settings require that the set of hypotheses be predetermined. This greatly hinders statistical testing for many modern applications of statistical inference, such as model selection, because neither the set of hypotheses that will be tested, nor even the number of hypotheses, can be known in advance.
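
To get a feel for the multiple testing problem described above, here is a small Python sketch. It assumes the tests are independent and all share the same significance level, in which case the family-wise error rate is 1 - (1 - alpha)^m for m tests. The function names are ours, and the Bonferroni correction shown is just a classical example of a method that needs the number of tests up front; it is not the Subfamilywise Multiple Testing procedure from the talk.

```python
def family_wise_error_rate(alpha, num_tests):
    """Chance of at least one false discovery across independent tests."""
    # Each test avoids a false discovery with probability (1 - alpha).
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_threshold(alpha, num_tests):
    """Per-test significance level that keeps the overall FWER at or below alpha.

    Note that num_tests has to be known before any test is run.
    """
    return alpha / num_tests


alpha = 0.05
for m in (1, 10, 50, 100):
    uncorrected = family_wise_error_rate(alpha, m)
    corrected = family_wise_error_rate(bonferroni_threshold(alpha, m), m)
    print(f"{m:>3} tests: FWER {uncorrected:.3f} uncorrected, {corrected:.3f} with Bonferroni")
```

At a 0.05 level, ten independent tests already push the uncorrected error rate to about 0.40, and a hundred tests push it above 0.99, while the corrected rate stays below 0.05. The catch is the one the abstract points out: the correction only works if the number of tests is fixed in advance.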

Subfamilywise Multiple Testing is a multiple-testing correction that can be used in applications for which there are repeated pools of null hypotheses from each of which a single null hypothesis is to be rejected and neither the specific hypotheses nor their number are known until the final rejection decision is completed.

To demonstrate the importance and relevance of this work to current machine learning problems, Professor Webb and co-authors further refine the theory to the problem of model selection and show how to use Subfamilywise Multiple Testing for learning graphical models.

They assess its ability to discover graphical models on more than 7,000 datasets, studying the ability of Subfamilywise Multiple Testing to outperform the state-of-the-art on data with varying size and dimensionality, as well as with varying density and power of the present correlations. Subfamilywise Multiple Testing provides a significant improvement in statistical efficiency, often requiring only half as much data to discover the same model, while strictly controlling FWER.

Please RSVP for this talk soon and be sure to take advantage of this unique chance to learn more about this cutting-edge technique, while joining our Summer School attendees from around the world for a stimulating session of networking afterwards.

The Ghost Olympic Event: Machine Learning Startup Acquisition

With no lack of drama both on and off the track, the 31st Summer Olympics and its 39 events have been wrapped up recently. As the city of Rio is preparing for the first Paralympic Games to take place in the Southern Hemisphere, some are experiencing Synchronized Swimming, Canoe Slalom and Modern Pentathlon withdrawal symptoms.  As Usain Bolt, Michael Phelps and Simone Biles stole the show, Silicon Valley has not just quietly sat and watched the proceedings. Not at all.  In fact, VCs, investment bankers and tech giants active in the Machine Learning space have been in a race of their own that goes on unabated even if they don’t get the benefit of prime time NBC TV coverage.


Machine Learning as strategic weapon

It is fair to say that we have been witnessing the unfolding of the ghost Olympic event of Machine Learning startup acquisition. The business community’s scores are not fully revealed yet and the acquisition amounts are mostly being kept under wraps, albeit in a leaky kind of way. Regardless, the most recent deals include Apple acquiring Turi and Gliimpse, Salesforce purchasing BeyondCore, Intel picking up Nervana Systems, and Genee being scooped up by Microsoft. So what is driving this recent surge?

The bulk of the M&A activity has been led by household B2C names like Google, Apple and Facebook that are sitting on top of piles of consumer data that can result in a new level of innovation when coupled with existing as well as emerging Machine Learning techniques like Deep Learning. The dearth of talent to make this opportunity a reality has resulted in a very uneven distribution of that talent, as those deep-pocketed “acquihirers” outbid other suitors to the tune of $10M per FTE for early stage startups (and even higher in the case of accomplished academic brains).

The emerging need for a platform approach

As great as having some of the brightest minds work on complex problems is, it is no guarantee of success without the right tools and processes to maximize the collaboration with and the learning among developers, analysts and resident subject-matter experts.  Indeed, the best way to scale and amplify the impact from the efforts of these highly capable, centralized yet still relatively tiny teams is adopting a Machine Learning platform approach.

It turns out that those who started on the path of prioritizing Machine Learning as a key innovation enabler early on have already poured countless developer man-years into building their own platforms from scratch. Facebook’s FBLearner Flow, which the company recently revealed, is a great example of this trend. The platform claims to have supported over 1 million modeling experiments to date, which make 6 million predictions per second possible for various Facebook modules such as the news feed. But perhaps the most impressive statistic is that 25% of Facebook engineers have become users of the platform over the years. This is very much in line with Google’s current efforts to train more developers to help themselves when it comes to building Machine Learning powered smart features and applications.

Machine Learning haves (1%) and have nots (99%)

Examples like the above are inspirational, but this raises the question of how many companies can realistically afford to build their own platform from scratch. The short answer is “Not too many!”

Left to their own devices, these firms face the following options:

  • Hire a few Data Scientists, who may each bring their own open source tools and libraries of varying levels of complexity, potentially limiting the adoption of Machine Learning in other functions of the organization, where the ownership of mission critical applications and core industry expertise reside.

  • Turn to commercial point solution providers with a few built-in black-box Machine Learning driven use cases per function, e.g., HR, Marketing, Sales etc.

  • Count on the larger B2B players’ recently launched Machine Learning platforms to catch up and mature in a way that can not only engage highly experienced Machine Learning specialists, but also serve the needs of developers and analysts alike e.g., IBM, Microsoft (Azure), Amazon (AWS) etc.

Although these options may be acceptable ways to dip your toes in the water or stop the bleeding in going to market with a very specific use case, they are not satisfactory longer term approaches. None of them strikes the optimal balance between time to market, return on investment and a collaborative transformation that leads to a data driven culture of continuous innovation that transcends what can be achieved with small teams of PhDs. As a result, despite the recent advances in data collection, storage and processing, we are stuck with a data rich but insights (and business outcomes) poor environment awash with a cacophony of buzzwords in many industries.

Luckily, there’s still an incipient industry of independent Machine Learning platforms like BigML, H2O and Skytree (no more Turi) that can supply this unfulfilled demand from the so far lagging 99%. However, we must remember that replacing those platforms with new complete ones may require years of arduous work by highly specialized teams, which runs counter to the present day two co-founder, Silicon Valley accelerator startup recipe targeting a quick exit despite little to no Intellectual Property.

Regardless of whether any tech bellwether is able to create a monopoly, it seems safe to assume that for the foreseeable future the race for Machine Learning talent is only going to get hotter as more companies get a taste of its value. We will all see whether this game of inverse musical chairs will last long enough to make it to the official program of Tokyo 2020!
