
Linear Regression: A Technical Overview

BigML has added multiple linear regression to its suite of supervised learning methods. In this sixth and final blog post of our series, we will give a rundown of the technical details for this method.

Model Definition

Given a numeric objective field y, we model its response as a linear combination of our inputs x_1,\cdots,x_n, and an intercept value \beta_0.

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n = \beta_0 + \sum_{i=1}^n \beta_i x_i
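As a minimal sketch of the equation above, the function below evaluates the linear combination for one instance. The function name and the coefficient values are made up for illustration and are not part of BigML.

```python
def predict(intercept, coefficients, inputs):
    """Evaluate y = beta_0 + sum_i beta_i * x_i for a single instance."""
    if len(coefficients) != len(inputs):
        raise ValueError("need exactly one coefficient per input")
    return intercept + sum(b * x for b, x in zip(coefficients, inputs))


# Made-up model: y = 1 + 2*x_1 - 0.5*x_2, evaluated at x_1 = 3, x_2 = 4
print(predict(1.0, [2.0, -0.5], [3.0, 4.0]))
```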

Simple Linear Regression

For illustrative purposes, let’s consider the case of a problem with a single input. We can see that the above expression then represents a line with slope \beta_1 and intercept \beta_0.

y = \beta_0 + \beta_1 x

The task now is to find the values of \beta_0, \beta_1 that parameterize a line which is the best fit for our data. In order to do so we must obtain a metric which quantifies how well a given line fits the data.


Given a candidate line, we can measure the vertical distance between the line and each of our data points. These distances are called residuals. Squaring the residual for each data point and computing the sum, we get our metric.

S = \sum_{i=1}^n (y_i - (\beta_0 + \beta_1 x_i))^2

As one might expect, the sum of squared residuals is minimized when \beta_0, \beta_1 define a line that passes more or less through the middle of the data points.
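For the single-input case, the minimizing values have a well-known closed form in terms of means and centered sums. The sketch below is a plain-Python illustration with made-up data, not BigML's implementation.

```python
def sum_squared_residuals(beta0, beta1, xs, ys):
    """S = sum_i (y_i - (beta0 + beta1 * x_i))^2"""
    return sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))


def fit_line(xs, ys):
    """Return (beta0, beta1) minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta1 = sxy / sxx
    beta0 = mean_y - beta1 * mean_x
    return beta0, beta1


# Made-up data scattered around y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]
beta0, beta1 = fit_line(xs, ys)
print(beta0, beta1, sum_squared_residuals(beta0, beta1, xs, ys))
```

Nudging either coefficient away from the fitted values can only increase the sum of squared residuals, which is what "best fit" means here.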

Multiple Linear Regression

When we deal with multiple input variables, it becomes more convenient to express the problem using vector and matrix notation. For a dataset with n rows and p inputs, define \mathbf{y} as a column vector of length n containing the objective values, \mathbf{X} as an n \times p matrix where each row corresponds to a particular input instance, and \mathbf{\beta} as a column vector of length p containing values of the regression coefficients. The sum of squared residuals can thus be expressed as:

S = ||\mathbf{y - X\beta}||_2^2

The value of \mathbf{\beta} which minimizes this is given by the closed-form expression:

\mathbf{\beta = (X^T X)^{-1} X^T y}
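For readers who want the intermediate steps, the closed form follows from setting the gradient of S to zero. This is the standard derivation and assumes that \mathbf{X}^T \mathbf{X} is invertible, i.e. that the columns of \mathbf{X} are linearly independent:

```latex
S(\mathbf{\beta}) = (\mathbf{y} - \mathbf{X}\mathbf{\beta})^T (\mathbf{y} - \mathbf{X}\mathbf{\beta})

\nabla_{\mathbf{\beta}} S = -2\,\mathbf{X}^T (\mathbf{y} - \mathbf{X}\mathbf{\beta}) = \mathbf{0}

\mathbf{X}^T \mathbf{X}\,\mathbf{\beta} = \mathbf{X}^T \mathbf{y}

\mathbf{\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}
```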

The matrix inverse is the most computationally intensive portion of solving a linear regression problem. Rather than directly constructing the matrix \mathbf{X} and performing the inverse, BigML’s implementation uses an orthogonal decomposition which can be incrementally updated with observed data. This allows for solving linear regression problems with datasets which are too large to fit into memory.
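To illustrate the idea of fitting without holding the whole dataset in memory, here is a rough sketch that accumulates \mathbf{X}^T \mathbf{X} and \mathbf{X}^T \mathbf{y} one row at a time and solves the resulting small system. Note that this is a simpler stand-in, not the orthogonal decomposition that BigML uses, and it is less numerically stable. The class name and the leading column of ones for the intercept are assumptions of this sketch.

```python
class StreamingLeastSquares:
    """Accumulates X^T X and X^T y row by row (normal equations)."""

    def __init__(self, n_inputs):
        self.p = n_inputs + 1  # +1 for an intercept column of ones
        self.xtx = [[0.0] * self.p for _ in range(self.p)]
        self.xty = [0.0] * self.p

    def update(self, inputs, target):
        """Fold one training row into the running sums."""
        row = [1.0] + list(inputs)
        for i in range(self.p):
            self.xty[i] += row[i] * target
            for j in range(self.p):
                self.xtx[i][j] += row[i] * row[j]

    def solve(self):
        """Solve (X^T X) beta = X^T y by Gaussian elimination."""
        n = self.p
        a = [r[:] + [b] for r, b in zip(self.xtx, self.xty)]
        for col in range(n):
            pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
            if abs(a[pivot][col]) < 1e-12:
                raise ValueError("X^T X is singular; inputs are collinear")
            a[col], a[pivot] = a[pivot], a[col]
            for r in range(col + 1, n):
                factor = a[r][col] / a[col][col]
                for c in range(col, n + 1):
                    a[r][c] -= factor * a[col][c]
        beta = [0.0] * n
        for i in reversed(range(n)):
            tail = sum(a[i][j] * beta[j] for j in range(i + 1, n))
            beta[i] = (a[i][n] - tail) / a[i][i]
        return beta  # [beta_0, beta_1, ..., beta_p]


# Made-up rows generated from y = 1 + 2*x_1 - 3*x_2
model = StreamingLeastSquares(n_inputs=2)
for x1, x2 in [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]:
    model.update([x1, x2], 1 + 2 * x1 - 3 * x2)
print(model.solve())
```

Only the small (p+1) by (p+1) sums are kept between rows, so memory use does not grow with the number of rows.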


Predicting new data points with a linear regression model is just about as easy as it can get. We simply take the coefficients \beta_0,\ldots,\beta_n from the model and evaluate the regression equation above to obtain a predicted value for y. BigML also returns two metrics that describe the quality of the prediction: the confidence interval and the prediction interval. These are illustrated in the following figure:


These two intervals carry different meanings. Depending on how the predictions are to be used, one will be more suitable than the other.

The confidence interval is the narrower of the two. It gives the 95% confidence range for the mean response. If you were to sample a large number of points at the same x-coordinate, there is a 95% probability that the mean of their y values will be within this range.

The prediction interval is the wider interval. For a single point at the given x-coordinate, its y value will be within this range with 95% probability.
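The difference between the two intervals is easiest to see in the single-input case, where both have textbook formulas that differ only by an extra "1 +" term under the square root. The sketch below is an illustration with made-up data, not BigML's code. As an assumption to keep it standard-library only, it uses a normal quantile instead of the Student's t quantile with n-2 degrees of freedom that exact intervals require, so it is only reasonable for larger samples.

```python
from math import sqrt
from statistics import NormalDist


def intervals(xs, ys, x0, level=0.95):
    """Return (prediction, confidence_interval, prediction_interval) at x0."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    beta0 = mean_y - beta1 * mean_x
    residual_ss = sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))
    s = sqrt(residual_ss / (n - 2))  # residual standard error
    z = NormalDist().inv_cdf(0.5 + level / 2)  # large-sample approximation
    y0 = beta0 + beta1 * x0
    leverage = 1 / n + (x0 - mean_x) ** 2 / sxx
    half_conf = z * s * sqrt(leverage)      # uncertainty of the mean response
    half_pred = z * s * sqrt(1 + leverage)  # adds the noise of a single point
    return y0, (y0 - half_conf, y0 + half_conf), (y0 - half_pred, y0 + half_pred)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
prediction, confidence, predictive = intervals(xs, ys, 3.5)
print(prediction, confidence, predictive)
```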

BigML Field Types and Linear Regression

In the regression equation, all of the input variables x_i are numeric values. Naturally, BigML’s linear regression model also supports categorical, text, and items fields as inputs. If you have seen how our logistic regression models handle these inputs, then this will be mostly familiar, but there are a couple of important differences.

Categorical Fields

Categorical fields are transformed to numeric values via field codings. By default, linear regression uses a dummy coding system. For a categorical field with n class values, there will be n-1 numeric predictor variables. We designate one class value as the reference value (by default the first one in lexicographic order). Each of the predictors corresponds to one of the remaining class values, taking a value of 1 when that value appears and 0 otherwise. For example, consider a categorical field with values “Red”, “Green”, and “Blue”. Since there are 3 class values, dummy coding will produce 2 numeric predictors x1 and x2. Assuming we set the reference value to “Red”, each class value produces the following predictor values:

Field value   x1   x2
Red           0    0
Green         1    0
Blue          0    1
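The table can be reproduced with a few lines of code. This is a minimal sketch of dummy coding in general; the function name and the ordering of the remaining classes are assumptions of the example.

```python
def dummy_code(value, classes, reference):
    """Return one 0/1 predictor per non-reference class value."""
    remaining = [c for c in classes if c != reference]
    return [1 if value == c else 0 for c in remaining]


classes = ["Red", "Green", "Blue"]
for value in classes:
    print(value, dummy_code(value, classes, reference="Red"))
```

The reference value is the only one encoded as all zeros, so its effect is absorbed into the intercept.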

Other coding systems such as contrast coding are also supported. For more details check out the API documentation.

Text and Items Fields

Text and items fields are treated in the same fashion. There will be one numeric predictor for each term in the tag cloud/items list. The value for each predictor is the number of times that term/item occurs in the input.
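A rough sketch of this encoding follows. The whitespace tokenizer and the vocabulary are simplifying assumptions; BigML's actual text analysis has more options than shown here.

```python
from collections import Counter


def term_count_predictors(text, vocabulary):
    """Return the number of occurrences of each vocabulary term in text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


# Made-up vocabulary standing in for the terms of a tag cloud
vocabulary = ["great", "pool", "view"]
print(term_count_predictors("Great pool and a great view", vocabulary))
```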

Missing Values

If an input field contains missing values in the training data, then an additional binary-valued predictor will be created which takes a value of 1 when the field is missing and 0 otherwise. The value for all other predictors pertaining to the field will be 0 when the field is missing. For example, a numeric field with missing values will have two predictors: one for the field itself plus the missing value predictor. If the input has a missing value for this field, then its two predictors will be (0,1). In contrast, if the field is not missing but equal to zero, then the predictors will be (0,0).
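The numeric case described above can be sketched as follows, using None to stand for a missing value (an assumption of this example):

```python
def encode_numeric(value):
    """Return (field_predictor, missing_predictor) for one numeric field."""
    if value is None:
        return (0, 1)  # field predictor zeroed, missing flag set
    return (value, 0)


for value in [None, 0, 3.5]:
    print(value, encode_numeric(value))
```

The extra predictor is what lets the model tell a true zero apart from an absent value.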

Wrap Up

That’s pretty much it for the nitty-gritty of multiple linear regression. Being a rather venerable machine learning tool, its internals are relatively straightforward. Nevertheless, you should find that it applies well to many real-world learning problems. Head over to the dashboard and give it a try!

Automating Linear Regressions with WhizzML & Python Bindings


This blog post, the fifth of our series of six posts about Linear regressions, focuses on those users that want to automate their Machine Learning workflows using programming languages. If you follow the BigML blog, you may already be familiar with WhizzML, BigML’s domain-specific language for automating Machine Learning workflows, implementing high-level Machine Learning algorithms, and easily sharing them with others. WhizzML helps developers create Machine Learning workflows and execute them entirely in the cloud. This avoids network problems, memory issues and lack of computing capacity while taking full advantage of WhizzML’s built-in parallelization. If you aren’t familiar with WhizzML yet, we recommend that you read the series of posts we published this summer about how to create WhizzML scripts: Part 1, Part 2 and Part 3 to quickly discover the benefits.

To help automate the manipulation of BigML’s Machine Learning resources, we also maintain a set of bindings which allow users to work in their favorite language (Java, C#, PHP, Swift, and others) with the BigML platform.

Let’s see how to use Linear Regressions through both the popular BigML Python Bindings and WhizzML. Note that the operations described in this post are also available in this list of bindings.

The first step is creating Linear Regressions with the default settings. We start from an existing Dataset to train the model in BigML so our call to the API will need to include the Dataset ID we want to use for training as shown below:

;; Creates a linearregression with default parameters
(define my_linearregression
  (create-linearregression {"dataset" training_dataset}))

The BigML API is mostly asynchronous: the creation function above returns a response before the Linear Regression is complete, usually reporting that creation has started and the resource is in progress. The Linear Regression is therefore not ready to make predictions right after the snippet runs, so you must wait for it to finish first. One way to do that is to use the “create-and-wait-linearregression” directive:

;; Creates a linearregression with default settings. Once it's
;; completed the ID is stored in my_linearregression variable
(define my_linearregression
  (create-and-wait-linearregression {"dataset" training_dataset}))

If you prefer to use the Python Bindings, the equivalent code is this:

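from bigml.api import BigML
api = BigML()

# ID of the dataset used for training
training_dataset = "dataset/59b0f8c7b95b392f12000000"
my_linearregression = api.create_linear_regression(training_dataset)
# Wait until the resource is finished, as create-and-wait does in WhizzML
api.ok(my_linearregression)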
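The waiting behavior described above boils down to polling the resource until it reaches a terminal state. The helper below is a generic sketch of that pattern, not part of the BigML bindings: fetch_status is a hypothetical callable, and the "finished" and "faulty" strings are placeholder names for the terminal states.

```python
import time


def wait_until_done(fetch_status, poll_seconds=2.0, max_polls=30, sleep=time.sleep):
    """Poll fetch_status() until it reports a terminal state.

    fetch_status is a hypothetical callable returning a status string;
    "finished" and "faulty" are placeholder names for terminal states.
    """
    for _ in range(max_polls):
        status = fetch_status()
        if status in ("finished", "faulty"):
            return status
        sleep(poll_seconds)
    raise TimeoutError("resource did not reach a terminal state in time")


# Simulated resource that needs three polls before it is finished
_responses = iter(["queued", "in progress", "finished"])
print(wait_until_done(lambda: next(_responses), sleep=lambda seconds: None))
```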

Next up, we will configure some properties of a Linear Regression with WhizzML. All the configuration properties can be easily added as pairs of <property_name> and <property_value>, as in the example below. For instance, to create an optimized Linear Regression from a dataset, BigML sets the number of model candidates to 128. If you prefer a lower number of steps, you should add the property “number_of_model_candidates” and set it to 10. Additionally, you might want to set the value used by the Linear Regression when numeric fields are missing. To do so, set the “default_numeric_value” property to the right value. In the example below, missing values are replaced by the mean.

;; Creates a linearregression with some settings. Once it's
;; completed the ID is stored in my_linearregression variable
(define my_linearregression
  (create-and-wait-linearregression {"dataset" training_dataset
                            "number_of_model_candidates" 10
                            "default_numeric_value" "mean"}))

NOTE: Property names always need to be between quotes and the value should be expressed in the appropriate type, a string or a number in the previous example. The equivalent code for the BigML Python Bindings becomes:

from bigml.api import BigML
api = BigML()
args = {"number_of_model_candidates": 10, "default_numeric_value": "mean"}
training_dataset = "dataset/59b0f8c7b95b392f12000000"
my_linearregression = api.create_linear_regression(training_dataset, args)

For the complete list of properties that BigML offers, please check the dedicated API documentation.

Once the Linear Regression has been created,  as usual for supervised resources, we can evaluate how good its performance is. Now, we will use a different dataset with non-overlapping data to check the Linear Regression performance.  The “test_dataset” parameter in the code shown below represents the second dataset. Following the motto of “less is more”, the WhizzML code that performs an evaluation has only two mandatory parameters: a Linear Regression to be evaluated and a Dataset to use as test data.

;; Creates an evaluation of a linear regression
(define my_linearregression_ev
 (create-evaluation {"linearregression" my_linearregression "dataset" test_dataset}))

Handy, right? Similarly, using Python bindings, the evaluation is done with the following snippet:

from bigml.api import BigML
api = BigML()
my_linearregression = "linearregression/59b0f8c7b95b392f12000000"
test_dataset = "dataset/59b0f8c7b95b392f12000002"
evaluation = api.create_evaluation(my_linearregression, test_dataset)

Following the steps of a typical workflow, after a good evaluation of your Linear Regression, you can make predictions for new sets of observations. In the following code, we demonstrate the simplest setting, where the prediction is made only for some fields in the dataset.

;; Creates a prediction using a linearregression with specific input data
(define my_prediction
 (create-prediction {"linearregression" my_linearregression
                     "input_data" {"sepal length" 2 "sepal width" 3}}))

The equivalent code for the BigML Python bindings is:

from bigml.api import BigML
api = BigML()
input_data = {"sepal length": 2, "sepal width": 3}
my_linearregression = "linearregression/59b0f8c7b95b392f12000000"
prediction = api.create_prediction(my_linearregression, input_data)

In both cases, WhizzML or Python bindings, in the input data you can use either the field names or the field IDs. In other words, “000002”: 3 or “sepal width”: 3 are equivalent expressions.

As opposed to this prediction, which is calculated and stored in BigML servers, the Python Bindings (and other available bindings) also allow you to instantly create single local predictions on your computer or device. The Linear Regression information will be downloaded to your computer the first time you use it (connectivity is needed only the first time you access the model), and the predictions will be computed locally on your machine, without any incremental costs or latency:

from bigml.linear import LinearRegression
local_linearregression = LinearRegression("linearregression/59b0f8c7b95b392f12000000")
input_data = {"sepal length": 2, "sepal width": 3}
# Computed locally, without calling the API
local_linearregression.predict(input_data)

It is similarly pretty straightforward to create a Batch Prediction in the cloud from an existing Linear Regression, where the dataset named “my_dataset” contains a new set of instances to predict by the model:

;; Creates a batch prediction using a linearregression 'my_linearregression'
;; and the dataset 'my_dataset' as data to predict for
(define my_batchprediction
 (create-batchprediction {"linearregression" my_linearregression
                          "dataset" my_dataset}))

The code in Python Bindings that performs the same task is:

from bigml.api import BigML
api = BigML()
my_linearregression = "linearregression/59d1f57ab95b39750c000000"
my_dataset = "dataset/59b0f8c7b95b392f12000000"
my_batchprediction = api.create_batch_prediction(my_linearregression, my_dataset)

Want to know more about Linear Regressions?

Our next blog post, the last one of this series, will cover how Linear regressions work behind the scenes, diving into the technical implementation aspects of BigML’s latest resource. If you have any questions or you’d like to learn more about how Linear Regressions work, please visit the dedicated release page. It includes links to this series of six blog posts, in addition to the BigML Dashboard and API documentation.

Machine Learning Boosts Startups and Industry

BigML, the leading Machine Learning platform, and GoHub from Global Omnium join forces with a strategic partnership to boost Machine Learning adoption throughout the startup and industry sectors. This partnership helps the tech and business sectors apply Machine Learning in their companies, provides them with Machine Learning education and helps them remain competitive in the marketplace.

BigML can now enjoy the GoHub offices, the new open innovation hub created by Global Omnium (the leading company in the water sector) and the first startup accelerator specialized in Machine Learning, with the collaboration of BigML. BigML, with headquarters in Corvallis, Oregon, USA, and Valencia, Spain, has offered since 2011 the leading Machine Learning platform, which helps almost 90,000 analysts, scientists and developers worldwide implement their own predictive applications with BigML technology.

Plenty of startups and many business sectors already apply Machine Learning techniques to automate HR processes and hire the right employees, predict demand to avoid inventory problems, perform predictive maintenance to avoid production loss, detect fraud in a timely manner, and predict energy savings, among many other real-world Machine Learning applications.

BigML and GoHub will launch a full program of Machine Learning activities and events that will focus on providing the best quality content for institutions, big corporations, small and medium-sized companies, and startups. To achieve this goal, there will be events in Valencia, across Spain, and in several countries to be announced shortly. These events will explain the impact that Machine Learning is having and can have on businesses in industry, finance, marketing, Human Resources, security, and more, all presented by companies that are already working with this technology. Additionally, there will be Machine Learning workshops to help ease the learning curve of those companies that wish to learn and implement Machine Learning.

Moreover, this partnership will encourage the creation of new synergies between products and services from both parties, GoHub and BigML. An example of this is the IoT Industrial platform Nexus Integra, which is already creating its own predictive application with BigML.

Francisco Martin, BigML’s CEO highlights: “I’m very happy that in Valencia there are companies like Global Omnium that champion and work hard for these disruptive technologies to succeed, allowing local talent to produce high-quality exportable technology bringing wealth to the city. With that, young talent won’t have the urge to emigrate to other countries in order to find a good job (as I did some time ago), which is a very positive development.”

Patricia Pastor, GoHub Director says: “Innovation ecosystems, especially when there is a collaboration between startups and big corporations, are a great engine of growth and competitiveness. These kinds of partnerships are key to accelerating the potential of such ecosystems. Having BigML as a partner will allow the Valencian ecosystem to reach a higher level when it comes to disruptive technologies.”

Programming Linear Regressions

In this fourth post of our series, we want to provide a brief summary of all the necessary steps to create a Linear Regression using the BigML API. As mentioned in our earlier posts, Linear Regression is a supervised learning method to solve regression problems, i.e., the objective field must be numeric.

The API workflow to create a Linear Regression and use it to make predictions is very similar to the one we explained for the Dashboard in our previous post. It’s worth mentioning that any resource created with the API will automatically be created in your Dashboard too so you can take advantage of BigML’s intuitive visualizations at any time.


If you have never used the BigML API before, note that all requests to manage your resources must use HTTPS and be authenticated using your username and API key to verify your identity. The base URL used to manage Linear Regressions carries these credentials in its query string: username=$BIGML_USERNAME;api_key=$BIGML_API_KEY

You can find your authentication details in your Dashboard account by clicking on the API Key icon in the top menu.


The first step in any BigML workflow using the API is setting up authentication. Once authentication is successfully set up, you can begin executing the rest of this workflow.

export BIGML_USERNAME=nickwilson
export BIGML_API_KEY=98ftd66e7f089af7201db795f46d8956b714268a
export BIGML_AUTH="username=$BIGML_USERNAME;api_key=$BIGML_API_KEY;"

1. Upload Your Data

You can upload your data in your preferred format, from a local file, a remote file (using a URL) or from your cloud repository e.g., AWS, Azure etc. This will automatically create a source in your BigML account.

First, you need to open up a terminal with curl or any other command-line tool that implements standard HTTPS methods. In the example below, we are creating a source from a local CSV file containing some house data listed in Airbnb, each row representing one house’s information.

curl "$BIGML_AUTH" -F file=@airbnb.csv

2. Create a Dataset

After the source is created, you need to build a dataset, which serializes your data and transforms it into a suitable input for the Machine Learning algorithm.

curl "$BIGML_AUTH" \
       -X POST \
       -H 'content-type: application/json' \
       -d '{"source":"source/5c7631694e17272d410007aa"}'
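The same kind of request can be assembled in Python with the standard library. This is only a sketch under stated assumptions: the host https://bigml.io and the "dataset" resource path are not shown in the text above and are assumed here, and build_create_request is a hypothetical helper name. The request is built but deliberately not sent.

```python
import json
import os
import urllib.request

BASE_URL = "https://bigml.io"  # assumption: the host is not shown in the text


def build_create_request(resource, payload, environ=os.environ):
    """Build (but do not send) an authenticated JSON POST request."""
    auth = "username={};api_key={};".format(
        environ["BIGML_USERNAME"], environ["BIGML_API_KEY"]
    )
    url = "{}/{}?{}".format(BASE_URL, resource, auth)
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# Placeholder credentials so the example runs without environment variables
demo_env = {"BIGML_USERNAME": "alice", "BIGML_API_KEY": "example-key"}
request = build_create_request(
    "dataset", {"source": "source/5c7631694e17272d410007aa"}, environ=demo_env
)
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```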

Then, split your recently created dataset into two subsets: one for training the model and another for testing it. It is essential to evaluate your model with data that the model hasn’t seen before. You need to do this in two separate API calls that create two different datasets.

  • To create the training dataset, you need the original dataset ID and the sample_rate  (the proportion of instances to include in the sample) as arguments. In the example below, we are including 80% of the instances in our training dataset. We also set a particular seed argument to ensure that the sampling will be deterministic. This will ensure that the instances selected in the training dataset will never be part of the test dataset created with the same sampling hold out.

curl "$BIGML_AUTH" \
       -X POST \
       -H 'content-type: application/json' \
       -d '{"origin_dataset":"dataset/5c762fcd4e17272d4100072d",
            "sample_rate":0.8, "seed":"myairbnb"}'

  • For the testing dataset, you also need the original dataset ID and the sample_rate, but this time we combine it with the out_of_bag argument. The out of bag takes the (1- sample_rate) instances, in this case, 1-0.8=0.2. Using those two arguments along with the same seed used to create the training dataset, we ensure that the training and testing datasets are mutually exclusive.

curl "$BIGML_AUTH" \
       -X POST \
       -H 'content-type: application/json' \
       -d '{"origin_dataset":"dataset/5c762fcd4e17272d4100072d",
            "sample_rate":0.8, "out_of_bag":true, "seed":"myairbnb"}'

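Why the same seed plus out_of_bag yields complementary datasets can be shown with a small sketch. The hashing scheme below is an assumption chosen for illustration and is not BigML's actual sampling algorithm. The point is only that a seeded, deterministic decision per row makes the two samples exact complements.

```python
import hashlib


def in_sample(row_index, seed, sample_rate):
    """Deterministically decide whether a row belongs to the sample."""
    digest = hashlib.sha256("{}:{}".format(seed, row_index).encode("utf-8")).digest()
    fraction = int.from_bytes(digest[:8], "big") / 2 ** 64  # value in [0, 1)
    return fraction < sample_rate


def split(n_rows, seed, sample_rate, out_of_bag=False):
    """Return the row indices selected by the sampling arguments."""
    return [i for i in range(n_rows) if in_sample(i, seed, sample_rate) != out_of_bag]


train_rows = split(1000, "myairbnb", 0.8)
test_rows = split(1000, "myairbnb", 0.8, out_of_bag=True)
print(len(train_rows), len(test_rows))
```

With a different seed for the second call, the two samples would generally overlap, which is why both requests reuse the same seed.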
3. Create a Linear Regression

Next, use your training dataset to create a Linear Regression. Remember that the field you want to predict must be numeric. BigML takes the last numerical field in your dataset as the objective field by default unless it is specified. In the example below, we are creating a Linear Regression including an argument to indicate the objective field. To specify the objective field you can either use the field name or the field ID:

curl "$BIGML_AUTH"
       -X POST
       -H 'content-type: application/json'
       -d '{"dataset":"dataset/68b5627b3c1920186f000325", 

You can also configure a wide range of the Linear Regression parameters at creation time. Read about all of them in the API documentation.

Usually, Linear Regressions can only handle numeric fields as inputs, but BigML automatically performs a set of transformations such that it can also support categorical, text and items input fields. Keep in mind that BigML uses dummy encoding by default, but you can configure other types of transformations using the different encoding options provided.

4. Evaluate the Linear Regression

Evaluating your Linear Regression is key to measuring its predictive performance against unseen data.

You need the linear regression ID and the testing dataset ID as arguments to create an evaluation using the API:

curl "$BIGML_AUTH"
       -X POST
       -H 'content-type: application/json'
       -d '{"linearregression":"linearregression/5c762c6b4e17272d42000617",

5. Make Predictions

Finally, once you are satisfied with your model’s performance, use your Linear Regression to make predictions by feeding it new data. Linear Regression in BigML can gracefully handle missing values for your categorical, text or items fields.

In BigML you can make predictions for a single instance or multiple instances (in batch). See below an example for each case.

To predict one new data point, just input the values for the fields used by the Linear Regression to make your prediction. In turn, you get a prediction result for your objective field along with confidence and prediction intervals.

curl "$BIGML_AUTH" \
       -X POST \
       -H 'content-type: application/json' \
       -d '{"linearregression":"linearregression/5c762c6b4e17272d42000617",
            "input_data":{"room":4, "bathroom":2, ...}}'

To make predictions for multiple instances simultaneously, use the Linear Regression ID and the new dataset ID containing the observations you want to predict.

curl "$BIGML_AUTH"
       -X POST
       -H 'content-type: application/json'
       -d '{"linearregression":"linearregression/5c762c6b4e17272d42000617",
            "output_dataset": true}'

If you want to learn more about Linear Regression, please visit our release page for documentation on how to use Linear Regression with the BigML Dashboard and the BigML API. If you still have questions, be sure to reach out to us at any time!


Linear Regression in a few Clicks with the BigML Dashboard

This is the third post of our Linear Regression series. BigML is bringing Linear Regression to the Dashboard so that you can solve regression problems with the help of powerful visualizations to inspect and analyze your results. Linear Regression is not only one of the best-known but also one of the best-understood supervised learning algorithms. It has its roots in statistics but gets utilized in machine learning quite a bit as well.

In this post we would like to walk you through the common steps to get started with Linear Regression:

1. Uploading your Data

As usual, start by uploading your data to your BigML account. BigML offers several ways to do it: you can drag and drop a local file, connect BigML to your cloud repository (e.g., S3 buckets) or copy and paste a URL. BigML automatically identifies the field types. Field types and other source parameters can alternatively be configured by clicking on the source configuration option.

2. Create a Dataset

From your source view, use the 1-click dataset option to create a dataset, a structured version of your data ready to be used by a Machine Learning algorithm.


In the dataset view, you will be able to see a summary of your field values, univariate statistics, and the field histograms to analyze your data distributions. This view is really useful to see any errors or irregularities in your data. You can also filter the dataset by several criteria and create new fields using different pre-defined operations as needed.


Once your data is clean and free of errors you can split your dataset into two different subsets: one for training your model, and the other for testing. It is crucial to train and evaluate your model with different data to ensure it generalizes well against unseen data. You can easily split your dataset using the BigML 1-click option, which randomly sets aside 80% of the instances for training and 20% for testing.


3. Create a Linear Regression

Now you are ready to create the Linear Regression using your training dataset. You can use the 1-click Linear Regression option, which will create the model using the default parameter values. However, if you are a more advanced user and you feel comfortable tuning the Linear Regression parameters, you can do so by using the configure Linear Regression option.

The list below gives a brief summary of each of the configuration parameters. If you want to learn more about them please check the Linear Regression documentation.

  • Objective field: select the field you want to predict. By default, BigML will take the last valid field in your dataset. Remember it must be numeric!
  • Default numeric value: if your numeric fields contain missing values, you can easily replace them with the field mean, median, maximum, minimum or zero using this option. It is inactive by default.
  • Weight field: set instance weights using the values of the given field. The value in the weight field specifies the number of times that row should be replicated when including it in the model’s training set.
  • Bias: include or exclude the intercept in the Linear Regression formula. Including it yields better results in most cases. It is active by default.
  • Field codings: select the encoding option that works best for your categorical fields. BigML will automatically transform your categorical values into 0-1 variables to support non-numeric fields as inputs, which is a method known as dummy encoding. Alternatively, you can choose from two other types of codings: contrast coding or other coding. You can find a detailed explanation of each one in the documentation.
  • Sampling options: if you have a very large dataset, you may not need all the instances to create the model. BigML allows you to easily sample your dataset at the model creation time.
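The weight field's "replication" interpretation can be checked on a toy example: fitting a weighted line gives the same coefficients as fitting an unweighted line on data in which each row is repeated according to its weight. The sketch below uses the single-input case and made-up numbers; it is not BigML's implementation.

```python
def fit_weighted_line(xs, ys, weights):
    """Weighted least squares for y = beta0 + beta1 * x."""
    total = sum(weights)
    mean_x = sum(w * x for w, x in zip(weights, xs)) / total
    mean_y = sum(w * y for w, y in zip(weights, ys)) / total
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, xs))
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(weights, xs, ys))
    beta1 = sxy / sxx
    beta0 = mean_y - beta1 * mean_x
    return beta0, beta1


# The middle row carries weight 2 ...
weighted = fit_weighted_line([0, 1, 2], [0, 3, 2], [1, 2, 1])
# ... which matches repeating that row twice with unit weights
replicated = fit_weighted_line([0, 1, 1, 2], [0, 3, 3, 2], [1, 1, 1, 1])
print(weighted, replicated)
```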

In terms of performance, the focus for Linear Regression is whether to include or exclude the bias term. All other parameters also depend on the data, the domain and the use case you are trying to solve. It’s natural that you want to understand the strengths and weaknesses of your model and iterate trying different features and configurations. To do this, the model visualizations explained in the next point would be very helpful.

4. Analyze your Results

When your Linear Regression has been created you can use BigML’s insightful visualizations to dive into the model results and see the impact of your features on model predictions.

BigML provides a 1D chart, a partial dependence plot (PDP) and a coefficient table to analyze your results.

1D Chart and PDP

Both 1D chart and PDP provide visual ways to analyze the impact of one or more fields on predictions.

For the 1D chart, you can select one numeric input field for the x-axis. In the prediction legend to the right, you will see the objective field predictions as you mouse over the chart area. The chart can also show the 95% prediction interval band in blue. This means that, for any given point on the x-axis, its y value will be within this blue range with 95% probability. You can choose to show or hide the interval band.


For the PDP, you can select two input fields, either numeric or categorical, one per axis and the objective field predictions will be plotted in the color heat map chart.


By setting the values for the rest of the input fields using the form below the prediction legend, you will be able to inspect the combined interaction of multiple fields on predictions.
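Conceptually, the heat map is a grid of predictions in which two fields vary and the remaining fields stay fixed. The sketch below illustrates that idea only; the toy model, its coefficients, and the "area" field are hypothetical and not taken from a real BigML model.

```python
def partial_dependence_grid(predict, fixed_inputs, field_x, values_x, field_y, values_y):
    """Return rows of predictions: one row per y value, one column per x value."""
    grid = []
    for value_y in values_y:
        row = []
        for value_x in values_x:
            inputs = dict(fixed_inputs)  # remaining fields stay fixed
            inputs[field_x] = value_x
            inputs[field_y] = value_y
            row.append(predict(inputs))
        grid.append(row)
    return grid


def toy_model(inputs):
    # Hypothetical coefficients, for illustration only
    return 50.0 + 20.0 * inputs["room"] + 10.0 * inputs["bathroom"] + 0.5 * inputs["area"]


grid = partial_dependence_grid(
    toy_model, {"area": 80.0}, "room", [1, 2, 3], "bathroom", [1, 2]
)
for row in grid:
    print(row)
```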

Coefficients table

BigML also provides a table to display the coefficients learned by the Linear Regression. A positive coefficient indicates a positive correlation between the input field and the objective field, while a negative coefficient indicates a negative relationship.


5. Evaluate the Linear Regression

Like any supervised learning method, Linear Regression needs to be evaluated. Just click on the evaluate option in the 1-click menu and BigML will automatically select the remaining 20% of the dataset that you set aside for testing.


The resulting performance metrics to be analyzed are the same ones as for any other regression models predicting a continuous value.

You will get three regression measures in the green boxed histograms: Mean Absolute Error, Mean Squared Error and R Squared. By default, BigML also provides the measures of two other types of models to compare against your model performance. One of them uses the mean as its prediction and the other predicts a random value in the range of the objective field. At the very least, you would expect your model to outperform these weaker benchmarks. You can choose to hide either or both of the benchmarks.
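The three measures have standard definitions, sketched below with made-up numbers. For simplicity the mean benchmark here predicts the mean of the test values themselves, which is an assumption of this example and may differ from how BigML builds its benchmark.

```python
def regression_metrics(actual, predicted):
    """Return Mean Absolute Error, Mean Squared Error and R Squared."""
    n = len(actual)
    mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / n
    residual_ss = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    mean_actual = sum(actual) / n
    total_ss = sum((a - mean_actual) ** 2 for a in actual)
    return {
        "mean_absolute_error": mae,
        "mean_squared_error": residual_ss / n,
        "r_squared": 1 - residual_ss / total_ss,
    }


actual = [3.0, 5.0, 7.0, 9.0]
model_predictions = [2.5, 5.5, 7.0, 10.0]
mean_benchmark = [sum(actual) / len(actual)] * len(actual)

print(regression_metrics(actual, model_predictions))
print(regression_metrics(actual, mean_benchmark))
```

A model that always predicts the mean scores an R Squared of zero under this definition, which is why it serves as a floor to beat.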


For a full description of the regression measures see the corresponding documentation.

6. Make Predictions

In BigML, you can make predictions for a new single instance or multiple instances in batches.

Single predictions

Click on the Predict option and set the values for your input fields.


A form containing all your input fields will be displayed and you will be able to set the values for a new instance. At the top of the view, you will see the objective field prediction changing as you change your input field values.


Batch predictions

Use the Batch Prediction option in the 1-click menu and select the dataset containing the instances for which you want to know the objective field value.


You can configure several parameters of your batch prediction such as the option to include both confidence interval and prediction interval in the batch prediction output dataset and file. When your batch prediction finishes you will be able to download the CSV file and see the output dataset.


If you want to learn more about Linear Regression please visit our release page for documentation on how to use Linear Regression with the BigML Dashboard and the BigML API.


Bigger Results from Smaller Data with Linear Regression

In this second post of our series, we’ll cover a use case example. Least squares linear regression is one of the canonical algorithms in the statistical literature. Part of the reason for this is that it’s very good as a pedagogical tool. It’s very easy to visualize, especially in two dimensions: a line going through a set of points, with the distance from the line to each point representing the error of the model. Kind people have even created nice animations to help you:

GeoGebra Linear Regression

And herein, we have machine learning itself in a nutshell. We have the training data (the points), the model (the line), the objective (the distances from point to line) and the process of optimizing the model against the data (changing the parameters of the line so that those distances are small).

It’s all very tidy and relatively easy to understand . . . and then comes the day of the Laodiceans, when you realize that not every function is a linear combination of the input variables, and you must learn about gradient trees and deep neural networks, and your innocence is shattered forever.

Part of the reason to prefer these more complex classifiers to simpler ones like linear regression (and its sister technique for classification, logistic regression) is that they are often generalizations of the simpler techniques. You can, in fact, view linear regression as a certain type of neural network: specifically, one with no hidden layers, no non-linear activation functions, and none of the fancy things like convolution and recurrence.

So, then, what’s the use in turning back to linear regression, if we already have other techniques that do the same and more besides?  Answers to this question often come in two flavors:

  1. You need speed.  Fitting a neural network can take a long time whereas fitting a linear regression is near-instantaneous even for medium-sized datasets. Similarly for prediction:  A simple linear model will, in general, be orders of magnitude faster to predict than a deep neural network of even moderate complexity. Faster fits let you iterate more and focus on feature engineering.
  2. You have small training data.  When you don’t have much training data, overfitting becomes more of a concern; complex classifiers may fit the data a bit better, but if you have small training data, your test sets are very small and so your estimates of goodness of fit become unreliable. Using models like linear regression reduces your risk of overfitting simply by giving you fewer parameters to fit.

We’re going to focus on the second case for the rest of this blog post. One way of looking at this is the classic view in machine learning theory that the more parameters your model has, the more data you need to fit those properly.  This is a good and useful view. However, I find it just as useful to think about this from the opposite direction: We can use restrictive modeling assumptions as a sort of “stand in” for the training data we don’t have.

Consider the problem of using a decision tree to fit a line. We’ll usually end up with a sort of “staircase approximation” to the line. The more data we have, the tighter the staircase will fit the line, but we can’t escape the fact that each “step” in the staircase requires us to have at least one data point sitting on it, and we’ll never get a perfect fit.

This is unfortunate, but the upside is loads of flexibility. Decision trees don’t care a lick whether the underlying objective is linear or not; you can do the same sort of staircase approximation to fit any function at all.

Using linear regression allows us to sacrifice flexibility to get a better fit from less data.  Consider again the same line. How many points does it take from that line for linear regression to get a perfect fit?  Two. The minimum error line is the one and only line that travels through both points, which is precisely the line you’re looking for. No, we can’t fit all or even most functions with linear regressions, but if we restrict ourselves to lines, we can find the best fit with very little data.
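The contrast between the staircase and the line can be sketched in a few lines of plain Python. Everything here is illustrative: the "true" line y = 2x + 1 is made up, and the staircase predictor is a deliberately crude stand-in for a decision tree (it just returns the objective value of the nearest training point), not how a real tree is grown.

```python
def fit_line(xs, ys):
    # Ordinary least squares for a single input: returns (intercept, slope).
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def staircase_predict(x, xs, ys):
    # Piecewise-constant prediction: the value of the nearest training point.
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x))
    return ys[nearest]

def true_line(x):
    # Hypothetical underlying relationship used to generate the data.
    return 2.0 * x + 1.0

# Only two training points taken from the line.
train_x = [0.0, 4.0]
train_y = [true_line(x) for x in train_x]

intercept, slope = fit_line(train_x, train_y)

# Compare both predictors at a point that was not in the training data.
new_x = 1.0
line_prediction = intercept + slope * new_x
step_prediction = staircase_predict(new_x, train_x, train_y)

print("true value:", true_line(new_x))
print("linear regression:", line_prediction)
print("staircase:", step_prediction)
```

With two points the fitted line recovers the slope and intercept exactly, while the staircase can only repeat a value it has already seen.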

Linear Regression: More Power

Some of you may find the reasoning implied above to be a bit circular: “You can learn a very good model using linear regression, provided that you know in advance that a line is a good fit to the data.” It’s a fair point, but it can be surprising how often you find that this logic applies. It’s not odd to have a set of features, where changes in those features induce directly proportional changes in the objective, simply because those are the sorts of features amenable to machine learning in general. And in fact, these sorts of relationships abound, especially in the sciences, where linear and quadratic equations go much of the way towards predicting what happens in the natural world.

As an example, here’s a dataset of buildings that has measurements of the roof surface area and wall surface area for all of them, and the heat loss in BTUs for the building. Physics tells us that heat loss is proportional to these areas, and you break them out into roof and wall surface areas because those things are insulated differently. The dataset also has only 12 buildings, so we’ll use nine for training and three for test. Is it possible to get a reasonable model using so little data?


If we try a vanilla ensemble of 10 trees, we get an r-squared on the holdout set of 0.85. This isn’t bad, all things considered! Again, we’ve only got nine training points, so something so well-correlated to the objective is pretty impressive.

Now, let’s see if we can do better by making linear assumptions. After all, we said at the top that heat loss is, in fact, proportional to the given surface areas. Lo and behold, linear models serve us well: We are able to recover the “true” model for heat loss through a surface near-exactly.


One caveat here is that you’re only evaluating on three points, so it’s hard to know if the performance difference we see is significant. We might want to try cross-validation to see if the results continue to hold. However, Occam’s razor implores us to choose the simpler model even if their performances are equal, and prediction will be faster to boot.

Old Wine in New Bottles

Yes, linear regression is somewhat old-fashioned, and in this day and age where datasets are getting larger all the time, the use cases aren’t as many as they used to be.  We make a mistake, though, to equate “fewer” with “none”.  When you’ve got small data and linear phenomena, linear regression is still queen of the castle.

Introduction to Linear Regression

BigML’s upcoming release on Thursday, March 21, 2019, will be presenting our latest resource to the platform: Linear Regressions. In this post, we’ll do a quick introduction to General Linear Models before we move on to the remainder of our series of 6 blog posts (including this one) to give you a detailed perspective of what’s behind the new capabilities. Today’s post explains the basic concepts and will be followed by an example use case. Then, there will be three more blog posts focused on how to use Linear Regression through the BigML Dashboard, API, and WhizzML for automation. Finally, we will complete this series of posts with a technical view of how Linear Regressions work behind the scenes.


Understanding Linear Regressions

Linear Regression is a supervised Machine Learning technique that can be used to solve, you guessed it, regression problems. Learning a linear regression model involves estimating the coefficient values for the independent input fields that, together with the intercept (or bias), determine the value of the target or objective field. A positive coefficient (b_i > 0) indicates a positive correlation between the input field and the objective field, while a negative coefficient (b_i < 0) indicates a negative correlation. Fields with higher absolute coefficient values can be interpreted as having a greater influence on final predictions.
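As a small illustration of how the intercept and the signs of the coefficients shape a prediction, consider the sketch below. The field names, coefficient values, and the predict helper are all hypothetical; they are chosen only to show the arithmetic.

```python
def predict(intercept, coefficients, inputs):
    # Prediction = intercept + sum of (coefficient * input value).
    return intercept + sum(coefficients[name] * value
                           for name, value in inputs.items())

# Hypothetical fitted model: one positive and one negative coefficient.
intercept = 10.0
coefficients = {"roof_area": 3.0, "insulation_thickness": -2.0}

base = {"roof_area": 5.0, "insulation_thickness": 2.0}
more_roof = {"roof_area": 6.0, "insulation_thickness": 2.0}
more_insulation = {"roof_area": 5.0, "insulation_thickness": 3.0}

base_prediction = predict(intercept, coefficients, base)
roof_prediction = predict(intercept, coefficients, more_roof)
insulation_prediction = predict(intercept, coefficients, more_insulation)

# Raising an input with a positive coefficient raises the prediction;
# raising an input with a negative coefficient lowers it.
print(base_prediction, roof_prediction, insulation_prediction)
```

Increasing an input by one unit moves the prediction by exactly that input's coefficient, which is what makes the model easy to read.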

By definition, the input fields (x_1, x_2, …, x_n) in the linear regression formula need to be numeric values. However, BigML linear regressions can support any type of fields by applying a set of transformations to categorical, text, and items fields. Moreover, BigML can also handle missing values for any type of field.

It’s perhaps fair to say linear regression is the granddaddy of statistical techniques and required reading for any Machine Learning 101 student, as it’s considered a fundamental supervised learning technique. Its strength is in its simplicity, which also implies it is pretty easy to interpret vs. most other algorithms. As such, it makes for a nice quick and dirty baseline regression model, similar to Logistic Regression for classification problems. However, it is also important to grasp those situations where linear regression may not be the best fit, despite its simplicity and explainability:

  • It works best when the features involved are independent or, to put it another way, less correlated with one another.
  • The method is also known to be fairly sensitive to outliers.  A single data point far from the mean values can end up significantly affecting the slope of your regression line, in turn hurting the model’s chances to generalize well come prediction time.
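The sensitivity to outliers is easy to reproduce. The following sketch fits a least squares line to ten made-up points that lie exactly on y = x, then corrupts a single point and fits again; the data and the size of the corruption are arbitrary choices for illustration.

```python
def fit_line(xs, ys):
    # Ordinary least squares for a single input: returns (intercept, slope).
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Ten clean points on the line y = x.
xs = [float(i) for i in range(10)]
clean_ys = list(xs)

# Same data, but the last point is replaced by a single extreme value.
outlier_ys = list(clean_ys)
outlier_ys[-1] = 60.0

_, clean_slope = fit_line(xs, clean_ys)
_, outlier_slope = fit_line(xs, outlier_ys)

print("slope without outlier:", clean_slope)
print("slope with one outlier:", outlier_slope)
```

One bad point out of ten is enough to more than triple the slope here, because squared residuals give distant points a disproportionate pull on the fit.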

Of course, using the standardized Machine Learning resources on the BigML platform, you can mitigate these issues and get more mileage from the subsequent iterations of your Linear Regressions. For instance,

  • If you have many columns in your dataset (aka a wide dataset) you can use Principal component analysis (PCA) to transform such a dataset in order to obtain uncorrelated features.
  • Or, by using BigML Anomalies, you can easily identify and remove the few outliers skewing your linear regression to arrive at a more acceptable regression line.

Here’s where you can find the Linear Regression capability on the BigML Dashboard: Linear Regression on Dashboard

Want to know more about Linear Regressions?

If you would like to learn more about Linear Regressions and find out how to apply them via the BigML Dashboard, API, or WhizzML, please stay tuned for the rest of this series of blog posts to be published in the next week.

Linear Regression Joins the Suite of Supervised Methods on BigML

The latest BigML release brings a tried and true Machine Learning algorithm to the platform: Linear Regression. We intend to make it generally available on Thursday, March 21, 2019. This simple technique is well understood and widely used across industries. As such, it has been a frequently requested algorithm by our customers and we are happy to add it to our collection of supervised learning methods.

As the name implies, this algorithm assumes a linear relationship between the input fields and the output (objective) field, which enables you to discover relationships between quantitative, continuous variables. Since BigML has advanced data transformation capabilities, our implementation of linear regression can support any type of field including categorical, text, and items fields, and can even handle missing values. To give a sense of how Linear Regression is applied out in the real world, it’s often used to analyze product performance, conduct market research, perform sales forecasting, and make stock market predictions, among many other use cases.

One of the main benefits of Linear Regression is its simplicity, which affords a high level of interpretability. This makes it a good technique for doing quick tests and model iterations to establish a baseline to solve regression problems. Like any other technique, there are tradeoffs, so there will be circumstances where Linear Regression is not a suitable model for your use case. We will explain some of those considerations in more detail in our subsequent posts.

As usual, this release comes with a series of blog posts that progressively explain Linear Regression through a real use case and brief tutorials on how to apply it via the BigML Dashboard, API, WhizzML and bindings. While we will not be having a live webinar for this release, feel free to contact us with any questions or feedback as always.

Want to know more about Linear Regression?

If you are curious to learn more about how to apply Linear Regression using the BigML platform, please stay tuned for the rest of this series of blog posts to be published over the next week.

Seville becomes the capital of innovation with the first Machine Learning School in Andalusia

184 decision makers, analysts, domain experts, and entrepreneurs coming from all around the world gathered on March 7 and 8 at EOI Andalucía to join the first edition of our Machine Learning School held in Seville, Spain (#MLSEV). The event was co-organized by EOI and BigML, in collaboration with the Andalusian Government and Seville City Council, and sponsored by La Caseta, qosIT Consulting, and ITlligent.

Attendees came from 13 countries (Andorra, Brazil, China, Denmark, India, Ireland, Italy, Lebanon, the Netherlands, Portugal, Spain, United Kingdom, and United States) to enjoy the two-day event that offered several master classes along with workshops to put into practice the concepts learned in them. Also presented were eight real-world use cases showing how big and small organizations, such as Rabobank, TDK, T2Client, Talento Corporativo, SlicingDice, Jidoka, Good Rebels, and AlterWork, are already applying Machine Learning in areas like banking, industry, marketing, and the legal sector, among others.

165 attendees represented 92 private companies and big corporations, which highlights that companies are ready to adopt Machine Learning to work more efficiently. The remaining 19 attendees represented 10 universities and other educational institutions. As usual, international networking was an important advantage for attendees, who also had the chance to discuss their Machine Learning projects with the BigML Team at the Genius Bar.

During the opening and closing remarks, we were honored to have the collaboration of the Government of Andalusia. Manuel Ortigosa, Secretary-General of companies, innovation, and entrepreneurship; and Manuel Alejandro Hidalgo, Secretary-General of Economy, presented MLSEV as a great opportunity for the region to bring new ideas and innovative companies to Andalusia in order to help tech business grow in the south of Spain. The Government of Spain was also represented by Raúl Blanco, Secretary-General of Industry and small and middle size companies in the Ministry of Industry, Commerce, and Tourism. Additionally, Francisco Velasco, Director of EOI Business School Andalusia, also opened and closed the event emphasizing the importance of celebrating such a Machine Learning crash course at EOI Andalusia.

Juan Ignacio de Arcos, MLSEV Chairman, Business Analytics Executive Programme’s Director at EOI, and BigML Strategic Advisor, closed the event with a special mention to the companies attending the event, who realized applying Machine Learning is a strategic decision to be ahead of their competitors. 

For more information about the program, speakers, and other details, please visit the event page here. Or check the event photos here. Stay tuned for more Machine Learning event announcements, as there are more editions to come in Seville and other cities worldwide!

2019 Oscars Predictions: Results Are In

The 91st Academy Awards this Sunday, the first without a host in 30 years, proceeded without a hitch and seemed to sit well with the worldwide audience. For the third year in a row, we applied the BigML Machine Learning platform to predict the winners. This year, we got 4 out of 8 right for the major award categories. While this may seem mediocre, it’s notable that the confidence scores for the most likely nominee to win for 3 out of the 8 categories were well below 50%, meaning those were virtual coin toss type categories with multiple weak favorites going up against each other. Lo and behold, we whiffed on all three weak favorites: Best Picture, Best Supporting Actress and Best Original Screenplay.

2019 Oscars Predictions Results BigML

At this stage, we can merely speculate the reasons behind the Academy members’ votes, but we can peek behind the curtain to understand how our Machine Learning models made their predictions. So, let’s dive in! Our results are shown in the table below. For two of the missed categories, the actual winners were our second choice, and Green Book, the winner of Best Picture was a close tie as our number 3 pick.

BigML Oscars 2019 Predictions results

This year we relied on two new tools added to our toolbox that can be game-changers when it comes to improving accuracy and saving time in your Machine Learning (ML) workflows. The first method involved OptiML (an optimization process for model selection and parameterization) which is both robust and incredibly easy-to-use on BigML. Once we had collected and prepared the datasets, which is often the most challenging part of any ML project, all we had to do was hover over the “cloud action” menu and click OptiML. Really, that’s it!

BigML OptiML 1-Click

After running for about an hour, OptiML returned a list of top models for us to inspect further and apply our domain knowledge to. In that relatively short amount of time, OptiML processed over 700 MB of data, created nearly 800 resources and evaluated almost 400 models. How about that?!

BigML OptiML Best Director results

Next, we took the list of selected models (the top performing 50% out of the total model candidates from OptiML) and built a Fusion, which combines multiple supervised learning models and aggregates their predictions. The idea behind this technique is to balance out the individual weaknesses of single models, which can lead to better performance than any one method in particular (but not always, see this post for more details). The screenshot below shows the Fusion model for the Best Director category, which comprised 13 decision trees, 45 ensembles, 41 logistic regressions and 2 deepnets. The combined predictions of all those models contributed to our pick of Alfonso Cuarón, director of Roma, to take home the prize.

BigML Fusion for Best Director Oscars 2019

Have we really done the best Machine Learning can do? Is there a reason to believe that OptiML may not have found the best solution to this problem? My colleague, Charles Parker, BigML’s VP of Machine Learning Algorithms, chimes in with an explanation of how things get a little hazy here: Remember, OptiML is essentially doing model selection by estimating performance on multiple held out samples of the data. Since our Oscar data only goes back about 20 years, the number of positive examples in each held out test set is just a fraction of those 20 or so examples. Our estimation of the performance of each model in the OptiML will then be driven primarily by just a tiny number of movies. Indeed, if we mouse over the standard deviation icon next to the model’s performance estimate in the OptiML (see screenshot below), we’ll see that the standard deviation of the estimate is so large that the performance numbers of nearly all of the models returned are within one standard deviation of the top model’s performance.

BigML Fusion for Best Picture Oscars prediction

What does this mean?  For one thing, it means that you don’t have enough data to test these models thoroughly enough to tell them apart. Thankfully, OptiML does enough training-testing splits to show us this, so we don’t make the mistake of thinking that the very best model is meaningfully better than most other models in the list.  

Unfortunately, this is a mistake that is made all too often by people using Machine Learning in the wild. There are many, many cases in which, if you try enough models, you’ll get very good results on a single training and testing split, or even a single run of cross-validation. This is a version of the multiple comparisons problem; if you try enough things, you’re bound to find something that “works” on your single split just by random chance, but won’t do well on real world data. As you try more and more things, the tests you should use to determine whether one thing is “really better” than another need to be stricter and stricter, or you risk falling into one of these random chances, a form of overfitting.
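The multiple comparisons effect can be simulated in a few lines. In the sketch below every "model" is just a coin flip, so none of them has any real skill; the test set size and the number of models are arbitrary choices, only loosely echoing the scale discussed above, and none of this reflects the actual Oscars data or OptiML's internals.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable

n_test = 20      # a small test set
n_models = 400   # number of candidate "models" tried

# Hypothetical binary labels for the test set.
labels = [rng.randint(0, 1) for _ in range(n_test)]

def random_model_accuracy(labels, rng):
    # A "model" with no skill at all: it guesses each label by coin flip.
    guesses = [rng.randint(0, 1) for _ in labels]
    correct = sum(g == y for g, y in zip(guesses, labels))
    return correct / len(labels)

accuracies = [random_model_accuracy(labels, rng) for _ in range(n_models)]

best_accuracy = max(accuracies)
average_accuracy = sum(accuracies) / n_models

print("average accuracy:", average_accuracy)
print("best accuracy on this single split:", best_accuracy)
```

On average the guessers score about 50%, as expected, yet the best of them looks far better than chance on this one small split. Picking that "winner" and expecting the same score on new data is exactly the trap described above.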

In OptiML’s case, the easiest and most robust way to get a stricter test is to seek out more testing data. But we can’t time travel (yet!), and so we’re stuck with the data we have. The upshot of all of this is that, yes, there may very well be a better model out there, but with the data that we have, it will be difficult to say for sure that we’ve arrived at something clearly better than everything that OptiML has tried.

As it turned out, BigML was not alone in missing the mark for the top category predictions. DataRobot was counting on Roma to win Best Picture, and Green Book was not in their top three. Microsoft Bing and TIME also put their bets on Roma, so it goes to show you the reality of algorithmic predictions being tested in real world scenarios where patterns and “rules” don’t always apply.

Alright, alright, enough of the serious talk. As pioneers of MLaaS here at BigML, we care deeply about these matters concerning the quality and application of ML-powered findings so we couldn’t pass this chance to discuss. But back to red carpet results…we enjoyed the challenge of once again putting our ML platform to the test of predicting the most prestigious award show in the entertainment industry. To all users who experimented with making their own models to predict the Oscars, let us know how your results came out on Twitter @bigmlcom or shoot us a note at
