
New Features for the BigML Predict App for Zapier


Thanks to the feedback provided by early adopters, our BigML app for Zapier has been acquiring new useful features, including improved support for additional ML algorithms and dynamic resource selection.

Support for new ML algorithms

The first version of our BigML Predict app only included support for models, ensembles, and logistic regressions. Now you can also use clusters and anomaly detectors for your predictions, and even execute a WhizzML script, which is great for automating more advanced use cases.

When you try to add an action from the BigML Predict app, you will now see a longer list of choices as shown below:

[Image: the expanded list of BigML Predict actions in Zapier]

  • Create Prediction (legacy): if you haven’t used the BigML Predict app before, you can safely ignore the “Create Prediction (legacy)” action.
  • Execute WhizzML script: this action allows you to run a WhizzML script with a given set of arguments. Due to the way Zapier requires users to specify input fields, you will only be able to run WhizzML scripts that take “scalar” arguments.
  • Create Anomaly Score: computes the anomaly score associated with a data instance by using an anomaly detector.
  • Create Centroid: identifies the cluster that is closest to your input data instance.
  • Create Ensemble Prediction: uses an ensemble to make a prediction.
  • Create Prediction: uses a model or logistic regression to make a prediction.

When you add one of the actions listed above to your workflow, you will be given the chance to specify a few input fields:

  • the Resource ID: a simple value of the form resource-type/resource-id, e.g., ensemble/123456. The resource must exist in your BigML account; otherwise, the workflow execution will fail. You can either hard-code this value or use the “Find a resource” option to select the proper resource dynamically based on a number of criteria. This will be further detailed below.
  • Input Data: a list of values to be used as a prediction input. For each of them, you specify both the feature name and its actual value.
  • Additional input arguments: to specify how the prediction should be calculated. Allowed arguments will vary with the prediction algorithm you choose. For example, an ensemble prediction allows you to specify how to handle missing values, as well as what kind of combiner to use, etc.

[Image: the input fields for a BigML Predict action]

Dynamically selecting ML resources

If you joined our beta program, chances are you’ve noticed that the biggest new feature in our BigML Predict app is the “Find a resource” search option. It simply lets you specify a number of search criteria to identify a resource to use for predictions.

[Image: the “Find a resource” search option]

This means you can, for example, specify a project name and a resource type to identify the most recent resource of that type belonging to the specified project. The result of this operation is a Resource ID that you can use in any subsequent step of your Zapier workflow to further manipulate that resource. The image below displays all the search criteria you can use.

[Image: the available search criteria]

At a bare minimum, to search for a resource, you should provide its type, e.g., anomaly detector, ensemble, etc. If you only specify a resource type, the “Find a resource” search will select your latest resource of that type among all of your resources. Alternatively, you can make your search more specific by also providing any of the following information:

  • Resource Name: the name of the resource you would like to select, or a part of it.
  • Resource Tag: a tag associated with the required resource.
  • Project Name: the name of the project your resource should belong to. You can also specify only a part of the project name.
  • Project Tag: a tag associated with the project your resource should belong to.
  • Mode: either Production or Development mode. If you don’t specify anything, the “Find a resource” search will look into your production resources by default.

If your search criteria aren’t specific enough to identify just one resource, the most recent one will be used.

To include a search step in one of your Zapier workflows, link the Resource ID field of your action (e.g., Create Prediction) to the output provided by the search step, as shown in the picture below.

[Image: linking the Resource ID field to the search step’s output]

Get access to the improved BigML Predict app

We hope that the new features we added to BigML Predict for Zapier will help you better use Zapier to solve your ML automation problems.

If you are interested in giving the new BigML Predict app a try, please get in touch with us at support@bigml.com.

A Stupidly Easy Speed Detector

My family and I recently moved into a new house in the center of our little college town. We love it, but our new location also puts us next to a busy residential street. All too often, passersby would tear through at well above the posted 25 mph limit. It brought out my latent grumpy side.

Being a data-oriented guy, I wanted some hard numbers on how many folks were speeding, but I wasn’t going to spend hundreds of dollars on a radar system. So instead I threw together a web camera, some simple video processing, and anomaly detection to make a system for tracking vehicle speeds. The diagram below shows the process from a high level, but I’ll also dive into a few of the details.

[Image: high-level workflow diagram for the speed detector]

I’ve previously toyed with combining videos and BigML’s anomaly detection (an extended variety of isolation forests) as a way to do motion detection. By tiling a video and building an anomaly detector for each tile, I made a motion detector that could disregard common movement. For example, in this video the per-tile detectors don’t trigger for the oscillating fan, but they do for the nefarious panda interloper (on loan from my daughter).

To do this, I extracted features for each tile, such as the average red, green, and blue pixel values (shout-out to OpenIMAJ for making this easy). This gave me tile-specific datasets in which each row represents a video frame.

Once the data is collected in row/column form, it’s easy to create an anomaly detector with the BigML language bindings. I do my hobby projects in Clojure, so the BigML Clojure bindings let me transform the data above into an anomaly detection function with only a small snippet of code (available in the project on GitHub).

That snippet loads the tile data, builds an anomaly detector as a Clojure function, and uses that function to score a new point. Scores from BigML anomaly detectors are always between 0 and 1. The stranger a point, the larger the score. Generally, scores greater than 0.6 are the interesting ones. The green highlighted tiles in the panda example represent scores above 0.65.

So I took this tiling+detectors approach and applied it to video of cars passing my house. My intuition was that while tracking cars can be tricky, learning the regular background should be easy. Then all I’d need to do is track the clumps of anomalies which represented the cars.

Instead of tiling the entire video, I only tiled two rows. Each row captured a vehicle lane. I tracked the clumps of anomalies and timed how long it took them to sweep across the video. Those times let me estimate vehicle speeds.

By tracking clumps of anomalies the system is more robust to occasional misfires by individual tile detectors. Also, as expected, the detectors helped ignore common motion like the small tree swaying in the foreground.

An approach like this is far from perfect. It can be confused by things like lane changes, bicycles, or tall trucks (which can register as two vehicles or occlude other cars).

[Image: a tall truck confusing the detector]

Nonetheless, I was pleasantly surprised by how well it did given its simplicity. With occasional retraining of the detectors, it also handled shadows and shifting lighting conditions. In some cases it tracked vehicles even when I had a hard time finding them. There is, believe it or not, a car in this image:

[Image: a vehicle hidden in deep shadow]

So I had a passable vehicle counter and speed detector using a webcam. To cap off the project, I collected vehicle speeds over a typical Saturday afternoon. The results surprised me.

[Image: bar chart of measured vehicle speeds]

I expected speeders to be much more common than they actually were. In fact, significant speeders (which I defined as 30+ mph) made up only about 3% of the total. So I’ve done my best to lose the grumpiness. Without the data, I’d just be one more victim of confirmation bias.

For the Clojure-friendly and curious, feel free to check out the project on GitHub.

How to create a WhizzML script – Part 2

In this second post about WhizzML basics, we go deeper into script creation methods. In the previous post, How to create a WhizzML script – Part 1, you learned the basic concepts of WhizzML and how to clone existing scripts. In this tutorial, we introduce how to create and edit WhizzML scripts with the script editor and how to try out code in the Web REPL. Let’s dive in!

[Image: WhizzML workflow loop]

Write your own scripts

Start by selecting “Scripts” under the WhizzML menu.

[Image: the Scripts option under the WhizzML menu]

Once you are in the scripts section, you have the option to create a new script using the editor or to import one from GitHub (explained in the next post).

Create a script from scratch with the editor

[Image: the embedded script editor]

With this option, you can write the code directly in the embedded editor, which provides syntax highlighting and autocomplete capabilities.

In addition to the language’s built-in directives, you’ll be able to use any procedure defined in a previously created library, provided that you import that library first.

Once you have written your new WhizzML script, you should validate it to make sure your code is written with the correct WhizzML syntax. When the script code is validated, our REPL editor will automatically extract the inputs and outputs involved and present them for you to define their types. Errors, if any, will also be highlighted.

The simple WhizzML code in the image gallery below takes a number x as input and returns x+2. So the output (“result”) is also a number. In general, inputs and outputs have many possible types ranging from the basic string, number, boolean, list or map types to any of the resources in BigML, like sources, datasets, models or even scripts and executions. Once the code of your script is validated, you can actually create the script. If you execute it by providing an input of 2, you’ll observe that the output equals 4 as expected. Keep in mind that you can execute a script as many times as you wish in a repeatable and traceable way.
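
As a rough sketch of what such a script’s source could look like (our own illustration; the actual code shown in the gallery images may differ), the whole program can be a single definition whose name is declared as the script’s output:

;; "x" is declared as a number input of the script;
;; "result" is declared as its number output
(define result (+ x 2))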

Create a script from an existing one

Imagine you have already executed a script, but you want to modify it slightly. Or you just have an old one you’d like to improve on. In these cases, you can easily create a new script from an existing one. Just navigate to the script you’d like to use and click on the “Create a new script using this one” menu option. Pretty self-explanatory, huh? This should save you a lot of time.

[Image: creating a new script from an existing one]

Web REPL

One way to test WhizzML code is to use the WhizzML REPL. This feature can be found in the BigML LABS section that contains cool new functionality for our users to test before they get queued up to be fully integrated into the Dashboard as part of later releases. The WhizzML REPL is a simple, interactive programming environment that takes single expressions, evaluates them, and returns the result. When you open it, you’ll see a window to write code and a console to run it too. So you can edit your code to your heart’s desire, dynamically run and test it before implementing it in a finished script.
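
For example, typing a single expression like the one below into the REPL evaluates it immediately (a trivial illustration of our own):

(let (x 3)  ;; bind x to 3
  (* x x))  ;; evaluates to 9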

Now you know how to create a WhizzML script in different ways, but there is still more to learn. The next step is discovering how to use GitHub and BigMLer to create scripts and libraries, all of which will be covered in our next post. Stay tuned!

How to create a WhizzML script – Part 1

WhizzML is a Domain-Specific Language (DSL) developed by BigML. It is a powerful tool for automating Machine Learning (ML) workflows and implementing high-level ML algorithms. In this series of blog posts, you will learn WhizzML from scratch. In this post, we’ll explain where to find WhizzML scripts and how you can use them. Let’s start!

[Image: WhizzML overview]

What is a script?

First, we’ll remind you of some definitions to clarify some important WhizzML concepts.

A workflow is the series of activities that are necessary to solve an ML problem by using the resources provided by BigML.

A script is a workflow specification that uses WhizzML source code.

A library is a shared collection of WhizzML functions and definitions usable by any script that imports them.

An execution is a specific run of a script, i.e., the actual realization of all the steps codified by its WhizzML source.

You may be wondering what’s inside a WhizzML script and how it’s structured. Let’s discover that next. WhizzML scripts have four components:

  • Source: the WhizzML source code that defines the script itself
  • Imports: the list of libraries with the code used by the script
  • Inputs: the list of input values that parameterize the workflow
  • Outputs: the list of values computed by the script and returned to the user

[Image: the four components of a WhizzML script]
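
To make those components concrete, here is a hypothetical one-line script of our own (not the one pictured above): its source creates a dataset, source-id is declared as an input, and dataset-id as an output.

;; inputs: source-id (an existing source resource)
;; outputs: dataset-id (the newly created dataset)
(define dataset-id (create-dataset {"source" source-id}))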

How to create a WhizzML script?

There are different ways to create a WhizzML script depending on what tools you use. BigML offers several options:

  • Gallery
  • Scriptify
  • Script editor
  • GitHub
  • BigMLer

We’ll focus on the Gallery and Scriptify methods in this blog post. You’ll learn how to use the other methods in future blog posts. BigML provides a public Gallery with scripts, datasets, and models for your benefit. By choosing a resource and cloning it, you will have it in your Dashboard, where you can run it or create an execution. The Gallery is a great way to get inspired by high-level workflows from other users. It can also save you time by eliminating the need to recreate basic scripts from scratch.

Scriptify is another convenient tool for creating scripts. With Scriptify, you can create a WhizzML script for any Machine Learning workflow in a single click. All BigML resources (datasets, models, evaluations, etc.) have a menu option named Scriptify that allows you to automatically generate a script that can regenerate the resource (or generate a similar one if it is used with different inputs that have the same field structure). Here is an example of Scriptify applied to a Model resource.

[Image: the Scriptify menu option on a model]

What about the Execution?

Executing a WhizzML script is easy. Given a script and a complete set of inputs, you can execute your workflow and watch it generate the output. There are also different tools you can use to do just that:

  • Web UI
  • BigMLer
  • Bindings

Let’s focus on the first method and leave the others for later blog posts. When you’re in the Dashboard, you can go to the WhizzML menu, where you will find your scripts, libraries, and executions.

[Image: the WhizzML menu in the Dashboard]

Let’s use the script we just imported from the Gallery as an example of an execution. In this script, we just have to choose one input before we execute it. Once it finishes, we will find it under “Executions”. You can then find the output information, seen below with the name “result”. You can also see the resources that have been created to execute this script under the Resources section. You can execute a script as many times as you want in a repeatable and traceable way.

How to add a script to the resources menu?

You can easily add a script to your resources menu. This lets you use your favorite scripts as many times as you want. Simply follow these steps:

  1. When you’re viewing a script, find the “Add a Script to this Menu” option at the top right.
  2. Choose the resource this script is for, e.g., Datasets, Models, Ensembles, Evaluations.
  3. Once added, you can find it in the script menu of that resource view. For example, if I add a script for datasets, I will be able to use it in the Dataset tab. The ability to reuse the same script many times helps you become more productive.

Now you know the basics of creating a WhizzML script, but this is just the beginning. In the next post, we’ll discover other ways to create and use scripts.

BigML Spring 2017 Release Webinar Video is Here!

We are happy to share that Time Series, BigML’s latest resource, is now fully implemented on the BigML platform and available from the BigML Dashboard and API, as well as from WhizzML for automation. Special thanks to all webinar attendees who joined the BigML Team yesterday during the official launch. As usual, your feedback and questions are very much appreciated and help us keep improving BigML every day!

Time Series is a sequentially indexed representation of your historical data that helps you forecast future values of numerical properties. It is commonly used to predict stock prices, forecast sales, analyze website traffic, plan production and inventory, and forecast the weather, among many other use cases.

Don’t fret if you missed the live webinar. It is now available on the BigML YouTube channel so you can watch it as many times as you wish.

Please visit our dedicated Spring 2017 Release page for further reading. The learning resources available include:

  • The slides used during the webinar.
  • The Time Series documentation to learn how to create and evaluate your Time Series, and interpret the results before making forecasts from the BigML Dashboard and the BigML API.
  • The series of six blog posts that explain Time Series starting with the basics and progressively diving deeper into the technical and practical aspects of this new resource, with an emphasis on Time Series models for forecasting.

Thanks for your positive comments after the webinar. And remember that you can always reach out to us at support@bigml.com for any suggestions or questions you may have.

Behind the Scenes of BigML’s Time Series Forecasting

BigML’s Time Series Forecasting model uses Exponential Smoothing under the hood. This blog post, the last one in our series of six about Time Series, explores the technical details of Exponential Smoothing models to help you gain insights about your forecasting results.

Exponential Smoothing Explained

To understand Exponential Smoothing, let’s first focus on the smoothing part of the term. Consider the following series, depicting the closing share price of EBAY over a 400-day period.

[Image: EBAY closing share price over a 400-day period]

There is definitely some shape here, which can help us tell the story of this particular stock symbol. However, there are also quite a few transient fluctuations which are not necessarily of interest. One way to address this is to run a moving average filter over the data.

[Image: EBAY share price with a moving average overlay]

The output of the moving average (MA) filter is shown as the blue line. At each time index, we compute the filtered data point as the arithmetic mean of the unfiltered data points located within a window of fixed width m about that time index. Given time series data y, a (symmetric) moving average filter produces the filtered series:

\hat{y}_t = \frac{1}{m} \sum_{j=-k}^{k} y_{t+j}, \quad \textrm{where} \quad m = 2k+1

As seen in the figure, the resulting filtered time series contains only the large scale movements in the stock price, and so we have successfully smoothed the noise away from the signal.

When we apply Exponential Smoothing to a time series, we are performing an operation that is somewhat similar to the moving average filter. The exponential smoothing filter produces the following series:

\ell_t = \alpha y_t + (1 - \alpha)\ell_{t-1}

Where 0 < \alpha < 1 is the smoothing coefficient. In other words, the smoothed value \ell_t is the \alpha-weighted average between the current data point and the previous smoothed value. If we substitute the value for \ell_{t-1}, we can rewrite the exponential smoothing expression like so:

\ell_t = \alpha \sum_{j=0}^{t-1} (1 - \alpha)^j y_{t-j} + (1 - \alpha)^t \ell_0

Where \ell_0 is the initial smoothed state value. Here, we see that the exponentially smoothed value is a weighted sum of the original data points, just as with the MA filter. However, whereas the MA filter computes a uniformly-weighted sum over a window of constant width, the exponential smoother computes the sum going all the way back to the beginning of the series. Also, the weights are highest for the points closest to the current time index and decrease exponentially going back in time. To verify that this produces a smoothing effect, we can apply it to our EBAY data and look at the results.

[Image: EBAY share price with an exponential smoothing overlay]

Why would we choose to smooth a time series using an exponential window instead of a moving average? Conceptually, the exponential window is attractive because it allows the filter to emphasize a point’s immediate neighborhood without completely discarding the time series’ history. The fact that the parameter \alpha is continuously valued means that there is more freedom to fine-tune the smoother’s fit to the data, compared to the moving average filter’s integer-valued parameter.

Now, the other half of time series modeling is creating forecasts. Both the moving average and exponential smoother have a flat forecast function. That is, for any horizon h beyond the final data point, the forecast is just the last smoothed value computed by the filter.

\hat{y}_{t+h|t} = \ell_t

This is admittedly quite a simplistic result, but for stationary time series, these forecast values can be usable for reasonably short horizons. In order to forecast time series which exhibit more interesting movement, we need to incorporate trend into our model.

Trend models

In the previous section we smoothed a time series using a single pass of an exponential window filter, resulting in a “level-only” model which produces flat forecasts. To introduce some motion into our exponential smoothing forecasts, we can add a trend component to our model. We will define trend as the change between two consecutive level values \ell_{t-1} and \ell_t, and then interpret this purposefully vague definition in two ways:

  1. The difference between consecutive level values (additive trend): r_t = \ell_t - \ell_{t-1}
  2. The ratio between consecutive level values (multiplicative trend): r_t = \ell_t / \ell_{t-1}

We can then perform exponential smoothing on this trend value, in an identical fashion to the level value:

b_t=\beta r_t + (1-\beta)b_{t-1}

Where 0 < \beta < 1 is the trend smoothing coefficient. This combination of exponential smoothing for level and trend is frequently referred to as Holt’s linear or exponential trend method, after the author who first described it in 1957. The forecast for a given horizon h from an exponential smoothing model with trend is simply the most recent level value with the smoothed trend applied h times. That is,

\hat{y}_{t+h|t} = \ell_t + h b_t \quad \textrm{or} \quad \hat{y}_{t+h|t} = \ell_t b_t^h

Hence, for additive trend models the forecast is a straight line, and for multiplicative trend models it is an exponential curve. In some cases, it may be undesirable for the trend to continue at a constant rate as the forecast horizon grows. We can introduce a damping coefficient 0 < \phi < 1 and reformulate the smoothing equations. The forecast, level, and trend equations for a damped additive trend model are:

\hat{y}_{t+h|t} = \ell_t + (\phi + \phi^2 + \cdots + \phi^h) b_t
\ell_t = \alpha y_t + (1 - \alpha)(\ell_{t-1} + \phi b_{t-1})
b_t = \beta (\ell_t - \ell_{t-1}) + (1 - \beta) \phi b_{t-1}

and for multiplicative trend:
\hat{y}_{t+h|t} = \ell_t b_t^{\phi + \phi^2 + \cdots + \phi^h}
\ell_t = \alpha y_t + (1 - \alpha) \ell_{t-1} b_{t-1}^{\phi}
b_t = \beta (\ell_t / \ell_{t-1}) + (1 - \beta) b_{t-1}^{\phi}

Seasonal models

Many time series exhibit seasonality, that is, a pattern of variation that repeats over consecutive periods of fixed length. For example, alcohol sales may be higher during the summer than the winter, year after year, so a time series containing monthly sales figures of beer could exhibit a seasonal pattern with a period of m=12. Once again, seasonality can be modeled additively or multiplicatively. In the former case, the seasonal variation is independent of the level of the series, whereas in the latter, the variation is modeled as a proportion of the current level.
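
For reference, in the standard Holt-Winters formulation (a textbook form; the original post does not show these equations explicitly), the additive seasonal component with period m has its own smoothing coefficient 0 < \gamma < 1 and is updated just like the level and trend:

s_t = \gamma (y_t - \ell_{t-1} - b_{t-1}) + (1 - \gamma) s_{t-m}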

To bring it all together, the following is an example of a time series which exhibits both trend and seasonality.

[Image: a time series with both trend and seasonality, decomposed into level, slope, and seasonal components]

Note how the level is a smoothed version of the observed data, and the trend (labeled “slope”) is more or less the rate of change in the level.

Learning exponential smoothing models

Exponential smoothing models are fully specified by their smoothing coefficients \alpha, \beta, \gamma, and \phi, along with the initial state values \ell_0, b_0, and s_0 (the remaining state values are obtained by running the smoothing equations forward). To evaluate how well an exponential smoothing model fits the data, we compute what is called the “within-sample one-step-ahead forecast error”. Put plainly, for each time step t, we compute the forecast for one step ahead and calculate the error between that forecast and the actual data from the next time step.

e_t = y_t - \hat{y}_{t|t-1}

We compute these errors for each time step where we have observed data available, and the sum of squared errors is our metric for model fit. This metric is then used to perform numeric optimization in order to obtain the best values for the smoothing coefficients and initial state values. BigML uses the Nelder-Mead simplex algorithm as its optimization solution.

Model Selection

Considering all the different combinations of trend and seasonality types for exponential smoothing means that we must choose among over a dozen model configurations for a time series modeling task. Therefore, we need some way to rank the models against each other. Naturally, the ranking should incorporate a measure of how well the model fits the training time series, but it should also help us avoid models which overfit the data. The tool that fits these requirements is the Akaike Information Criterion (AIC). Let \hat{L} be the maximum likelihood value of the model, computed from the sum of squared errors between the model fit and the true training values. Let k be the total number of parameters required by the model type. For example, an A,Ad,A model with a seasonality period of 4 uses 10 parameters: 4 smoothing coefficients and 6 initial state values (one level, one trend, and 4 seasonal). The AIC is defined by the following difference:

\textrm{AIC} = 2k - 2\ln(\hat{L})

Models which produce lower AIC values are considered better choices, so the best model is the one which maximizes the likelihood \hat{L} while minimizing the number of parameters k. Along with the AIC, BigML also computes two additional metrics for each model: the bias-corrected AIC (AICc) and the Bayesian Information Criterion (BIC). These quantities are also log-likelihood values penalized by model complexity; however, the degree to which they punish extra parameters varies, with the BIC being the most sensitive and the AIC the least.

Want to know more about Time Series?

If you have any questions or you’d like to learn more about how Time Series work, please visit the dedicated release page. It includes a series of six blog posts about Time Series, the BigML Dashboard and API documentation, the webinar slideshow as well as the full webinar recording.

Automating Time Series with WhizzML


Since the beginning of our civilization, humans have worried about the future and, in particular, about predicting it. It’s widely known that in ancient Greece the most famous oracle was at Delphi. Greeks went there to find out about their future and to decide what they should do to turn their fortunes around. Three thousand years later, the worry about how to act in the future remains; however, we’ve learned to base our decisions on the algorithms and science inherited from Pythagoras, Euclid, Thales, or Archimedes rather than on Pythia’s words.

[Image: the Oracle of Delphi]

To continue our series of posts about Time Series, this fifth blog post focuses on WhizzML users. WhizzML is our Domain-Specific Language for Machine Learning workflow automation, which provides programmatic support for all your BigML resources and is executed entirely on the BigML back-end.

Every resource in BigML can be managed through WhizzML, so it follows naturally that you can now use WhizzML scripts to create Time Series models and make forecasts with them.

For the Time Series resource, we begin by explaining how to split a dataset based on the range parameter, as it’s important to keep the data in the same order both when creating the Time Series and when evaluating it. As explained in a previous blog post, when testing other supervised models we can use any randomly sampled subset of instances as test data. However, evaluating a Time Series requires the holdout to be a range of our data, because the training-test splits must be sequential. For an 80-20 split, the test set is the final 20% of rows in the dataset. WhizzML calculates the split ranges from a specified percentage of rows. In the script below, we set aside the first 80% of rows for training.

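The snippet appeared as an embedded image in the original post. The sketch below captures the idea, assuming datasets expose a "rows" count and accept the "origin_dataset" and "range" creation arguments documented in the BigML API:

;; Split a dataset linearly: the first (pct * rows) rows go to the
;; training dataset, and the remaining rows to the test dataset.
(define (linear-split dataset-id pct)
  (let (rows (get (fetch dataset-id) "rows")
        split-row (floor (* rows pct))
        train-ds (create-dataset {"origin_dataset" dataset-id
                                  "range" [1 split-row]})
        test-ds (create-dataset {"origin_dataset" dataset-id
                                 "range" [(+ split-row 1) rows]}))
    [train-ds test-ds]))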

The function linear-split receives a dataset and a percentage, and creates two complementary datasets, one for testing (test-ds) and one for training (train-ds), by splitting the existing rows into two ranges. This is the largest script in this post, so it only gets easier from here on.

Now let’s see how to create a Time Series that models our data. Because we would like to evaluate our Time Series later on, we will use the train-ds dataset produced by our split script. In fact, the only mandatory parameter to create a Time Series is the ID of the dataset used for training. You can also specify which ETS models you would like to generate; by default, BigML will explore all of them. So the simplest code to create a Time Series is as easy as the example that follows.

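A minimal sketch (assuming the creation procedure follows WhizzML's usual create-<resource> naming):

;; train-ds holds the training dataset ID produced by the split above
(define time-series-id (create-timeseries {"dataset" train-ds}))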

In case you have more information about your data, you might want to use other Time Series creation parameters. The full list of parameters can be found in the Time Series section of the API documentation.

For monthly data and seasonal activities, you can set an additive trend for your data, with its seasonality set to additive (with value 1) and its period specified as 12. Then you can fill in these properties in the function that creates the Time Series, as in the example below.

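A sketch of that configuration, assuming the "trend", "seasonality", and "period" creation fields from the API documentation, with 1 encoding the additive variant as noted above:

(define time-series-id
  (create-timeseries {"dataset" train-ds
                      "trend" 1        ;; additive trend
                      "seasonality" 1  ;; additive seasonality
                      "period" 12}))   ;; 12 data points per season (monthly data)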

This is a good point to remind ourselves that most WhizzML creation requests are asynchronous, so it’s quite possible that you will need to wait for the resource to finish before referring to it in other scripts or accessing its properties. For the previous example, the code would look like this:

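Using WhizzML's wait procedure to block until the resource is finished, the previous sketch becomes:

(define time-series-id
  (wait (create-timeseries {"dataset" train-ds
                            "trend" 1
                            "seasonality" 1
                            "period" 12})))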

Once the Time Series has been created, we can evaluate how good its fit is. Remember that our original dataset was chronologically split into two parts; now we will use the remaining 20% of the dataset to check the Time Series model’s performance. The test-ds parameter in the code below represents that second part of the dataset. Following WhizzML’s less-is-more philosophy, creating an evaluation requires a simple snippet with only two mandatory parameters: a Time Series to be evaluated and a dataset to use as test data.

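A sketch with just the two mandatory parameters (the exact key name for the Time Series ID is an assumption based on the API documentation):

;; evaluate the model's fit against the held-out test range
(define evaluation-id
  (create-evaluation {"timeseries" time-series-id
                      "dataset" test-ds}))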

In the evaluation object, there are some measures for each one of the ETS models in the Time Series. For more on this, see section 5 of our previous post.

After evaluating your Time Series, what’s next is calling on the aforementioned modern-day oracle. Once you build a Time Series with the entire original dataset, that is, including the held-out rows, you can forecast the future values of one or many fields in your data domain. The code below demonstrates the simplest case, where the forecast is made for only one of the fields in your dataset.

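A sketch of the forecast call using the "horizon" parameter described in the API documentation; selecting which field to forecast is done through the creation arguments documented there:

;; full-ts-id: a Time Series built on the entire original dataset
(define forecast-id
  (create-forecast {"timeseries" full-ts-id
                    "horizon" 50}))  ;; forecast the next 50 data points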

For a developer, the cool part of WhizzML is that it allows you to create a complete script with all the steps you need and execute them in the cloud or in an on-premises deployment with a single call. This takes advantage of the service’s built-in scalability and parallelization capabilities and minimizes latency and possible network brittleness while exchanging information with the cloud. You can do this by creating an execution of your script, either directly through the BigML API or by using any of the existing BigML bindings.

BigML offers bindings for several programming languages that allow you to create not only the resources available in the platform, such as Time Series and Evaluations, but also Scripts and Executions. Everything can be managed from your favorite programming language, like Python or Node.js, among many others. You can see the complete list of our bindings and the related documentation on our dedicated Tools page.

Want to know more about Time Series?

If you have any questions or you’d like to learn more about how Time Series work, please visit the dedicated release page for further learning. It includes a series of six blog posts about Time Series, the BigML Dashboard and API documentation, the webinar slideshow as well as the full webinar recording.

Programming Time Series with BigML’s API

In this blog post, the fourth of our series of six, we provide a brief summary of all the necessary steps to create a Time Series using the BigML API. As stated in our previous post, Time Series is often used to forecast the future values of a numeric field that is sequentially distributed over time, such as stock prices, sales volume, or industrial data, among many other use cases.

The API workflow to create a Time Series includes five main steps: first upload your data to BigML, then create a dataset, create your Time Series, evaluate it, and finally make forecasts. Note that any resource created with the API will automatically appear in your Dashboard too, so you can take advantage of BigML’s intuitive visualizations at any time.

If you have never used the BigML API before, note that all requests to manage your resources must use HTTPS and be authenticated with your username and API key to verify your identity. Below is a base URL example for managing Time Series:

https://bigml.io/timeseries?username=$BIGML_USERNAME;api_key=$BIGML_API_KEY

1. Upload your Data

You can upload your data, in your preferred format, from a local file, a remote file (using a URL), or from your cloud repository, e.g., AWS or Azure. This will automatically create a source in your BigML account.

First, you need to open a terminal with curl or any other command-line tool that implements standard HTTPS methods. In the example below, we create a source from a local CSV file containing the monthly gasoline demand in Ontario from 1960 until 1975, which we previously downloaded from DataMarket.

curl "https://bigml.io/source?$BIGML_AUTH"
      -F file=@monthly-gasoline-demand-ontario-.csv

Remember that Time Series needs to be trained with time-based data. BigML assumes the instances in your source data are chronologically ordered, i.e., the first instance in your dataset is taken as the first data point in the series, the second instance as the second data point, and so on.

2. Create a Dataset

After the source is created, you need to build a dataset, which serializes your data and transforms it into a suitable input for the Machine Learning algorithm.

curl "https://bigml.io/dataset?$BIGML_AUTH"
       -X POST
       -H 'content-type: application/json'
       -d '{"source":"source/595f76362f1dfe13c4002737"}'

3. Create your Time Series

You only need your dataset ID to train your Time Series; BigML will set default values for the rest of the configurable parameters. By default, BigML takes the last valid numeric field in your dataset as the objective field. You can also configure all Time Series parameters at creation time. You can find an explanation of each parameter in the previous post.

You can evaluate your Time Series performance with new data. Since the data in a Time Series is sequentially distributed, a quick way to train and test your model with different subsets of your dataset is the “range” parameter, which lets you specify the subset of instances to use when creating and evaluating your model. For example, if we have 192 instances and we want to use 80% for training the model and 20% for testing it, we can set a range of 1 to 154 so the Time Series only uses those instances. We will then be able to evaluate the model using the rest of the instances (from 155 to 192).

curl "https://bigml.io/timeseries?$BIGML_AUTH"
       -X POST
       -H 'content-type: application/json'
       -d '{"dataset":"dataset/98b5527c3c1920386a000467", 
            "range":[1,154]}'

When your Time Series is created, you will get not one but several models in the JSON response. These models are the result of combining the Time Series components (error, trend, and seasonality) and their variations (additive, multiplicative, damped/not damped) explained in this blog post. Each of these models is identified by a unique name that indicates its error, trend, and seasonality components. For example, the name M,Ad,A indicates a model with Multiplicative errors, Additive damped trend, and Additive seasonality. You can perform evaluations and make forecasts for all models, or you can select one or more specific models.

4. Evaluate your Time Series

When your Time Series has been created, you can evaluate its predictive performance. You just need the Time Series ID and the dataset containing the instances that you want to evaluate against. In our example, we use the same dataset that we used to create the Time Series, this time with the range from 155 to 192, which contains the last instances in the dataset, the ones that weren’t used to train the model.

curl "https://bigml.io/evaluation?$BIGML_AUTH"
       -X POST
       -H 'content-type: application/json'
       -d '{"dataset":"dataset/98b5527c3c1920386a000467", 
            "timeseries":"timeseries/98b5527c3c1920386a000467"
            "range":[155,192]}'

Evaluations for Time Series generate some well-known performance metrics such as MAE (Mean Absolute Error), MSE (Mean Squared Error), and R squared. You will also get other, not-so-common ones like sMAPE (symmetric Mean Absolute Percentage Error), which is similar to MAE except that the model errors are measured in percentage terms, MASE (Mean Absolute Scaled Error), and MDA (Mean Directional Accuracy), which compares the forecast direction (upward or downward) to the actual data direction. You can read more about these metrics in this article.
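
For reference, the sMAPE is commonly defined as follows (a standard textbook definition, not taken verbatim from the BigML documentation):

\textrm{sMAPE} = \frac{100}{n} \sum_{t=1}^{n} \frac{|\hat{y}_t - y_t|}{(|y_t| + |\hat{y}_t|)/2}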

5. Make Forecasts

When you create a Time Series, BigML automatically forecasts the next 50 data points for each model per objective field. You can find the forecast, along with the confidence interval (an upper and lower bound between which the forecast is located with 95% confidence), in the JSON response of the Time Series model.

If you want a forecast for a longer time horizon, you can request one using your Time Series ID, as in the example below.

curl "https://bigml.io/forecast?$BIGML_AUTH"
       -X POST
       -H 'content-type: application/json'
       -d '{"timeseries":"timeseries/98b5527c3c1920386a000467"
            "horizon":100}'

Want to know more about Time Series?

Please visit the dedicated release page for further learning. It includes a series of six blog posts about Time Series, the BigML Dashboard and API documentation, the webinar slideshow as well as the full webinar recording.

Creating your First Time Series Model with BigML’s Dashboard

BigML is bringing Time Series to the Dashboard to help you forecast future values based on your historical data. Time Series is widely used in forecasting stock prices, sales, website traffic, and production and inventory levels, among other use cases. This type of time-based data shares the common attribute of sequential distribution over time.

In this post, we will cover the six fundamental steps it takes to make forecasts by using Time Series in BigML: upload your data, create a dataset, train a Time Series model, analyze the results, evaluate your model, and make forecasts.

[Image: the six-step forecast workflow]

To illustrate each of these steps, we will use a dataset from DataMarket which contains the monthly gasoline demand in Ontario from 1960 until 1975. By looking at the chart below, we can observe two main patterns in the data: the seasonality (more demand during summer vs. winter months) and an increasing trend over the years.

[Image: monthly gasoline demand in Ontario, 1960–1975]

1. Upload your Data

Upload your data to your BigML account. BigML provides many options to do so; in this case, we drag and drop the dataset we previously downloaded from DataMarket onto the Dashboard.

When you upload your data, the data type of each field in your dataset will be automatically detected by BigML. Time Series models will only use the numeric fields in your dataset, e.g., the monthly gas demand field expressed in millions of gallons.

[Image: the source view with automatically detected field types]

Important Note!

BigML indexes your instances in the same order they are arranged in the original source, i.e., the first instance (or row) is taken as the first data point in the series, the second instance as the second data point, and so on. Therefore, you need to ensure that your instances are chronologically ordered in your source data.

2. Create a Dataset

From your Source view, in the 1-click action menu, use the 1-click Dataset option to create a dataset. This is a structured version of your data ready to be consumed by a Time Series.

[Image: the 1-click Dataset option]

Since Time Series is considered a supervised model, you can evaluate it. You can use the 1-click split option to set aside some test data to later evaluate your model against. Since Time Series data is sequentially distributed, the split of the dataset needs to be linear instead of random. Using the configuration option shown in the image below, the first 80% of the instances in your dataset will be set aside for training and the last 20% for testing.

[Image: configuring a linear 80-20 split]

3. Create your Time Series

To train your Time Series you can either use the 1-click Time Series option or you can configure the parameters provided by BigML. BigML allows you to configure the following parameters:

  • Objective fields: these are the fields you want to predict. You can select one or more objective fields and BigML will learn the Time Series models for each field separately, which further streamlines the training process.
  • Default Numeric Value: if your objective fields contain missing values, you can easily replace them with the field’s mean, median, maximum, minimum, or zero. By default, they are replaced using spline interpolation.
  • Forecast horizon: BigML presents a forecast along with your model creation so you can visualize it in the chart. The horizon is the number of data points that you want to forecast. You can always make a forecast for longer horizons once your model has been created.
  • Model components: BigML models your data by exploring different variations of the error, trend and seasonality components. The combinations of these components result in the multiple models returned (see the introductory blog post of this series for a more detailed explanation):
    • Error: represents the unpredictable variations in the Time Series data, and how they influence observed values. It can be additive or multiplicative. Multiplicative error is only suitable for strictly positive data. By default, BigML explores all error variations.
    • Trend and damped: the trend component can be additive, generating a linear growth of the Time Series, or multiplicative, generating an exponential growth of the Time Series. Moreover, if a damped parameter is included, the trend of the Time Series will become a flat line at some point in the future. By default, BigML explores all trend variations.
    • Seasonality: if your data contains fixed periods or fluctuations that occur at regular intervals, you need to add the seasonality component to your models. It can be additive or multiplicative; the latter makes the seasonal variations proportional to the level of the series. By default, BigML explores both methods.
    • Period length: the number of data points per period in the seasonal data. The period needs to be set taking into account the time interval of your instances and the seasonal frequency. For example, for quarterly data and annual seasonality, the period should be 4, for daily data and weekly seasonality, the period should be 7.
  • Dates: you can set dates for your data to visualize them afterward in the x-axis of the Time Series chart. BigML will calculate the dates for each instance by referencing the initial date associated with the first instance in your data and the row interval.
  • Range: you may want to use a subset of instances to create your Time Series. This option is also handy if you haven’t yet split your dataset into training and test sets.

In our example, we configure the seasonality component by selecting “All”, so BigML explores all possible seasonal variations (additive and multiplicative), and we set a period length of 12 since we have monthly data with annual seasonality. We also select the initial date of our dataset with a row interval of 1 month, because each instance represents a single month of data. At that point, we can simply click the Create button to build our Time Series.

[Image: Time Series configuration options]

4. Analyze your Results

When your Time Series has been created, you will see your field values and the best Time Series model plotted in a chart. As mentioned before, BigML learns multiple models as a result of the different component combinations. The best model is selected according to the AIC (Akaike’s Information Criterion), but you can use any of the other metrics offered, such as the AICc (Corrected Akaike’s Information Criterion), the BIC (Schwarz Bayesian Information Criterion), or the R squared. The preferred metric for selecting the best model is usually the AIC (or its variations, the AICc or the BIC) rather than the R squared, since it takes into account the trade-off between the model’s goodness-of-fit and its complexity (to avoid overfitting), while the R squared only measures the degree of adjustment of the model to the data. The AICc is a variation of the AIC for small datasets, and the BIC introduces a heavier penalization of model complexity. To learn more about these metrics, read this article.

[Image: the Time Series chart with the best model plotted]

Below the chart, if you expand the panel, you will find a table containing all the different models learned from your data. You can visualize any of them by plotting them on the chart. Each model has a unique name which identifies its components: Error, Trend, Seasonality. In our example, the model A,A,A is a model with Additive error, Additive trend, and Additive seasonality.

[Image: the table of all learned models]

5. Evaluate your Time Series

You can evaluate a Time Series model using data that the model has not seen before. Just click on the Evaluate option in the 1-click menu and BigML will automatically select the remaining 20% of the dataset that you set aside for testing.

[Image: the 1-click Evaluate option]

When the evaluation has been created, you will be able to see your model plotted along with the test data and the model forecasts. By default, BigML selects the best model by the R squared measure, which quantifies the goodness-of-fit of the model to the test data and can take values up to 1. You will also get different performance metrics for each of your models, such as the MAE (Mean Absolute Error) and the MSE (Mean Squared Error). The lower the MAE and the MSE and the higher the R squared, the better. You can see below that our model performs very well on the test data, with an R squared of 0.9833.

[Image: the evaluation view with an R squared of 0.9833]

The table within the panel below displays all the related models and other metrics such as the sMAPE (symmetric Mean Absolute Percentage Error), which is similar to the MAE except that the model errors are measured in percentage terms, the MASE (Mean Absolute Scaled Error), and the MDA (Mean Directional Accuracy), which compares the forecast direction (upward or downward) to the actual data direction. See this article for detailed explanations.

6. Make Forecasts

From your model view, you will be able to see the forecasts of your selected models for up to 50 future data points. If you want to predict a longer horizon, you can click on the option to extend it. You can also compare your model’s forecast with three benchmark models: a model that always predicts the mean, a naive model that always predicts the last value of the series, and a drift model that draws a straight line between the first and last observations of the series and extrapolates future values.

[Image: the forecast view with benchmark models]

Want to know more about Time Series?

Please visit the dedicated release page for further learning. It includes a series of six blog posts about Time Series, the BigML Dashboard and API documentation, the webinar slideshow as well as the full webinar recording.

Welcoming Enrique Dans to the Valencian Summer School in Machine Learning

As the dates near, and given the initial wave of response, we are getting excited about the upcoming Summer School in Machine Learning 2017. This edition will be held at the Veles e Vents building located close to Valencia’s scenic waterfront on September 14-15.

The VSSML17 is a two-day course for advanced undergraduates as well as graduate students and industry practitioners seeking a quick, practical, hands-on introduction to Machine Learning. The last edition was completed by over 140 participants from 19 countries, representing 53 companies and 21 academic organizations. This year we have room for over 200 attendees, and 26 countries are represented among the applicants so far!

We are happy to share that BigML’s Strategic Advisor, prolific Spanish blogger, and IE Business School Professor, Enrique Dans, will be giving a special talk on September 14 at 6:00 PM CEST, at the end of the first day of the Summer School. Enrique Dans will explain the impact Machine Learning is having in the real-world context of business organizations as they go through their digital transformation.

Professor Dans holds a Ph.D. in Management from the University of California, Los Angeles and an MBA from Instituto de Empresa (IE). He completed his post-doctoral studies at Harvard Business School. He is also the author of the best-seller “Everything is going to change.” Among his other qualifications, he serves as the Information Systems and Information Technology Chair at IE Business School. He was also one of the distinguished speakers at the 2015 conference Technical and Business Perspectives on the Current and Future Impact of Machine Learning.

In the past, the skill set required to develop real-life Machine Learning applications has mostly remained the playground of a privileged few academics and scientists. Times have changed, and many businesses have come to the realization that their workforce can’t afford to stay behind the curve on this key enabler. So we urgently need to produce a much larger group of ML-literate professionals in an inclusive manner that appeals to developers, analysts, managers, and subject matter experts alike. Professor Dans’ talk will go into detail on how best to incorporate Machine Learning into your future strategies while launching new types of products and services nobody even dreamt of until recently.

Don’t miss this groundbreaking, hands-on Machine Learning event. Get your ticket before long, as there are few spaces left. We are looking forward to seeing you in Valencia!
