Behind the Scenes of BigML’s Time Series Forecasting

BigML’s Time Series Forecasting model uses Exponential Smoothing under the hood. This blog post, the last in our series of six about Time Series, explores the technical details of Exponential Smoothing models to help you gain insight into your forecasting results.

Exponential Smoothing Explained

To understand Exponential Smoothing, let’s first focus on the smoothing part of that term. Consider the following series, depicting the closing share price of EBAY over a 400-day period.


There is definitely some shape here, which can help us tell the story of this particular stock symbol. However, there are also quite a few transient fluctuations which are not necessarily of interest. One way to address this is to run a moving average filter over the data.

The output of the moving average (MA) filter is shown as the blue line. At each time index, we compute the filtered data point as the arithmetic mean of the unfiltered data points located within a window of fixed width m centered on that time index. Given time series data y, a (symmetric) moving average filter with odd width m = 2k + 1 produces the filtered series:

\hat{y}_t = \frac{1}{m} \sum_{j=-k}^{k} y_{t+j}

As seen in the figure, the resulting filtered time series contains only the large scale movements in the stock price, and so we have successfully smoothed the noise away from the signal.
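
As a rough illustration, a centered moving average can be sketched in a few lines of Python. The function name, the half-width k = (m - 1) / 2, and the sample values are our own assumptions for this example, not BigML's implementation; positions where the full window does not fit are simply left empty.

```python
def moving_average(y, m):
    """Centered moving average with odd window width m.

    Returns a list the same length as y. Positions near the edges,
    where the full window does not fit, are left as None.
    """
    if m < 1 or m % 2 == 0:
        raise ValueError("m must be a positive odd integer")
    k = m // 2  # half-width of the window on each side
    smoothed = [None] * len(y)
    for t in range(k, len(y) - k):
        window = y[t - k : t + k + 1]
        smoothed[t] = sum(window) / m
    return smoothed


# Made-up values, only to show the call.
prices = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0, 13.0]
print(moving_average(prices, 3))
```

A wider window removes more of the short-lived fluctuations, at the cost of losing more points at either end of the series.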

When we apply Exponential Smoothing to a time series, we are performing an operation that is somewhat similar to the moving average filter. The exponential smoothing filter produces the following series:

\ell_t = \alpha y_t + (1 - \alpha)\ell_{t-1}

Where 0 < \alpha < 1 is the smoothing coefficient. In other words, the smoothed value \ell_t is the \alpha-weighted average between the current data point and the previous smoothed value. If we repeatedly substitute the value for \ell_{t-1}, we can rewrite the exponential smoothing expression like so:

\ell_t = \sum_{j=0}^{t-1} \alpha (1 - \alpha)^j y_{t-j} + (1 - \alpha)^t \ell_0

Where \ell_0 is the initial smoothed state value. Here, we see that the exponentially smoothed value is a weighted sum of the original data points, just as with the MA filter. However, whereas the MA filter computes a uniformly-weighted sum over a window of constant width, the exponential smoother computes the sum going all the way back to the beginning of the series. Also, the weights are highest for the points closest to the current time index, and decrease exponentially going back in time. To verify that this produces a smoothing effect, we can apply it to our EBAY data and look at the results.
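
To make the equivalence between the recursive form and the weighted-sum form concrete, here is a minimal Python sketch. The function names and the sample numbers are hypothetical; the first function runs the recursion \ell_t = \alpha y_t + (1 - \alpha)\ell_{t-1}, and the second computes the final level directly as an exponentially weighted sum of the data plus the decayed initial state.

```python
def exponential_smoothing(y, alpha, level0):
    """Run the recursion and return the levels l_1..l_T for data y_1..y_T."""
    levels = []
    level = level0
    for value in y:
        level = alpha * value + (1 - alpha) * level
        levels.append(level)
    return levels


def expanded_level(y, alpha, level0):
    """Final level computed from the expanded weighted-sum form."""
    t = len(y)
    # Weight alpha * (1 - alpha)^j is applied to the point j steps back in time.
    weighted = sum(alpha * (1 - alpha) ** j * y[t - 1 - j] for j in range(t))
    return weighted + (1 - alpha) ** t * level0


data = [10.0, 12.0, 11.0]  # made-up values
recursive = exponential_smoothing(data, alpha=0.5, level0=10.0)[-1]
direct = expanded_level(data, alpha=0.5, level0=10.0)
print(recursive, direct)  # the two forms agree up to floating-point rounding
```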


Why would we choose to smooth a time series using an exponential window instead of a moving average? Conceptually, the exponential window is attractive because it allows the filter to emphasize a point’s immediate neighborhood without completely discarding the time series’ history. The fact that the parameter \alpha is continuously-valued means that there is more freedom to fine-tune the smoother’s fit to the data, compared to the moving average filter’s integer-valued parameter.

Now, the other half of time series modeling is creating forecasts. Both the moving average and exponential smoother have a flat forecast function. That is, for any horizon h beyond the final data point, the forecast is just the last smoothed value computed by the filter.

\hat{y}_{t+h|t} = \ell_t

This is admittedly quite a simplistic result, but for stationary time series, these forecast values can be usable for reasonably short horizons. In order to forecast time series which exhibit more interesting movement, we need to incorporate trend into our model.

Trend models

In the previous section we smoothed a time series using a single pass of an exponential window filter, resulting in a “level-only” model which produces flat forecasts. To introduce some motion into our exponential smoothing forecasts we can add a trend component to our model.  We will define trend as the change between two consecutive level values \ell_{t-1} and \ell_t, and then interpret this purposefully vague definition in two ways:

  1. The difference between consecutive level values (additive trend): r_t = \ell_t - \ell_{t-1}
  2. The ratio between consecutive level values (multiplicative trend): r_t = \ell_t / \ell_{t-1}

We can then perform exponential smoothing on this trend value, in an identical fashion to the level value:

b_t=\beta r_t + (1-\beta)b_{t-1}

Where  0 < \beta < 1 is the trend smoothing coefficient.  This combination of exponential smoothing for level and trend is frequently referred to as Holt’s linear or exponential trend method, after the author who first described it in 1957. The forecast for a given horizon h from an exponential smoothing model with trend is simply the most recent level value, with the smoothed trend applied h times. That is,

\hat{y}_{t+h|t} = \ell_t + h b_t \quad \textrm{or} \quad \hat{y}_{t+h|t} = \ell_t b_t^h

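Hence, for additive trend models, the forecast is a straight line, and for multiplicative trend models, the forecast is an exponential curve. In some cases, it may be undesirable for the trend to continue at a constant rate as the forecast horizon grows. We can introduce a damping coefficient 0 < \phi < 1 and reformulate the smoothing equations. Writing \phi_h = \phi + \phi^2 + \dots + \phi^h, the forecast, level, and trend equations for a damped additive trend model are:

\hat{y}_{t+h|t} = \ell_t + \phi_h b_t
\ell_t = \alpha y_t + (1 - \alpha)(\ell_{t-1} + \phi b_{t-1})
b_t = \beta (\ell_t - \ell_{t-1}) + (1 - \beta) \phi b_{t-1}

and for multiplicative trend:

\hat{y}_{t+h|t} = \ell_t b_t^{\phi_h}
\ell_t = \alpha y_t + (1 - \alpha) \ell_{t-1} b_{t-1}^{\phi}
b_t = \beta (\ell_t / \ell_{t-1}) + (1 - \beta) b_{t-1}^{\phi}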
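
The additive case can be sketched in Python as follows. This is only an illustration under the standard damped Holt formulation, in which the level update also carries forward the previous (damped) trend; the function names and the example numbers are our own, and setting phi to 1 recovers the undamped straight-line forecast.

```python
def holt_damped_additive(y, alpha, beta, phi, level0, trend0):
    """Run the level and trend recursions; return the final (level, trend)."""
    level, trend = level0, trend0
    for value in y:
        prev_level = level
        # Level: blend the new observation with the previous one-step prediction.
        level = alpha * value + (1 - alpha) * (prev_level + phi * trend)
        # Trend: smooth the change in level against the damped previous trend.
        trend = beta * (level - prev_level) + (1 - beta) * phi * trend
    return level, trend


def forecast(level, trend, phi, h):
    """Forecast h steps ahead; the trend contribution is damped by phi."""
    damp = sum(phi ** i for i in range(1, h + 1))
    return level + damp * trend


# Made-up, perfectly linear data: the fitted trend stays at 1 per step.
level, trend = holt_damped_additive([2.0, 3.0, 4.0, 5.0], 0.5, 0.5, 1.0, 1.0, 1.0)
print(forecast(level, trend, phi=1.0, h=3))  # undamped: trend applied 3 times
print(forecast(level, trend, phi=0.5, h=3))  # damped: forecast flattens out
```

With 0 < phi < 1 the damping sum converges as h grows, so the forecast levels off instead of rising or falling forever.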

Seasonal models

Many time series exhibit seasonality, that is, a pattern of variation that repeats over consecutive periods of fixed length. For example, alcohol sales may be higher during the summer than the winter, year after year, so a time series containing monthly sales figures of beer could exhibit a seasonal pattern with a period of m=12. Once again, seasonality can be modeled additively or multiplicatively. In the case of the former, the seasonal variation is independent of the level of the series, whereas in the latter, the variation is modeled as a proportion of the current level.
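
The difference between the two seasonality types is easiest to see in how the seasonal component is combined with the level when forecasting. The sketch below is a simplified, hypothetical illustration that ignores trend: `season` holds the m most recent seasonal values, ordered so that the first entry applies to the next time step, and the pattern repeats every m steps.

```python
def seasonal_forecast(level, season, h, multiplicative=False):
    """Forecast h steps ahead from a level and one period of seasonal values.

    season[0] applies to the next step, season[1] to the one after, and so
    on, wrapping around after m = len(season) steps.
    """
    component = season[(h - 1) % len(season)]
    if multiplicative:
        return level * component  # variation proportional to the level
    return level + component      # variation independent of the level


# Made-up numbers with a period of m = 2.
print(seasonal_forecast(100.0, [5.0, -5.0], h=1))
print(seasonal_forecast(100.0, [1.1, 0.9], h=2, multiplicative=True))
```

If the level doubled, the additive swings would stay the same size, while the multiplicative swings would double along with it.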

To bring it all together, the following is an example of a time series which exhibits both trend and seasonality.


Note how the level is a smoothed version of the observed data, and the trend (labeled “slope”) is more or less the rate of change in the level.

Learning exponential smoothing models

Exponential smoothing models are fully specified by their smoothing coefficients \alpha, \beta, \gamma, and \phi, along with the initial state values \ell_0, b_0, and s_0 (the remaining state values are obtained by running the smoothing equations forward). To evaluate how well an exponential smoothing model fits the data, we compute what is called the “within-sample one step ahead forecast error”. Put plainly, for each time step t, we compute the forecast for one step ahead, and calculate the error between the forecast and the actual data from the next time step.

e_{t+1} = y_{t+1} - \hat{y}_{t+1|t}

We compute these errors for each time step where we have observed data available, and the sum of squared errors is our metric for model fit. This metric is then used to perform numeric optimization in order to obtain the best values for the smoothing coefficients and initial state values. BigML uses the Nelder-Mead simplex algorithm as its optimization method.
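
For the level-only model, the fit metric can be sketched as below. Note the assumptions: the function names are hypothetical, the initial level is held fixed rather than optimized, and a coarse grid search over \alpha stands in for the Nelder-Mead simplex search purely to keep the example dependency-free. It is not how BigML performs the optimization.

```python
def one_step_sse(y, alpha, level0):
    """Sum of squared within-sample one-step-ahead forecast errors."""
    level = level0
    sse = 0.0
    for value in y:
        # For a level-only model, the one-step forecast is the previous level.
        error = value - level
        sse += error ** 2
        level = alpha * value + (1 - alpha) * level
    return sse


def fit_alpha(y, level0, steps=99):
    """Pick the alpha in (0, 1) with the lowest SSE from an evenly spaced grid."""
    candidates = [(i + 1) / (steps + 1) for i in range(steps)]
    return min(candidates, key=lambda a: one_step_sse(y, a, level0))


data = [0.0, 10.0, 10.0, 10.0]  # made-up series with a sudden jump
best = fit_alpha(data, level0=0.0)
print(best, one_step_sse(data, best, 0.0))
```

For this made-up series a large \alpha wins, because the smoother has to catch up with the jump as quickly as possible.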

Model Selection

Considering all the different combinations of trend and seasonality types for exponential smoothing can mean that we must choose among over a dozen different model configurations for a time series modeling task. Therefore, we need some way to rank the models against each other. Naturally, the ranking should incorporate a measure of how well the model fits the training time series, but should also help us avoid models which overfit the data. The tool that fits these requirements is the Akaike Information Criterion (AIC). Let \hat{L} be the maximum likelihood value of the model, computed from the sum of squared errors between the model fit and the true training values. Let k be the total number of parameters required by the model type. For example, an A,Ad,A model (additive error, damped additive trend, additive seasonality) with a seasonal period of 4 uses 10 parameters: 4 smoothing coefficients and 6 initial state values (one level, one trend, and 4 seasonality). The AIC is defined by the following difference.

\mathrm{AIC} = 2k - 2\ln(\hat{L})

Models which produce lower AIC values are considered better choices, so the best model is the one which maximizes the likelihood \hat{L} while minimizing the number of parameters k. Along with the AIC, BigML also computes two additional metrics for each model: the bias-corrected AIC (AICc), and the Bayesian Information Criterion (BIC). These quantities are still log-likelihood values penalized by model complexity. However, the degree to which they punish extra parameters varies, with the BIC being the most sensitive and the AIC the least.
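
To see how the three criteria penalize complexity differently, here is a small Python sketch. It relies on an assumption the post does not spell out: that errors are Gaussian, so that, up to an additive constant, -2\ln(\hat{L}) = n \ln(SSE / n) for n training points. The function name is hypothetical, and the AICc and BIC formulas are the textbook ones rather than anything confirmed to be BigML's exact computation.

```python
import math


def information_criteria(sse, n, k):
    """AIC, AICc, and BIC from the sum of squared errors.

    Assumes Gaussian errors, so -2 ln(L) = n * ln(sse / n) up to a constant.
    n is the number of training points and k the number of parameters.
    """
    neg2_loglik = n * math.log(sse / n)
    aic = neg2_loglik + 2 * k
    aicc = aic + (2 * k * (k + 1)) / (n - k - 1)  # small-sample correction
    bic = neg2_loglik + k * math.log(n)           # penalty grows with n
    return {"AIC": aic, "AICc": aicc, "BIC": bic}


# Made-up fit: 100 points, 10 parameters (as in the A,Ad,A example above).
print(information_criteria(sse=100.0, n=100, k=10))
```

Because every criterion shares the same likelihood term, comparing them for one model shows only the size of each penalty; ranking models means comparing one criterion across candidate models.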

Want to know more about Time Series?

If you have any questions or you’d like to learn more about how Time Series work, please visit the dedicated release page. It includes a series of six blog posts about Time Series, the BigML Dashboard and API documentation, the webinar slideshow as well as the full webinar recording.
