There’s a lot of hype these days around predictive analytics, and maybe even more hype around the topics of “real-time predictive analytics” or “predictive analytics on streaming data”. Like most things that are over-hyped, what is actually meant by the term is often lost in the noise. In this case that’s really a shame, because these terms refer to at least two different things, either or both of which may be important in a given context.
This post forms the basis of a lightning talk I’m giving (remotely) at the Real-time Big Data Meetup in Menlo Park, California. Join the group if you’re interested!
What is Machine Learning from Streaming Data?
Generally, when I hear people talking about “machine learning from streaming data”, they may be talking about a couple of things.
- They want a model that takes into account recent history when it makes its predictions. A good example is the weather: if it has been sunny and 80 degrees the last two days, it is unlikely that it will be 20 and snowing the next day.
- They want a model that is updatable. That is, they want their model to in some sense “evolve” as data streams through their infrastructure. A good example might be a retail sales model that remains accurate as the business gets larger.
These two phenomena sound like the same thing, but they are potentially very different. The central question is whether the underlying source generating the data is changing. In the case of the weather, it really isn’t (okay, okay): given the weather from the previous few days, you can usually make a pretty good guess at the weather for the next day, and your guess, given recent history, will be roughly the same from year to year. The model that worked last year will work this year.
In the case of the business, the underlying source is changing; the business is growing, and so your guess of the sales given the previous few days of sales is probably going to be different from last year. So last year’s data, when the business was small, is really not relevant to this year, when the business is large. We need to update the model (or scrap it completely and retrain) to get something that works.
The first case, where you want the prediction conditioned on history, I’m going to call time-series prediction. That problem deserves a post all its own, but it suffices to say that solutions to this problem revolve largely around feeding the prediction history to the model as input. That’s a massive oversimplification, but there’s plenty of information out there if you’re more interested.
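As a rough illustration of "feeding the history to the model as input", here is a minimal sketch that turns a series into training pairs of (recent history, next value). The function name `make_windowed_examples`, the window size, and the temperatures are all made up for illustration; any batch learner could then be trained on the resulting pairs.

```python
from typing import List, Sequence, Tuple


def make_windowed_examples(
    series: Sequence[float], window: int
) -> Tuple[List[List[float]], List[float]]:
    """Turn a series into (recent history -> next value) training pairs."""
    inputs: List[List[float]] = []
    targets: List[float] = []
    for t in range(window, len(series)):
        inputs.append(list(series[t - window:t]))  # the last `window` observations
        targets.append(series[t])                  # the value that followed them
    return inputs, targets


# Hypothetical daily high temperatures, for illustration only.
temps = [78.0, 80.0, 81.0, 79.0, 80.0]
X, y = make_windowed_examples(temps, window=2)
```

Note that nothing here requires the model itself to change over time: the recent history is just another input.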
The second case, where you need to update the model or retrain completely, is about dealing with non-stationarity, and that’s largely what the rest of this post is going to be about. But consider this first lesson learned: Time series prediction and non-stationary data distributions are two different problems.
When I think about the second case above, a couple of classes of approaches jump to mind:
- Incremental Algorithms: These are machine learning algorithms that learn incrementally over the data. That is, the model is updated each time it sees a new training instance. There are incremental versions of Support Vector Machines and Neural networks. Bayesian Networks can be made to learn incrementally.
- Periodic Re-training with a batch algorithm: Perhaps the more straightforward solution. Here, we simply buffer the relevant data and retrain our model “every so often”.
Note that any incremental algorithm can work in a batch setting (by simply feeding the input instances in the batch into the algorithm one after another). The reverse, however, isn’t trivially true. Many batch algorithms can only be made to work incrementally with significant effort or a loss of modeling power, and some things just can’t be done.
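To make the "incremental works in batch" direction concrete, here is a minimal sketch with a deliberately tiny incremental model: one that predicts the mean of the targets it has seen. The class `RunningMean` and the helper `fit_batch` are hypothetical names for this example, not part of any library.

```python
from typing import Iterable


class RunningMean:
    """A tiny incremental 'model': it predicts the mean of what it has seen."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0

    def update(self, y: float) -> None:
        # One instance at a time; no past instances are stored.
        self.n += 1
        self.mean += (y - self.mean) / self.n

    def predict(self) -> float:
        return self.mean


def fit_batch(model: RunningMean, batch: Iterable[float]) -> RunningMean:
    """Use an incremental learner in a batch setting: feed instances one by one."""
    for y in batch:
        model.update(y)
    return model


batch = [10.0, 12.0, 14.0, 16.0]
model = fit_batch(RunningMean(), batch)
```

Here the incremental result matches the batch mean of the four values. Going the other way is the hard part: a batch algorithm that needs several passes over all of the data has no equally simple `update` method.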
So is the sacrifice worth it? Here are a couple of considerations to think about:
- Data Horizon: How quickly do you need the most recent datapoint to become part of your model? Does the next point need to modify the model immediately, or is this a case where the model needs to behave conditionally based on that point? If it is the latter, perhaps this is a time-series prediction problem rather than an incremental learning problem.
- Data Obsolescence: How long does it take before data should become irrelevant to the model? Is the relevancy somehow complex? Are some older instances more relevant than some newer instances? Is it variable depending on the current state of the data? Good examples come from economics; generally, newer data instances are more relevant. However, in some cases data from the same month or quarter from the previous year are more relevant than the previous month or quarter of the current year. Similarly, if it is a recession, data from previous recessions may be more relevant than newer data from a different part of the economic cycle.
With these two concerns in mind, along with architecture and implementation, you can get a pretty good idea of whether you’re looking at a problem where incremental learning is desirable or not.
Obviously, the shorter the data horizon, the more likely you are to want incremental learning. However, it’s a common mistake to confuse a short data horizon with a time-series prediction problem: If you want your model to behave differently based on the last few instances, the right thing to do is condition the behavior of the model on those instances. If you want the model to behave differently based on the last few thousand instances, you may want incremental learning.
This, however, is where the second concern rears its ugly head. Incremental learners all have some built-in parameter or assumption that controls the relevancy of old data. This parameter may or may not be modifiable and the relationship may be complex, but the algorithm will be making some implicit assumption about how relevant old data is. This is the second lesson: Be wary of the data relevance assumptions made by incremental learning algorithms!
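A minimal sketch of what such a built-in assumption can look like, using an exponentially weighted mean as a stand-in for an incremental learner. The class name `ExponentialMean`, the parameter `alpha`, and the sales figures are assumptions made for this example. Here the assumption is explicit: every update shrinks the influence of all earlier data by a factor of `1 - alpha`.

```python
class ExponentialMean:
    """Incremental mean whose memory of old data is set by `alpha`."""

    def __init__(self, alpha: float) -> None:
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.value = None  # no data seen yet

    def update(self, y: float) -> None:
        if self.value is None:
            self.value = y
        else:
            # Each update shrinks the influence of everything seen before
            # by a factor of (1 - alpha).
            self.value += self.alpha * (y - self.value)

    def predict(self) -> float:
        return self.value


# Hypothetical daily sales: 20 days at 100, then a jump to 200 for 5 days.
sales = [100.0] * 20 + [200.0] * 5

fast_forgetter = ExponentialMean(alpha=0.5)
slow_forgetter = ExponentialMean(alpha=0.05)
for y in sales:
    fast_forgetter.update(y)
    slow_forgetter.update(y)
```

After the jump, the fast forgetter sits close to 200 while the slow one is still much nearer to 100. Neither setting is right in general; which one you want depends on how quickly old data really does become irrelevant.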
By contrast, retraining in batches has lots of flexibility in this regard. It is easy to select data for retraining, filter by relevant criteria, even weight the data according to some relevancy function using one of the many batch training algorithms that take weighting into account. There’s even been some recent work in automatically detecting when retraining is necessary, based essentially on how different the incoming data is from previous recent data.
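As a sketch of weighting the buffered data by relevancy at retraining time, the example below fits a weighted least-squares line in plain Python. The half-life relevancy function, the 30-day default, and the toy buffer are all assumptions chosen for illustration; in practice the relevancy function could be anything, including one that favors the same quarter of the previous year.

```python
from typing import Sequence, Tuple


def relevance(age_days: float, half_life_days: float = 30.0) -> float:
    """Hypothetical relevancy function: weight halves every `half_life_days`."""
    return 0.5 ** (age_days / half_life_days)


def fit_weighted_line(
    xs: Sequence[float], ys: Sequence[float], weights: Sequence[float]
) -> Tuple[float, float]:
    """Weighted least-squares fit of y = slope * x + intercept."""
    total = sum(weights)
    x_bar = sum(w * x for w, x in zip(weights, xs)) / total
    y_bar = sum(w * y for w, y in zip(weights, ys)) / total
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, xs))
    sxy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(weights, xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Toy buffer of (age_days, x, y): an old regime where y = x,
# and a recent regime where y = 3x.
points = (1.0, 2.0, 3.0, 4.0)
buffer = [(300.0, x, 1.0 * x) for x in points] + [(5.0, x, 3.0 * x) for x in points]

ages = [age for age, _, _ in buffer]
xs = [x for _, x, _ in buffer]
ys = [y for _, _, y in buffer]

unweighted_slope, _ = fit_weighted_line(xs, ys, [1.0] * len(buffer))
weighted_slope, _ = fit_weighted_line(xs, ys, [relevance(a) for a in ages])
```

The unweighted fit averages the two regimes, while the weighted fit lands very close to the recent one. Swapping in a different relevancy function changes only the weights, not the training code.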
First of all, don’t confuse learning from streaming data with time series prediction; while the data sources from the two problems look similar, the two concerns are often orthogonal.
Incremental learning is great for two reasons: First, simplicity. There’s no buffering and no explicit retraining of the model. Second, speed. You always have a model that’s up to date. You make sacrifices, however, in terms of the power of the model that you can learn and the flexibility of the model to incorporate old data to different degrees. There are also some corner cases where incremental learning is necessary, such as when data privacy demands that instances be discarded immediately after they are seen.
Periodic retraining requires more decisions and more complex implementation. However, you get all of the power of any supervised classification algorithm, and specialized tools can be built on top of it to allow you to retrain on only relevant data and only when necessary. It also offers the nice benefit of being able to plug-and-play different machine learning algorithms into your architecture with a minimum of hassle, as the learning bit is built completely from off-the-shelf algorithms.
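To give a flavor of retraining "only when necessary", here is a deliberately simple sketch that flags a retrain when the mean of a recent window drifts far from the mean of a reference window. This is a basic heuristic written for illustration, not any particular published drift-detection method; the function name `needs_retraining`, the threshold of 3 standard errors, and the data are all assumptions.

```python
from statistics import mean, stdev
from typing import Sequence


def needs_retraining(
    reference: Sequence[float], recent: Sequence[float], threshold: float = 3.0
) -> bool:
    """Flag drift when the recent mean is far from the reference mean.

    Distance is measured in standard errors of the recent-window mean,
    estimated from the spread of the reference window.
    """
    ref_mean = mean(reference)
    standard_error = stdev(reference) / len(recent) ** 0.5
    if standard_error == 0.0:
        # A constant reference window: any change at all counts as drift.
        return mean(recent) != ref_mean
    z = abs(mean(recent) - ref_mean) / standard_error
    return z > threshold


# Hypothetical daily sales used when the current model was trained.
reference = [100.0, 102.0, 98.0, 101.0, 99.0, 100.0, 103.0, 97.0]

stable_window = [101.0, 99.0, 100.0, 102.0]
shifted_window = [150.0, 148.0, 152.0, 151.0]

retrain_on_stable = needs_retraining(reference, stable_window)
retrain_on_shifted = needs_retraining(reference, shifted_window)
```

A check like this would run as new data arrives, with the batch retraining step triggered only when it returns true. It only watches a single mean, so it will miss changes that leave the mean alone.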
BigML has mainly gone the second route, allowing you to easily upload new data and trigger retraining (either manually or via the BigML API) whenever you find it necessary. One thing on our to-do list is to implement some of those specialized tools on top of the BigML API, so that data can be streamed to BigML and the model is automatically retrained only when it needs to be. We’ll update you on our blog about any big changes!
EDIT: This article originally used the terms “classifier” and “predictor” more or less interchangeably, and they’re not really interchangeable as one astute reader pointed out. I’ve replaced both with the term “model”, which follows the convention we’ve been using at BigML. Sorry for any confusion that might have caused!
Interesting link to adapting bagging methods for data streams. On the surface, it looks very similar to the work done previously for step-based linear regression models. However, the additional runtime cost is too high for big data and the Hoeffding Adaptive Trees method doesn’t offer a continuous transition from one decision tree to the other.
Hi James – thanks for your input. Bifet has done a lot of other work in that area that you might find more helpful. Here’s an example:
R09-9.pdf
Or just check out his entire bibliography (the work on learning under concept drift starts around ’08/’09):
As he says in the above reference, a lot of the algorithmic choices he makes are to preserve the nice generalization bounds of the models he’s using, and if one’s primary concern is efficiency, I’m sure one can do a lot better.
More broadly, though, I really like the idea of a principled use of concept drift detection to trigger model updates. My experience is that people trying to learn from non-stationary data streams generally just make ad-hoc choices about when to retrain their model, based on almost anything except changes in the underlying distribution. Surely, we can do better than that.
I’m not super-familiar with the work in this area, so I (and other readers) would be grateful for any references you’d like to provide!
Thanks for the post, very timely for my work.
I’m failing to understand the distinction you make between time-series prediction and the construction of an incremental model. If you had an incremental model, you could use it to predict future data, right? If the difference is the amount of data used to predict the future, wouldn’t an incremental model with learning parameters chosen such that old data was down-weighted very quickly be the same as a time-series predictor that was fed only a short window of data?
The distinction seems to exist because an incremental model could be used for much more than prediction. That being the case I think the time-series prediction problem seems to be a subset of the incremental model problem.
Or have I misunderstood?
Hi Sina – Thanks for your comment.
There are two comparisons I am making above: incremental vs. batch (types of learning algorithms) and time-series vs. non-stationary data (types of data problems).
In the former pair, incremental vs. batch, one is indeed a subset of the other; as I mention above, any incremental algorithm can be made to work in a batch fashion, but not vice versa.
In the latter pair, a problem can be either, neither, or both. There are many problems on non-stationary data that have no time-series component. A good example might be something like recognizing corporate logos in images. Say you have a classifier that can find a corporate logo given an image. There is no obvious time-series aspect to this problem (the ordering of the images is not important), but your classifier may still go obsolete because the target logo may change, so the data might be considered non-stationary. If the logo does change, your classifier must either be updated or retrained completely.
There may also be time-series problems with stationary data (given the time series). The weather problem above is such a problem. The business problem above has both non-stationarity and time-series aspects.
As to which algorithms (incremental vs. batch) can be used with which types of data (non-stationary data vs. time series data), you correctly point out above that more or less anything is possible. It is hard to say without seeing the data what the best solution will be.
Importantly, I’m discussing *mainly the non-stationarity issue in the post above*, and talking about incremental vs. batch in that light. If your problem *does* have a time-series component to it, it will probably benefit from one of the many approaches built particularly for time-series data (HMMs, CRFs, sliding window classifiers, etc.). I don’t go into these at all in the post, but there are references.
I hope that makes things more clear. It’s a big topic and I’m just scratching the surface, but understanding the difference between time-series problems and non-stationary data is very important.