I work at a consumer web company, and recently used BigML to understand what drives return visits to our site. I followed Standard Operating Procedure for data mining, sampling a group of users, dividing them into two classes, and creating several features that I hoped would be useful in predicting these classes. I then fed this training data to BigML, which quickly and obediently produced a decision tree:
Next I used BigML’s interface to examine the tree’s many subsets, shown as “nodes” in the diagram above. I moused over a node at the top of the tree and saw that it achieved high separation for a large fraction of the training set:
This one node covered 58% of the data, and separated the two classes with 73% confidence. ("Confidence" is a measure of node purity, and for this node 73% of the data belongs to class "0".) With a little more work, I found another node that covered another 22% of the data, this time predicting class "1". For the remaining 20% of data, the best rule I could find (after much mousing-over in the tree) was a single node with a lousy 51% confidence—barely an improvement over flipping a coin. I affectionately named these nodes Rule 1, Rule 2 and Blind Spot.
This is a common use for decision trees: gaining insight by finding the “best” nodes as measured by the fraction of data they cover (“support”) and their purity (“confidence”). When exploring a decision tree for insight, the goal is to find the smallest collection of useful rules that accurately summarizes your data.
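To make those two measures concrete, here is a minimal sketch in Python using the simplified definitions above (support as the fraction of instances a node covers, and confidence simply as node purity, as described in the previous paragraph):

```python
# A minimal sketch: support and (simplified) confidence for a tree node.
# Support = fraction of all training instances the node covers; confidence
# here is just the purity of the node's majority class, per the text above.
def node_stats(class_counts, total_instances):
    node_size = sum(class_counts.values())
    majority_class = max(class_counts, key=class_counts.get)
    support = node_size / float(total_instances)
    purity = class_counts[majority_class] / float(node_size)
    return majority_class, support, purity

# Numbers roughly matching "Rule 1" above: ~58% support, ~73% confidence.
print(node_stats({"0": 20300, "1": 7500}, 48000))
```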
Today BigML is launching the SunBurst visualization, which makes it much easier to gain insight from decision trees. Below is a SunBurst viz of my tree: the nodes are now shown as arcs, with the number of radians representing support and the color representing confidence. With a minimum of manual searching, I can easily find Rule 1, Rule 2 and especially the Blind Spot (which, together with its subsets, stands out in ugly, non-predictive brown):
Let us ponder the amazing feat this Burst of Sun has achieved. In an eight-dimensional data set of 48,000 instances, I can see immediately which nodes have the highest combination of support and confidence. But wait, there’s more: I can also see exactly how all of the nodes fit together in a tree hierarchy, which gives me further insight into the data. For example, the upper right of the tree shows several large subsets of Rule 1 that glow bright green, and a closer look reveals that these subsets stack like Russian nesting dolls, each one prettier (but smaller) than its parent. So if I felt that Rule 1 misclassified too many instances, I could easily select one of its prettier children instead, choosing higher confidence at the cost of lower support.
Try out the SunBurst confidence visualization for yourself by training a model, going to the Models tab, and clicking the hypnotic SunBurst icon:
Then just click the Confidence icon. May all your nodes be green!
Better visualization obviously does not solve everything, and the usual cautions about understanding your data and validating your model still apply. While the above model is useful (since I haven’t said anything about the actual data it’s trained on, you’ll have to take my word for it), I still want to try a larger data set, and validate the resulting model by splitting the data into training and test. Perhaps most importantly, we cannot assume that two subsets with similar confidence but different predicted classes (like Rule 1 and Rule 2) will make equally valid rules in practice, since a false positive could be much, much worse than a false negative, or vice versa.
Nonetheless, the SunBurst is a huge leap forward in decision tree visualization. BigML’s Adam Ashenfelter explains that the SunBurst “may not be as intuitive as our regular tree view”, and he’s right—it’s about a thousand times more intuitive than the regular tree view. You can have my SunBurst when you pry it from my cold, dead retinas.
Accessing BigML via our REST API is easy, requiring only a username and an API Key. Every account registered with BigML automatically gets a master API Key which has full access to all capabilities within your account. That is, with the master key you can programmatically create, retrieve, update or delete sources, datasets, models, ensembles, predictions, and evaluations, all via the command line, any of the API bindings that we or our fans have been developing, or your own private implementation.
We even make finding and using the API Key easy. BigML’s web interface provides an icon for each resource that lets you get its URL with the api key already encoded, allowing you to access the resource directly from within your application.
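As a rough sketch of what that looks like in code, assuming the username/api_key query-string style of authentication (the username, key, and file names below are placeholders for your own values):

```python
# A rough sketch, assuming username/api_key query-string authentication;
# the username and key below are placeholders for your own credentials.
import requests

BIGML_USERNAME = "alfred"                          # placeholder
BIGML_API_KEY = "your-master-or-alternative-key"   # placeholder
BIGML_AUTH = "username=%s;api_key=%s" % (BIGML_USERNAME, BIGML_API_KEY)

# List your models; any resource URL works the same way.
response = requests.get("https://bigml.io/model?" + BIGML_AUTH)
print(response.json())
```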
However, although the power of your master API Key makes working with BigML’s API easy, it also comes with potential risk. There is no way to share access to your resources in a limited way, and if you do share your master API Key, then you are granting access to every capability in your account. The only method to mitigate this risk previously was the ability to recreate your master key on demand:
In order to address this limitation, our latest release brings the ability to add Alternative API Keys to your account with finer-grained controls. You can define which resources a key can access and which operations (i.e., create, list, retrieve, update or delete) are allowed with it. This is useful in scenarios where you want to grant different roles and privileges to different applications: for example, an application for the IT folks that collects data and creates sources in BigML, another that is used by data scientists to create and evaluate models, and a third that is used by the marketing folks to create predictions.
We have implemented some logic behind the scenes to ensure that the permissions you assign are sound. For example, if you want a key to be able to create models, it must also be able to read datasets and models; similarly, if you want your API key to be able to create evaluations it must be able to read datasets, models, and also evaluations.
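That dependency logic is straightforward to express. Here is a hypothetical sketch (purely illustrative, not BigML's actual implementation) of expanding a requested permission set into the read permissions it implies:

```python
# A hypothetical sketch of the dependency logic described above: creating a
# resource implies read access to the resources it is built from.
READ_DEPENDENCIES = {
    ("model", "create"): [("dataset", "read"), ("model", "read")],
    ("evaluation", "create"): [("dataset", "read"), ("model", "read"),
                               ("evaluation", "read")],
}

def expand_permissions(requested):
    """Return the requested permissions plus any implied read permissions."""
    expanded = set(requested)
    for permission in requested:
        expanded.update(READ_DEPENDENCIES.get(permission, []))
    return expanded

# A key meant only to create evaluations also needs to read datasets,
# models, and evaluations.
print(expand_permissions({("evaluation", "create")}))
```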
If you give Alternative API Keys a try please let us know what you think, especially if there is anything we could improve to make it more useful. We appreciate your feedback and are available to help!
One of the pitfalls of machine learning is that creating a single predictive model has the potential to overfit your data. That is, the performance on your training data might be very good, but the model does not generalize well to new data. Ensemble learning of decision trees, also referred to as forests or simply ensembles, is a tried-and-true technique for reducing the error of single machine-learned models. By learning multiple models over different subsamples of your data and taking a majority vote at prediction time, the risk of overfitting a single model to all of the data is mitigated. You can read more about this in our previous post.
Earlier this year, we showed how BigML ensembles outperform their solo counterparts and even beat other machine learning services. However, up until now creating ensembles with BigML has only been available via our API. We are excited to announce that ensembles are now available via our web interface and that they have also become first-class citizens in our API.
You can create an ensemble just as you would create a model, with the addition of three optional parameters:
- Whether you want fields to be selected randomly at each split (i.e., decision forest) or only bagging to be used.
- The number of models.
- The task level parallelism.
A decision forest (or random decision forest) is created by selecting a random subset of the input fields at each split while each individual model in the ensemble is being built, instead of considering all the input fields. This is the strategy that BigML uses by default. If you just want to use bagging, you should deselect this option.
Bagging, also known as bootstrap aggregating, is one of the simplest ensemble-based strategies but often outperforms strategies that are more complex. This method uses a different random subset of the original dataset for each model in the ensemble. By default, BigML uses a sampling rate of 100% with replacement for each model, meaning that individual instances can be selected more than once from the dataset. You can select different sampling rates using the sampling configuration panels.
Number of Models
The default is ten, but depending on your data and other modeling parameters you might want to use a bigger number. Generally, increasing the number of models in an ensemble lowers the effect of noise and model variability, and has no downside except the additional cost to you, the user. The cases where more models are likely to be beneficial are when the data is not terribly large (in the thousands of instances or less), when the data is very noisy, and for random decision forests, when there are many correlated features that are all at least somewhat useful.
Keep in mind that each additional model tends to deliver decreasing marginal improvement, so if the difference between nine and ten models is very small, it is very unlikely that an eleventh model will make a big difference.
Task Level Parallelism
Task-level parallelism is the degree of parallelism that BigML will use to perform a task that decomposes into embarrassingly parallel sub-tasks, like building the models of a random decision forest. We offer five different levels: at the lowest, sub-tasks are performed sequentially, and at the highest, up to 16 sub-tasks are performed in parallel. The higher the level, the faster your ensemble will finish, but the more credits it will cost you.
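Putting the three options together, creating an ensemble through the API might look like the sketch below, using our Python bindings. The argument names shown (randomize, number_of_models, tlp) are assumptions corresponding to the three options described above, so check the API documentation for the definitive names.

```python
# A sketch using the BigML Python bindings. The ensemble argument names are
# assumptions matching the three options described above.
from bigml.api import BigML

api = BigML()  # reads BIGML_USERNAME / BIGML_API_KEY from the environment

source = api.create_source("./my_data.csv")
dataset = api.create_dataset(source)
ensemble = api.create_ensemble(dataset, {
    "randomize": True,        # random decision forest; False means plain bagging
    "number_of_models": 20,   # more than the default of 10
    "tlp": 5,                 # assumed name for the task-level parallelism setting
})
api.ok(ensemble)              # wait until the ensemble is finished
```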
You can also create an ensemble in just one click. By default, a 1-click ensemble will create a random decision forest of 10 models using 100% of the original dataset but sampling it with replacement.
Predicting with Ensembles
Once your ensemble is finished, creating a prediction is the same as creating a prediction with a single model, with one additional step: the predictions from the individual models of the ensemble must be combined into a final prediction. The default combination method is plurality vote for a classification ensemble and a simple average for a regression ensemble.
BigML offers three different methods for combining the predictions of an ensemble (a small sketch of the first two follows the list):
- Plurality - counts each model’s prediction as one vote for classification ensembles. For regression ensembles, the predictions are averaged.
- Confidence Weighted - uses each prediction’s confidence as a voting weight for classification ensembles. For regression ensembles, computes a weighted average using the associated error as the weight.
- Probability Weighted - uses the probability of the class in the distribution of classes in the leaf node of each prediction as a voting weight for classification ensembles. For regression ensembles, this method is equivalent to the plurality method above.
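As a minimal sketch of how the first two combiners behave for classification (the class labels and confidences are made up for illustration):

```python
# A minimal sketch of plurality and confidence-weighted voting for a
# classification ensemble; each prediction is a (class, confidence) pair
# coming from one model in the ensemble.
from collections import defaultdict

def combine(predictions, method="plurality"):
    votes = defaultdict(float)
    for predicted_class, confidence in predictions:
        weight = confidence if method == "confidence" else 1.0
        votes[predicted_class] += weight
    return max(votes, key=votes.get)

preds = [("churn", 0.41), ("churn", 0.45), ("stay", 0.98)]
print(combine(preds))                        # plurality: "churn" (2 votes to 1)
print(combine(preds, method="confidence"))   # confidence-weighted: "stay" (0.98 > 0.86)
```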
Predictions take longer with ensembles than with single models, but you can also download an ensemble using our "download actionable ensemble" button to perform low-latency predictions directly in your applications. So far actionable ensembles are only available in Python, but we’ll bring them to more programming languages soon. We also plan to bring high-performance predictions in an upcoming release, so stay tuned.
You can also evaluate an ensemble in the same way as a single model.
The level of accuracy achieved by ensembles of decision trees on previously unseen data very often beats most other techniques, even when those techniques are more sophisticated or complex. Not surprisingly, then, it is common to find random decision forests among the top performers in Kaggle competitions. Finally, ensembles of decision trees can be applied to a multitude of tasks such as classification, regression, manifold learning, density estimation, and semi-supervised classification in thousands of real-world domains. If you’re interested in a great monograph about random decision forests, we recommend this book.
We hope that you give BigML ensembles a try and let us know about your experience and results. Moreover, this new release includes a number of small goodies like 1-click training|test splits, alternative API keys to access your BigML resources with different privileges, comparing evaluations, and many other things under the hood to make everything come together. We’ll explain it all in future blog posts!
If you’ve built decision trees with BigML or explored our gallery, then you should be familiar with our tree visualizations. They’re a classic and intuitive way to view trees. The root is at the top, its children are the next level down, the grandchildren are deeper still, and so forth.
While intuitive, this sort of visualization does have some drawbacks. Decision trees often grow too wide to comfortably fit the display area. We compensate by collapsing the less important parts of the tree and then letting the user choose where to drill down (either picking specific branches or with our filtering options). It works, and we’re happy with it as our default visualization. But it’s not the only way to look at a decision tree.
Recently we’ve explored SunBurst tree visualizations as a complement to our current approach. A SunBurst diagram is a little like nested pie charts. Instead of the traditional side view of the decision tree, it’s akin to viewing the tree from the top down. The root of the tree is a circle in the center of diagram and its children wrap around it. The arc length of each child corresponds to the percentage of the training set covered by the child.
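The layout idea is simple: each node gets an angular span proportional to the share of training instances it covers, and its children subdivide that span. A minimal sketch of the idea (assuming each node is just an instance count plus a list of children; this is not BigML's model format):

```python
# A minimal sketch of the SunBurst layout idea: a node's angular span is
# proportional to the instances it covers, and its children subdivide the
# parent's span. Nodes here are plain dicts, purely for illustration.
import math

def assign_spans(node, start=0.0, end=2 * math.pi):
    node["span"] = (start, end)
    total = float(node["count"])
    cursor = start
    for child in node.get("children", []):
        width = (end - start) * child["count"] / total
        assign_spans(child, cursor, cursor + width)
        cursor += width

root = {"count": 100, "children": [{"count": 60, "children": []},
                                   {"count": 40, "children": []}]}
assign_spans(root)
print(root["children"][0]["span"])  # covers 60% of the circle (~3.77 radians)
```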
We don’t have the visualization ready for production yet, but thanks to the wonderful D3 library and bl.ocks.org from Mike Bostock, we can give you a sneak peek. Below are two ways of viewing a decision tree built on the Iris dataset (the Hello World of machine learning). Click on the images to see either our default visualization or the preview of our SunBurst style.
You may have noticed that we do coloring a bit differently with the SunBurst style. In the regular tree viz, we color according to the field each node splits on. In the SunBurst style, we’re experimenting with coloring each node by the most common class. In other words, if you had to make a prediction at that spot in the tree, which class is the best choice? We also give you options if you’d rather see the tree colored by the split field or by prediction confidence.
While the Iris model is a fine example for getting used to the view, the SunBurst really shines with larger, more complex trees. Because it is a more space-efficient way to lay out a tree, we don’t need to worry about filtering or pruning. We can visualize the entire tree, making it easy to spot where the tree focuses on various classes or is most confident in its predictions. The model for the Forest Cover dataset illustrates this well.
With the traditional view, hovering the mouse over a tree node highlights the decision path to that node. The decision made at every level of the tree is displayed to the right (such as "Elevation > 2706"). With the SunBurst view we also highlight the decision path, but we’ve opted to collapse the individual decisions when showing the criteria for reaching a node (displayed in the lower left corner). For example, a decision path three levels deep ["Elevation > 2706", "Elevation > 3062", "Elevation <= 3303"] might be reduced to a single rule: "3062 < Elevation <= 3303".
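Collapsing a path like that is mostly a matter of keeping, per field, the tightest lower and upper bounds seen along the way. A rough sketch for numeric predicates only (the field name and thresholds come from the example above):

```python
# A rough sketch of collapsing a numeric decision path into one rule per
# field: keep the tightest lower and upper bounds seen along the path.
import re

def collapse(path):
    bounds = {}  # field -> [lower_bound, upper_bound]
    for predicate in path:
        field, op, value = re.match(r"(\w+) (<=|<|>=|>) (.+)", predicate).groups()
        low, high = bounds.setdefault(field, [float("-inf"), float("inf")])
        value = float(value)
        if op in (">", ">="):
            bounds[field][0] = max(low, value)
        else:
            bounds[field][1] = min(high, value)
    return ["%g < %s <= %g" % (low, field, high)
            for field, (low, high) in bounds.items()]

path = ["Elevation > 2706", "Elevation > 3062", "Elevation <= 3303"]
print(collapse(path))  # ['3062 < Elevation <= 3303']
```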
Our new SunBurst view also supports regression trees. The example below shows a model built on the Concrete Compressive Strength dataset. The lighter blues indicate concrete mixes with more strength while darker blues mean weaker mixes.
The SunBurst visualization may not be as immediately intuitive as our regular tree view. Nonetheless we think it will help advanced users get more insight into their trees. We’re looking forward to adding this to BigML, and as always we welcome comments and suggestions!
Ensemble algorithms. Why do they work? It seems like kind of a crazy idea: Let’s take our original learning problem, create a bunch of new problems that aren’t quite the same but very related, and learn a model on each new problem. Then, when we want to classify a new point, we classify it using all of these models and combine predictions, usually using some form of voting.
Why bother with all this? Why not learn a single model over the original problem and be done with it?
Our Chief Scientist, Professor Tom Dietterich, has an older but excellent and still relevant tutorial paper that addresses this very question (you might also be interested in some of his other tutorials). As with our previous review of an interesting piece of machine learning literature, this paper is fairly lightweight for the academic literature but somewhat heavy compared to your average Nancy Drew mystery. So I’ll give you a rundown here that should be a little easier for non-experts to follow.
In that previous paper review, I talked a little bit about how machine learning algorithms looking for good models are a bit like people trying to find a good restaurant. People aren’t going to eat at every restaurant in an attempt to find the best one. They’re going to do things like ask friends and look at menus to filter out obvious losers, then eat at the ones they think have a pretty good chance of being good.
Now, if we extend that analogy, an ensemble learning algorithm is like a group of people all looking for a good restaurant in the same area, and getting together after their search to vote on the best one. Dietterich identifies three important reasons that this can be better than one person looking on her own. I like to think of these reasons as being related to three of the cardinal virtues in Greek philosophy (mostly as one of my many attempts to amuse myself).
One person (whom we’ll call our singleton restaurant critic) is trying to determine the best restaurant in the city. She narrows things down to, say, five really great restaurants. Among these great restaurants, the differences are so small that even a misplaced fork can change the critic’s decision.
The problem here is that our critic only has one vote, and must pick one, even though all seem equally good. If we had many critics, each eating a meal at each restaurant, the differences might become more obvious. Maybe that fork is misplaced every night. Or maybe it’s not, but occasionally something is way over-cooked, or someone gets food poisoning. This will introduce votes against that restaurant into the ensemble, and ultimately make it less likely to be chosen in the aggregate over the restaurants that don’t make such mistakes.
This is the statistical advantage described by Dietterich. One problem produces only one guess at the correct model, and if there are many models that look good given the training data, we have an “all of our eggs in one basket”-style risk of picking the wrong model. By defining many related problems and voting on the correct answers, we reduce this risk of being too hasty in our choice of model.
Who is our singleton restaurant critic? As a human being, she probably has tastes like anyone else. Suppose she has a taste for Chinese food. This person might label a Chinese restaurant as the best, even if there’s a slightly better burrito place around the corner, just because of personal preference.
When you’ve got a group of critics, this is a lot less likely to happen. Some people might like Chinese food best, or hamburgers, or soup, but if all of them put the burrito place in second, you’ve got a pretty good idea that the burritos have the highest overall appeal.
This is the representational advantage for ensembles; even if it’s not in the “vocabulary” of the constituent models to find a great solution, a much better solution may well be in the vocabulary of averages of the constituent models. It’s the machine learning version of the wisdom of the crowd: While any individual might be foolish, the aggregate can sometimes make the right call with remarkable accuracy.
Our singleton restaurant critic has heard from one of her friends that the best food in town is in three neighborhoods. As her money and time are limited, she decides that, rather than look all over the city, she’s going to focus her search on these three neighborhoods.
But it turns out that, while this is true, there’s someone doing something amazing with gourmet hot dogs in another part of the city. An army of searchers, all with different friends, is likely to encounter that restaurant at least a few times.
This is what Dietterich calls the computational advantage for ensembles. Some learning algorithms may make decisions early on in their search that lead them to get stuck in local minima, meaning they find a solution that looks very good compared to all the ones they’ve seen, but there’s a better solution lurking elsewhere. By creating slightly different versions of the problem (different sets of friends, say), you make it likely that at least some of your critics will find that lurking solution. While individual learners are too slothful to explore the entire search space, an army does a better job.
The Rest of the Story
So we know it’s good to define multiple related learning problems and vote on the answers. How do we do that? Section two of Dietterich’s paper goes over some of the best ways we know. Section three gives some brief experiments that show some of the weaknesses of those methods, which we’ve mentioned in this space before. I won’t go over the rest of the paper here, but I do encourage people interested in creating these ensembles to read those sections, as they’re an excellent introduction for those looking to go a little deeper into ensembles.
At BigML, you can already create some types of these ensembles via the command line, and we’ll soon bring this to the interface. Stay tuned!
In the video below we profile Tex Hardcattle and Brand McRopinride, a pair of BigML users with a really interesting application. Hope you enjoy it!
For those of you who are wondering how multivariate time-series EEG data is only five-dimensional in the dataset shown, you should know that Tex and Brand preprocessed the data using a discrete wavelet decomposition, took aggregate statistics over the wavelet subbands, then performed kernel PCA. They’re not as simple as they make themselves out to be.
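For the curious, a pipeline along those lines might look roughly like the sketch below, assuming the PyWavelets and scikit-learn libraries; this is an illustration of the general recipe, not their actual code.

```python
# A sketch of the preprocessing recipe described above, assuming PyWavelets
# and scikit-learn; raw_eeg is a (n_recordings, n_timesteps) array for one
# channel. This is illustrative, not Tex and Brand's actual pipeline.
import numpy as np
import pywt
from sklearn.decomposition import KernelPCA

def wavelet_features(signal, wavelet="db4", level=4):
    # Discrete wavelet decomposition, then aggregate statistics per subband.
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    stats = []
    for band in coeffs:
        stats.extend([band.mean(), band.std(), np.abs(band).max()])
    return stats

raw_eeg = np.random.randn(200, 1024)           # placeholder data
features = np.array([wavelet_features(row) for row in raw_eeg])

# Kernel PCA down to five dimensions, matching the dataset shown in the video.
reduced = KernelPCA(n_components=5, kernel="rbf").fit_transform(features)
print(reduced.shape)                           # (200, 5)
```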
P.S. – Thanks to Footage Island!
We’re going to take a momentary departure from our typical blog and reflect on some of the progress we’ve made to date, tell you about some big things to expect down the road, and share the opportunity to take part in our Early Adopter Program (details are at the bottom of this post, so read on!).
For starters, we’re really excited that we’ve grown to over 3,000 registered customers for BigML — the vast majority of whom have joined since our public launch late last year. We’re super excited about this growth rate as it clearly demonstrates that there’s a palpable need out there for easy-to-use machine learning services. And not only are there a lot of you, but you’re doing a whole bunch of work with our system: you’ve created thousands of datasets (10,000+) and models (13,000+).
We like to think that a lot of this growth is due to the fact that our team continues to innovate at a furious pace — just since our October 23 launch we’ve brought the following functionality out in BigML:
- Interactive filters to find interesting patterns.
- Integration with Azure data market.
- The launch of our public gallery for predictive models.
- Clojure library for BigML API.
- Addition of Evaluations in BigMLer, a command line tool for Machine Learning.
- Inline sources to quickly test models in development mode.
- Myriad UI updates and changes to facilitate ease of use.
At BigML we take advantage of many fantastic Clojure open source libraries. We wanted to do our part to support this community, so we released our Streaming Histograms for Clojure and Java, and also our Random Sampling Library for Clojure.
And we’re not done — for those of you using the API, you know that we already support ensembles, including random decision forests — one of the machine learning techniques widely recognized as a top performer across a multitude of domains. This will soon also be exposed through our web interface, meaning that with just a few clicks you’ll be able to upload a data source, create a dataset, and create an ensemble to make robust predictions that just a few years ago would literally cost you hundreds of thousands of dollars’ worth of hardware, software licenses and manpower. We think this is pretty cool, to say the least — and hope you feel the same way.
Some other features and activities you can look forward to will include:
- Subscription-based pricing plans so you can use BigML as much as you want for a fixed price
- Virtual Private Cloud for companies with data policy requirements and/or intensive use-case requirements
- Ability to integrate BigML visualizations into your own website or application
What are we missing? If you’re a current BigML user you may be a candidate for our Early Adopter Program, which we are launching to work with a handful of customers to test-drive and provide feedback on our latest features before they are published for everyone.
As a novice in this machine learning field, I managed to make a lot of mistakes. But I was fortunate to have knowledgeable and patient colleagues who guided me on the straight and narrow path to predictive success (like Charlie Parker, who co-authored this post). Here are three machine learning mistakes I managed to make more than once. I hope this helps you recognize and understand them when you encounter them using BigML.
There’s a sense of anticipation as you watch a predictive model being built, layer by layer in BigML’s web interface. But for some of my datasets, it would stop at the root node, leaving me with a perfectly useless 1-node decision tree. I remember filing it as a bug the first time it happened!
It turned out that I had used a skewed dataset: the classes for my objective field were not balanced in the training set. If, for instance, 99% of my training set had "TRUE" as the class for its objective feature, the model would simply predict "TRUE" in all cases and be 99% accurate. It’s a mistake that is simple to remedy, if you have enough data: simply throw out some of the data from the dominant class. Be careful not to throw out too much, though (you don’t want to throw away useful data), and be sure to pick the points you throw away at random, otherwise you might introduce some bias.
How much should you throw away? It’s a bit of a guessing game, but sometimes the relative importance of the two classes can provide an insight. Say it’s 10 times more important to you to get the "FALSE" examples right than the "TRUE" examples. A good place to start might be to throw away enough data to make the "FALSE" class 10 times more likely than it was in the original dataset. So if the dataset is 1% "FALSE", throw out "TRUE" examples until it is 10% "FALSE".
Can’t afford to throw away data? A similar strategy is to replicate the "FALSE" datapoints until you have the desired balance (in the example above, you would replicate each "FALSE" point 10 times). This is another way of telling the modeling process that the "FALSE" points are more important than the "TRUE" points and can’t just be ignored without a significant performance decrease.
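Both strategies take only a few lines of code; here is a minimal sketch with pandas, using the objective field and class names from the example above (the file name is a placeholder):

```python
# A minimal sketch of both rebalancing strategies described above, using
# pandas; "label" is the objective field with classes "TRUE" and "FALSE".
import pandas as pd

data = pd.read_csv("training.csv")             # placeholder file name
majority = data[data["label"] == "TRUE"]
minority = data[data["label"] == "FALSE"]

# Strategy 1: randomly throw away majority rows until "FALSE" is ~10% of the
# data (roughly 9 "TRUE" rows kept for every "FALSE" row).
downsampled = pd.concat([minority,
                         majority.sample(n=9 * len(minority), random_state=42)])

# Strategy 2: keep everything, but replicate each "FALSE" row 10 times.
oversampled = pd.concat([majority] + [minority] * 10)
```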
One Feature Spoils the Party
I remember working really hard on a dataset that contained property data and house prices in Manhattan. Every instance had both the actual sales price and an indexed price, so I could compare sales prices over time. I was excited when I finally uploaded the dataset and was quick to push the 1-click model option. The resulting tree was disappointing, to say the least: it predicted the indexed price but used ‘price’ as the main feature in the tree. After all, knowing the price is a perfect indicator of the indexed price! In my enthusiasm to create a model, I had forgotten to exclude ‘price’ and ended up with (again) a perfectly accurate but useless model.
This happened to me more than once, where one feature would be a good or even perfect predictor for the objective feature. Another good example is datasets that have both the total value for the objective, and some kind of average value as well; they’re usually not perfect proxies for each other, but often you can predict one from the other with pretty high accuracy. Usually, that’s not helpful as you’re not going to know either one ahead of time, so a predictor for one from the other is useless.
The cure is again quite simple. If you find such a relationship in your dataset, simply deselect the feature that is bothering you and run your analysis again.
Too Many Classes
While the previous mistakes are quite obvious and render useless models, the last mistake took me longer to appreciate. Quite proudly I would share a model with one of my colleagues only to hear the experts say ‘too big’ or ‘too wide’ (always in a constructive and respectful tone, of course!). How can a tree be too big or too wide?
One feature that occurs a lot in US datasets, for instance, is ‘State’. Since there are 50 US states, a node that splits on ‘State’ could split into 50 different branches. If that happens more than once in a tree, you can get a very wide tree indeed. A similar feature would be ‘Country’ if you use international data. If your tree splits on such a feature with a lot of classes, the data is divided after that split into a lot of small buckets, and the support for each following decision gets lower and lower. Once the data gets small (at the lower leaves of the tree), splitting it 50 ways just isn’t useful anymore. Said another way, if you’ve got 50 categories, you need a whole lot of data to fill them up.
BigML limits the number of categories you can use for a feature to 300. If the feature has more categories than that, it is considered a text item and is automatically deselected from the analysis. If your feature has more than 35 categories (but no more than 300), it is still labeled categorical but is automatically deselected as well; this lets you override the deselection by manually selecting the feature to include it in your analysis.
One way of improving the performance of trees with features that have many classes is to bundle classes. For instance, instead of ‘State’ you can add a feature called ‘Region’ to limit the number of classes. Likewise, for ‘Country’ you can add ‘Continent’. This can go all sorts of ways: perhaps "coastal" and "non-coastal", or "primarily urban" and "primarily rural". In machine learning speak, this is called ‘feature engineering’, and it is a crucial technique for improving the performance of your model. In any case, a five-category feature, bundled in a useful way, is almost always more useful than a raw 50-category feature. But if you do think ‘State’ is of high importance and you have enough data per state, you can consider splitting the training set per state. This way you’ll create a predictive model per state and be able to find differences per state in the predicted results.
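As a tiny sketch of that kind of feature engineering (the region assignments and file name are illustrative only):

```python
# A tiny sketch of bundling a 50-class 'State' feature into a coarser
# 'Region' feature before modeling; the mapping is illustrative only.
import pandas as pd

STATE_TO_REGION = {
    "CA": "West", "OR": "West", "WA": "West",
    "NY": "Northeast", "MA": "Northeast",
    "TX": "South", "FL": "South",
    # ...and so on for the remaining states.
}

data = pd.read_csv("customers.csv")            # placeholder file name
data["Region"] = data["State"].map(STATE_TO_REGION)
data = data.drop(columns=["State"])            # keep only the coarser feature
```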
You learn from your mistakes. The first step is to recognize them and know how to remedy them. Even when using a tool like BigML, it is quite possible to make mistakes like these. I am fortunate to have my machine learning colleagues around to help me. If you run into similar issues and you are left clueless, make sure to let us know. We are not in the business of machine learning consultancy or training but we do want you to be successful in applying BigML’s machine learning platform.
We here at BigML are big fans of ensemble algorithms (and Ronald Reagan movies). Using them, a simple model like a BigML decision tree can be leveraged into a very high-powered predictor. During my regular surveys of machine learning research literature, however, I’ve noticed that a very popular class of ensemble algorithms, boosters, has been getting some bad press lately, and I thought I’d offer our readers a brief synopsis.
The tl;dr for this post is absolutely not “boosting is bad”. Adaptive boosting still gives excellent performance in a ton of important applications. In the cases where it fails, another type of ensemble algorithm, even another type of booster, usually succeeds. The overall moral is more about complexity. Boosting is a more complex way of creating an ensemble than are the two types we provide at BigML: Bootstrap Aggregation and Random Decision Forests. We use the simpler methods because they are faster, more easily parallelized, and don’t have some of the weaknesses that boosting has. Sometimes, less is more in machine learning.
A final caveat before I begin: This post is fairly technical. If you’re not up for a deep dive into some machine learning papers at the moment, you might want to check this out.
Learning to Rank
The first paper I’ll mention is Kilian Weinberger’s paper on the Yahoo! Learning to Rank Challenge. Not surprisingly, the paper is about learning to rank search query results by relevance. One of the surprising results is that boosting is outperformed by random forests for a variety of parameter settings. This is surprising because gradient boosted regression trees are among the algorithms commercially deployed in web search ranking.
The authors manage to “fix” boosting by using a two-phase algorithm in which the gradient booster is initialized with the model generated by the random forest algorithm. Still, however, the fact that plain ol’ boosting gets beat by random forests in such an important and common application is tough to ignore.
Bayesian Model Combination
Another recent paper that I like a lot (and that hasn’t gotten the attention it deserves) is Kristine Monteith and Tony Martinez’s paper on Bayesian Model Combination. In this paper, we see that boosting slightly outperforms bootstrap aggregation when performance is averaged over a large number of public datasets. This is expected.
However, the authors then try a fairly simple augmentation of the bagged model: They essentially choose random weighted combinations of the models learned by bagging and pick several of the best performing combinations to compose their final model (note that this is a bit of an oversimplification; the variety of randomness they use is a good deal fancier than your garden-variety coin-flipping randomness, and “choosing the best performers” is also done in a clever fashion). They find that this simple step gives the bagged model a significant performance boost; enough to outperform boosting on average.
This is particularly hard on boosting because the final output of boosting is more or less the same as the final output of the weighted, bagged model; that is, a weighted combination of trees. The fact that we can produce weights better than boosting by selecting them at random suggests that the reason boosting is better than bagging is because it is allowed to weight its constituent models, and not because of the additional complexity used to choose those weights.
The last and scariest result comes from a paper by Phil Long and Rocco Servedio that is now close to five years old. This paper is easily the most theoretical and mathematical of the bunch, but within all the math is a not-too-complex idea.
Suppose we have a training data set with a noisy example; that is, a data set where one of the instances is accidentally mislabeled. This is extremely common in data analysis: maybe in your customer survey someone clicked the "would recommend" button instead of the "would not recommend" button, or the technical support staff marked a permanently broken part as "repaired". In other words, data is rarely completely free of errors.
The authors show that, if you carefully construct a dataset, you can make certain types of boosters (ones that optimize convex potential functions, for those who care) produce models that perform no better than chance by adding just a single carefully selected noise instance. This is a pretty rough result for boosting. The fact that you can destroy an algorithm’s performance by moving a single training instance around means that the algorithm is a lot more fragile than we previously thought.
Now of course this should be taken with a grain of salt. The authors have gone through a lot of trouble to cause a lot of trouble for boosting, and maybe nothing like the data set they’ve constructed really exists in the real world. However, in my experience (and that of others cited in the paper), boosting really is fragile in some cases, and combined with the other two results above this fragility starts to come into clearer focus.
So what’s the point? I still like boosting. It’s still got a lot of good theory and even more good empirical results behind it. You’re probably using boosted models every day without even knowing it. But the three results above show that even a great tool like boosting can’t escape the No Free Lunch Theorem: boosting provides performance benefits at times by using additional algorithmic complexity, but that additional complexity causes weaknesses at other times.
Those weaknesses can sometimes be remedied by using techniques that are simpler and faster, and we’re always looking for such remedies at BigML. As the great E. W. Dijkstra said, “The lurking suspicion that something could be simplified is the world’s richest source of rewarding challenges.”
There’s a lot of hype these days around predictive analytics, and maybe even more hype around the topics of “real-time predictive analytics” or “predictive analytics on streaming data”. Like most things that are over-hyped, what is actually meant by the term is often lost in the noise. In this case that’s really a shame, because these terms refer to at least two different things, either or both of which may be important in a given context.
What is Machine Learning from Streaming Data?
Generally, when I hear people talking about "machine learning from streaming data", they may be talking about one of two things.
- They want a model that takes into account recent history when it makes its predictions. A good example is the weather: if it has been sunny and 80 degrees the last two days, it is unlikely that it will be 20 and snowing the next day.
- They want a model that is updatable. That is, they want their model to in some sense “evolve” as data streams through their infrastructure. A good example might be a retail sales model that remains accurate as the business gets larger.
These two phenomena sound like the same thing, but they are potentially very different. The central question is whether the underlying source generating the data is changing. In the case of the weather, it really isn’t (okay, okay): given the weather from the previous few days you can usually make a pretty good guess at the weather for the next day, and your guess, given recent history, will be roughly the same from year to year. The same model from last year will work for this year.
In the case of the business, the underlying source is changing; the business is growing, and so your guess of the sales given the previous few days of sales is probably going to be different from last year. So last year’s data, when the business was small, is really not relevant to this year, when the business is large. We need to update the model (or scrap it completely and retrain) to get something that works.
The first case, where you want the prediction conditioned on history, I’m going to call time-series prediction. That problem deserves a post all its own, but it suffices to say that solutions to this problem revolve largely around feeding the prediction history to the model as input. That’s a massive oversimplification, but there’s plenty of information out there if you’re more interested.
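For instance, one very simple version of that idea is to turn the last couple of observations into explicit input columns, as in this sketch (the column and file names are made up):

```python
# A very simple sketch of feeding history to the model as input: lag features
# built from the previous two days' temperatures. Column names are made up.
import pandas as pd

weather = pd.read_csv("daily_weather.csv")     # placeholder file name
weather["temp_lag_1"] = weather["temperature"].shift(1)
weather["temp_lag_2"] = weather["temperature"].shift(2)
weather = weather.dropna()                     # the first two rows have no history
# "temperature" can now be predicted from the lagged columns (plus anything else).
```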
The second case, where you need to update the model or retrain completely, is about dealing with non-stationarity, and that’s largely what the rest of this post is going to be about. But consider this first lesson learned: Time series prediction and non-stationary data distributions are two different problems.
When I think about the second case above, a couple of classes of approaches jump to mind:
- Incremental Algorithms: These are machine learning algorithms that learn incrementally over the data. That is, the model is updated each time it sees a new training instance. There are incremental versions of support vector machines and neural networks, and Bayesian networks can be made to learn incrementally.
- Periodic Re-training with a batch algorithm: Perhaps the more straightforward solution. Here, we simply buffer the relevant data and retrain our model “every so often”.
Note that any incremental algorithm can work in a batch setting (by simply feeding the input instances in the batch into the algorithm one after another). The reverse, however, isn’t trivially true. Many batch algorithms can only be made to work incrementally with significant work or power sacrifices, and some things just can’t be done.
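To make the distinction concrete, here is a small sketch of both styles using scikit-learn; the specific learners are just stand-ins for any incremental or batch algorithm:

```python
# A small sketch contrasting the two styles with scikit-learn. SGDClassifier
# can be updated one instance at a time via partial_fit; the batch model is
# simply refit on a buffer of recent data "every so often".
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier

classes = np.array([0, 1])
incremental = SGDClassifier()
batch = RandomForestClassifier()
buffer_X, buffer_y = [], []

def on_new_instance(x, y):
    # Incremental: the model is updated with every new training instance.
    incremental.partial_fit(x.reshape(1, -1), [y], classes=classes)
    # Batch: just remember the instance; retraining happens separately.
    buffer_X.append(x)
    buffer_y.append(y)

def periodic_retrain():
    # Called "every so often": rebuild the batch model from the buffered data.
    batch.fit(np.array(buffer_X), np.array(buffer_y))
```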
So is the sacrifice worth it? Here are a couple of considerations to think about:
- Data Horizon: How quickly do you need the most recent datapoint to become part of your model? Does the next point need to modify the model immediately, or is this a case where the model needs to behave conditionally based on that point? If it is the latter, perhaps this is a time-series prediction problem rather than an incremental learning problem.
- Data Obsolescence: How long does it take before data should become irrelevant to the model? Is the relevancy somehow complex? Are some older instances more relevant than some newer instances? Is it variable depending on the current state of the data? Good examples come from economics; generally, newer data instances are more relevant. However, in some cases data from the same month or quarter from the previous year are more relevant than the previous month or quarter of the current year. Similarly, if it is a recession, data from previous recessions may be more relevant than newer data from a different part of the economic cycle.
With these two concerns in mind, along with architecture and implementation, you can get a pretty good idea of whether you’re looking at a problem where incremental learning is desirable or not.
Obviously, the shorter the data horizon, the more likely you are to want incremental learning. However, it’s a common mistake to confuse a short data horizon with a time-series prediction problem: If you want your model to behave differently based on the last few instances, the right thing to do is condition the behavior of the model on those instances. If you want the model to behave differently based on the last few thousand instances, you may want incremental learning.
This, however, is where the second concern rears its ugly head. Incremental learners all have some built-in parameter or assumption that controls the relevancy of old data. This parameter may or may not be modifiable and the relationship may be complex, but the algorithm will be making some implicit assumption about how relevant old data is. This is the second lesson: Be wary of the data relevance assumptions made by incremental learning algorithms!
By contrast, retraining in batches gives you lots of flexibility in this regard. It is easy to select the data for retraining, filter it by relevant criteria, or even weight it according to some relevancy function using one of the many batch training algorithms that take instance weights into account. There has even been some recent work on automatically detecting when retraining is necessary, based essentially on how different the incoming data is from recent previous data.
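One common relevancy function is a simple exponential decay on instance age; here is a sketch that works with any batch learner accepting per-instance weights (the data and half-life are placeholders):

```python
# A sketch of weighting training data by recency with exponential decay,
# for any batch learner that accepts per-instance weights.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def recency_weights(ages_in_days, half_life=30.0):
    # An instance half_life days old counts half as much as a brand-new one.
    return 0.5 ** (np.asarray(ages_in_days) / half_life)

X = np.random.randn(1000, 8)                   # placeholder features
y = np.random.randint(0, 2, size=1000)         # placeholder labels
ages = np.random.randint(0, 365, size=1000)    # days since each row arrived

model = RandomForestClassifier()
model.fit(X, y, sample_weight=recency_weights(ages))
```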
First of all, don’t confuse learning from streaming data with time series prediction; while the data sources from the two problems look similar, the two concerns are often orthogonal.
Incremental learning is great for two cases: First, simplicity. There’s no buffering and no explicit retraining of the model. Second, speed. You always have a model that’s up to date. You make sacrifices, however, in terms of the power of the model that you can learn and the flexibility of the model to incorporate old data to different degrees. There are also some corner cases where incremental learning is necessary, such as when data privacy demands that instances be discarded immediately after they are seen.
Periodic retraining requires more decisions and more complex implementation. However, you get all of the power of any supervised classification algorithm, and specialized tools can be built on top of it to allow you to retrain on only relevant data and only when necessary. It also offers the nice benefit of being able to plug-and-play different machine learning algorithms into your architecture with a minimum of hassle, as the learning bit is built completely from off-the-shelf algorithms.
BigML has mainly gone the second route, allowing you to easily upload new data and trigger retraining (either manually or via the BigML API) whenever you find it necessary. One thing on our to-do list is to implement some of those specialized tools on top of the BigML API, so that data can be streamed to BigML and the model is automatically retrained only when it needs to be. We’ll update you on our blog about any big changes!
EDIT: This article originally used the terms “classifier” and “predictor” more or less interchangeably, and they’re not really interchangeable as one astute reader pointed out. I’ve replaced both with the term “model”, which follows the convention we’ve been using at BigML. Sorry for any confusion that might have caused!