We here at BigML are big fans of ensemble algorithms (and Ronald Reagan movies). Using them, a simple model like a BigML decision tree can be leveraged into a very high-powered predictor. During my regular surveys of machine learning research literature, however, I’ve noticed that a very popular class of ensemble algorithms, boosters, has been getting some bad press lately, and I thought I’d offer our readers a brief synopsis.
The tl;dr for this post is absolutely not “boosting is bad”. Adaptive boosting still gives excellent performance in a ton of important applications. In the cases where it fails, another type of ensemble algorithm, even another type of booster, usually succeeds. The overall moral is more about complexity. Boosting is a more complex way of creating an ensemble than are the two types we provide at BigML: Bootstrap Aggregation and Random Decision Forests. We use the simpler methods because they are faster, more easily parallelized, and don’t have some of the weaknesses that boosting has. Sometimes, less is more in machine learning.
A final caveat before I begin: This post is fairly technical. If you’re not up for a deep dive into some machine learning papers at the moment, you might want to check this out.
Learning to Rank
The first paper I’ll mention is Killian Weinberger’s paper on the Yahoo! Learning to Rank Challenge. Not surprisingly, the paper is about learning to rank search query results by relevance. One of the surprising results is that boosting is outperformed by random forests for a variety of parameter settings. This is surprising because gradient boosted regression trees are one of the commercially deployed algorithms in web search ranking.
The authors manage to “fix” boosting by using a two-phase algorithm in which the gradient booster is initialized with the model generated by the random forest algorithm. Still, however, the fact that plain ol’ boosting gets beat by random forests in such an important and common application is tough to ignore.
Bayesian Model Combination
Another recent paper that I like a lot (and that hasn’t gotten the attention it deserves) is Kristine Monteith and Tony Martinez’s Paper on Bayesian Model Combination. In this paper, we see that boosting outperforms bootstrap aggregation slightly when performance is averaged over a large number of public datasets. This is expected.
However, the authors then try a fairly simple augmentation of the bagged model: They essentially choose random weighted combinations of the models learned by bagging and pick several of the best performing combinations to compose their final model (note that this is a bit of an oversimplification; the variety of randomness they use is a good deal fancier than your garden-variety coin-flipping randomness, and “choosing the best performers” is also done in a clever fashion). They find that this simple step gives the bagged model a significant performance boost; enough to outperform boosting on average.
This is particularly hard on boosting because the final output of boosting is more or less the same as the final output of the weighted, bagged model; that is, a weighted combination of trees. The fact that we can produce weights better than boosting by selecting them at random suggests that the reason boosting is better than bagging is because it is allowed to weight its constituent models, and not because of the additional complexity used to choose those weights.
The last and scariest result comes from a paper that is close to five years old now by Phil Long and Rocco Servido. This paper is easily the most theoretical and mathematical of the bunch, but within all the math is a not-too-complex idea.
Suppose we have a training data set with a noise example. That is, a training data set where one of the instances is accidentally mislabeled. This is extremely common in data analysis: Maybe in your customer survey, someone clicked the “would recommend” button instead of the “would not recommend” button, or the technical support staff marked a permanently broken part as being “repaired”. In other words, data is rarely completely free of errors.
The authors show that, if you carefully construct a dataset, you can make certain types of boosters (ones that optimize convex potential functions, for those who care) produce models that perform no better than chance by adding just a single carefully selected noise instance. This is a pretty rough result for boosting. The fact that you can destroy an algorithm’s performance by moving a single training instance around means that the algorithm is a lot more fragile than we previously thought.
Now of course this should be taken with a grain of salt. The authors have gone through a lot of trouble to cause a lot of trouble for boosting, and maybe nothing like the data set they’ve constructed really exists in the real world. However, in my experience (and that of others cited in the paper), boosting really is fragile in some cases, and combined with the other two results above this fragility starts to come into clearer focus.
So what’s the point? I still like boosting. It’s still got a lot of good theory and even more good empirical results behind it. You’re probably using boosted models every day without even knowing it. But the three results above show that even a great tool like boosting can’t escape the No Free Lunch Theo- rem: Boosting provides performance benefits at times by using additional algorithmic complexity, but that additional complexity causes weaknesses at other times
Those weaknesses can sometimes be remedied by using techniques that are simpler and faster, and we’re always looking for such remedies at BigML. As the great E. W. Dijkstra said, “The lurking suspicion that something could be simplified is the world’s richest source of rewarding challenges.”