The Three Cardinal Virtues of Ensemble Learning

Posted by

Ensemble algorithms.  Why do they work? It seems like kind of a crazy idea: Let’s take our original learning problem, create a bunch of new problems that aren’t quite the same but very related, and learn a model on each new problem. Then, when we want to classify a new point, we classify it using all of these models and combine predictions, usually using some form of voting.

Why bother with all this? Why not learn a single model over the original problem and be done with it?

Our Chief Scientist, Professor Tom Dietterich, has an older but excellent and still relevant tutorial paper that addresses this very question (you might also be interested in some of his other tutorials).  As with our previous review of an interesting piece of machine learning literature, this paper is fairly lightweight for the academic literature but somewhat heavy compared to your average nancy drew mystery.  So I’ll give you a little rundown here that should be a little easier to follow for the non-experts.

In that previous paper review, I talked a little bit about how machine learning algorithms looking for good models are a bit like people trying to find a good restaurant.  People aren’t going to eat at every restaurant in an attempt to find the best one.  They’re going to do things like ask friends and look at menus to filter out obvious losers, then eat at the ones they think have a pretty good chance of being good.

Now, if we extend that analogy, an ensemble learning algorithm is like a group of people all looking for a good restaurant in the same area, and getting together after their search to vote on the best one.  Dietterich identifies three important reasons that this can be better than one person looking on her own.  I like to think of these reasons as being related to three of the cardinal virtues in greek philosophy (mostly as one of my many attempts to amuse myself).

Temperance

temperance

One person (whom we’ll call our singleton restaurant critic) is trying determine the best restaurant in the city.  She narrows things down to, say, five really great restaurants.  Among these great restaurants, the differences are so small that even a misplaced fork can change the critic’s decision.

The problem here is that our critic only has one vote, and must pick one, even though all seem equally good.  If we had many critics, each eating a meal at each one, the differences might become more obvious.  Maybe that fork is misplaced every night.  Or maybe it’s not, but occasionally something is way over-cooked, or someone gets food poisoning.  This will introduce votes against that restaurant into the ensemble, and finally make it less likely to be chosen in the aggregate over the restaurants that don’t make such mistakes.

This is the statistical advantage described by Dietterich.  One problem produces only one guess at the correct model, and if there are many models that look good given the training data, we have an “all of our eggs in one basket”-style risk of picking the wrong model.  By defining many related problems and voting on the correct answers, we reduce this risk of being too hasty in our choice of model.

Prudence

prudence

Who is our singleton restaurant critic? Probably, as a human being, they have tastes like anyone else. Suppose she has a taste for Chinese food. This person might label a Chinese restaurant as the best, even if there’s a slightly better burrito place around the corner, just because of personal preference.

When you’ve got a group of critics, this is a lot less likely to happen. Some people might like Chinese food best, or hamburgers, or soup, but if all of them put the burrito place in second, you’ve got a pretty good idea that the burritos have the highest overall appeal.

This is the representational advantage for ensembles; even if it’s not in the “vocabulary” of the constituent models to find a great solution, a much better solution may well be in the vocabulary of averages of the constituent models. It’s the machine learning version of the wisdom of the crowd: While any individual might be foolish, the aggregate can sometimes make the right call with remarkable accuracy.

Fortitude

fortitude

Our singleton restaurant critic has heard from one of her friends that the best food in town is in three neighborhoods. As her money and time are limited, she decides that, rather than look all over the city, she’s going to focus her search on these three neighborhoods.

But it turns out that, while this is true, there’s someone doing something amazing with gourmet hot dogs in another part of the city. An army of searchers, all with different friends, is likely to encounter that restaurant at least a few times.

This is what Dietterich calls the computational advantage for ensembles. Some learning algorithms may make decisions early on in their search that lead them to get stuck in local minima, meaning they find a solution that looks very good compared to all the ones they’ve seen, but there’s a better solution lurking elsewhere.  By creating slightly different versions of the problem (different sets of friends, say), you make it likely that at least some of your critics will find that lurking solution.  While individual learners are too slothful to explore the entire search space, an army does a better job.

The Rest of the Story

So we know it’s good to define multiple related learning problems and vote the answers.  How do we do that?  Section Two of Dietterich’s paper goes over some of the best ways we know.  Section three gives some brief experiments that show some of the weaknesses of some of those methods that we’ve mentioned in this space before.  I won’t go over the rest of the paper here, but I do encourage people interested in creating these ensembles to read that section as it’s an excellent introduction for those looking to go a little deeper into ensembles.

At BigML, you can already create some types of these ensembles via the command line, and we’ll soon bring this to the interface.  Stay tuned!

2 comments

Leave a comment