Our new version of the BigML backend has a lot of little improvements. One of these is the introduction of statistical pruning (or pessimistic pruning as seen in this slide deck).
“What on Earth is that?” Glad you asked.
To start out, let’s say you go to your favorite neighborhood restaurant this Friday. When you get there, there’s a 30-minute wait. Not fun at all. The next night, on Saturday, you go again and there’s no wait at all! Huzzah! Being the intrepid data scientist that you are, you think, “Ah ha! On Fridays I always have to wait and on Saturdays I never do! I’m only going out on Saturdays from now on!”
Nothing wrong with that logic, right? Well, maybe a little. You’re trying to draw a conclusion from only two data points (Friday and Saturday). What if it was raining on Saturday? What if the circus (which, of course, is always filled with hungry acrobats) was in town on Friday? It might not be the day that’s the important factor; it could be something else entirely, or it could just be random chance. Of course, you have no way of knowing, because you only have two data points.
Now suppose you keep doing this, week after week, for years. Every Friday you see the same wait and every Saturday there’s no wait at all. That logic from above is starting to look pretty good, because you have more data to support it. Said another way, conclusions drawn from more data tend to be stronger. Statisticians sometimes quantify a prediction’s uncertainty using a tried and true technique called a confidence interval, which takes into account how much data you have to support the prediction.
We can use these statistical techniques to “prune” your decision tree for maximum accuracy. Think of it this way: Each node in your decision tree has two or more children. The decision tree learning algorithm guarantees that the children have better prediction accuracy on the training data than the parent node does. On the other hand, each child also has less data to support its prediction, so that prediction is more uncertain. If the increase in accuracy doesn’t offset the increase in uncertainty, we eliminate (prune) the child nodes and use the parent’s prediction instead.
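If you like to think in code, here’s a rough sketch of that prune rule in Python. To be clear, this is just an illustration of the idea, not BigML’s actual implementation: the `Node` structure and the simple normal-approximation error bound are stand-ins (the Wilson score version the backend actually uses shows up below).

```python
# Illustrative sketch only; not BigML's actual code. `Node` is a hypothetical
# structure holding how many training points reach the node, how many of them
# its prediction gets wrong, and its children.
from dataclasses import dataclass, field
from math import sqrt

@dataclass
class Node:
    count: int                 # training points reaching this node
    errors: int                # of those, how many the node's prediction misses
    children: list = field(default_factory=list)

def error_upper_bound(errors, count, z=1.96):
    # Pessimistic error estimate: the observed error rate plus a margin that
    # grows as the supporting data shrinks. This is a plain normal-approximation
    # bound; the Wilson score version appears below.
    p = errors / count
    return p + z * sqrt(p * (1 - p) / count)

def should_prune(node):
    if not node.children:
        return False
    parent_bound = error_upper_bound(node.errors, node.count)
    # The children are more accurate on the training data, but each one has
    # less data behind it, so its bound is wider. Weight each child by its
    # share of the parent's data and compare.
    children_bound = sum(
        (c.count / node.count) * error_upper_bound(c.errors, c.count)
        for c in node.children
    )
    return children_bound >= parent_bound

# True means the accuracy gain doesn't beat the added uncertainty, so we'd
# fall back to the parent's prediction.
print(should_prune(Node(100, 20, [Node(60, 8), Node(40, 9)])))
```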
The technical details here are that we’re using the Wilson score interval to estimate classification uncertainty and standard error plus variance to estimate regression uncertainty. There are other ways it could be done, but these work well in practice.
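For the curious, here’s a small Python sketch of the Wilson score interval, applied to the restaurant example from above. The 95% confidence level and the visit counts are illustrative choices, not the backend’s actual settings, and the regression-side estimate isn’t sketched here.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 for roughly 95%)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (center - margin, center + margin)

# "No wait" on 1 out of 1 Saturdays vs. 104 out of 104 Saturdays:
print(wilson_interval(1, 1))      # about (0.21, 1.0): one visit tells you little
print(wilson_interval(104, 104))  # about (0.96, 1.0): two years is far more convincing
```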
To give you more control over pruning, we’ve introduced a three-way selector in the model configuration panel (available while you’re viewing your dataset). The two options to the right are “pruning” and “no pruning”. The cryptic “ML” option on the left (which is the default) tries to balance pruning with the quality of the visualization that we provide: If a node represents a significant part of your data by percentage (more than half a percent or so), we won’t prune it away even if it might be a little shaky, just in case you’re interested in seeing it. If you’re after maximum predictive accuracy rather than the visualization, go with the “pruning” option.
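Put in code terms (again just a sketch, reusing `should_prune` from the sketch above and treating the half-percent figure as a rough number rather than an exact setting), the default option behaves something like this:

```python
# Sketch of the default ("ML") behavior: keep statistically shaky nodes anyway
# when they cover a noticeable slice of the data, so they still show up in the
# tree visualization. The 0.005 threshold is the "half a percent or so" above.
VISUALIZATION_THRESHOLD = 0.005

def should_prune_default(node, total_count):
    if node.count / total_count > VISUALIZATION_THRESHOLD:
        return False               # big enough to be worth seeing; keep it
    return should_prune(node)      # otherwise, apply the statistical test
```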
So if your trees seem to have fewer nodes now, it’s just because we’re trying to get rid of nodes that are jumping to conclusions a little too eagerly. Build a model and try it out!