Everything You Wanted to Know About Machine Learning, But Were Too Afraid To Ask (Part One)
Recently, Professor Pedro Domingos, one of the top machine learning researchers in the world, wrote a great article in the Communications of the ACM entitled “A Few Useful Things to Know about Machine Learning”. In it, he not only summarizes the general ideas in machine learning in fairly accessible terms, but he also manages to impart most of the things we’ve come to regard as common sense or folk wisdom in the field.
It’s a great article because it’s a brilliant man with deep experience who is an excellent teacher writing for “the rest of us”, and writing about things we need to know. And he manages to cover a huge amount of ground in nine pages.
Now, while it’s very light reading by the standards of the academic literature, it’s fairly dense by any other. Since so much of it is relevant to anyone trying to use BigML, I’m going to try to give our readers the Cliff’s Notes version right here in our blog, with maybe a few more examples and a little less academic terminology. Often I’ll be rephrasing Domingos, and I’ll indicate where I’m quoting directly.
How Does Machine Learning Work?
We know that supervised machine learning models (the ones you build at BigML) can predict one field in your data (the objective field) from some or all of the others (the input fields). But how does the model building process actually work? All machine learning algorithms (the ones that build the models) basically consist of the following three things:
- A set of possible models to look through
- A way to test whether a model is good
- A clever way to find a really good model with only a few tests
A good analogy here might be trying to find a really good restaurant in your neighborhood: Your set of possibilities is all of the restaurants in your neighborhood, and to test if the restaurant is good you can have a meal there.
Our third necessity is a bit more elusive; you want to find a good restaurant without having to eat at every single one. How best to be clever about it? Maybe you take a look at the menu, or the outside of the building, or the surrounding neighborhood. Maybe you ask some people you trust. In any case, you know a lot of tricks that will let you find a good restaurant without trying all of them. You may not find the best one; that would be a lot more work, and is probably not necessary. But you could probably find one that’s pretty good.
With machine learning, things are even more tricky: The set of possible models for any real machine learning algorithm is big. Really, really big. In many cases, it’s not even finite. Fortunately, we already know loads of clever tricks for making the search manageable (we are using many of them here at BigML!).
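To make the three ingredients concrete, here is a minimal sketch in Python. Everything in it is an assumption made up for illustration (the toy data, the 0.6 rule, and the function names); it is not how BigML builds models. The set of models is every rule of the form "predict True when x is above some threshold", which is an infinite set. The test is plain accuracy. The clever trick is noticing that only thresholds sitting at observed values can change the predictions, so a handful of tests covers the whole infinite set.

```python
import random

def make_rows(n, seed):
    """Toy data: one input field x, objective is True when x > 0.6 (made-up rule)."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    return [(x, x > 0.6) for x in xs]

# 1. The set of possible models: one rule per threshold (infinitely many).
def make_model(threshold):
    return lambda x: x > threshold

# 2. A way to test whether a model is good: fraction of rows it gets right.
def accuracy(model, rows):
    return sum(model(x) == y for x, y in rows) / len(rows)

# 3. A clever search: only thresholds at observed values (plus one below
#    all of them) can give different predictions, so test just those.
def find_good_model(rows):
    xs = sorted(x for x, _ in rows)
    candidates = [xs[0] - 1.0] + xs
    best = max(candidates, key=lambda t: accuracy(make_model(t), rows))
    return best, len(candidates)

train = make_rows(50, seed=1)
unseen = make_rows(1000, seed=2)

threshold, tests_run = find_good_model(train)
model = make_model(threshold)
print("models tested:", tests_run)
print("accuracy on training data:", accuracy(model, train))
print("accuracy on unseen data:", accuracy(model, unseen))
```

The last line is the pudding: the model was chosen using only the training rows, and the unseen rows show whether it carries over.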
Once you’ve found a good model, “the proof is in the pudding”, as they say. You hope that the model you’ve found will also apply to data outside of your original dataset, so you can make predictions from the inputs when you don’t know the objective.
Unfortunately, even with this fairly simple idea, there are many ways that learning algorithms fail, and things you can do to make sure they don’t. Domingos visits several of the most common ones in his paper.
Overfitting Has Many Faces
The general moral of this section of the paper is to always measure the performance of your classifier on out-of-sample data. That is, to know if your model is good, you must do at least one test where you train on some of your data and test on the rest. Even if that test succeeds, to have real knowledge you must do several (where you split the data differently each time). If your data is ordered in time somehow (weekly retail sales, for example), you might train on all of the weeks except one and test on that week, and you should repeat this for each of the weeks in your data.
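The week-by-week test described above can be sketched in a few lines of Python. The sales figures, the field layout, and the deliberately trivial "predict the average" model are all hypothetical stand-ins; the point is only the shape of the loop, where each week takes one turn as the held-out test set.

```python
def leave_one_group_out(rows, group_of):
    """Yield (group, train_rows, test_rows) once per distinct group."""
    for group in sorted({group_of(r) for r in rows}):
        train = [r for r in rows if group_of(r) != group]
        test = [r for r in rows if group_of(r) == group]
        yield group, train, test

def fit_mean(train):
    """Stand-in model: always predict the average sales seen in training."""
    mean = sum(sales for _, _, sales in train) / len(train)
    return lambda row: mean

def mean_abs_error(model, test):
    return sum(abs(model(row) - row[2]) for row in test) / len(test)

# Hypothetical rows: (week, store_visits, sales)
rows = [(week, visits, 2.0 * visits + week)
        for week in range(1, 5)
        for visits in (10, 20, 30)]

errors = {}
for week, train, test in leave_one_group_out(rows, lambda r: r[0]):
    model = fit_mean(train)          # train on every other week
    errors[week] = mean_abs_error(model, test)  # test on the held-out week

for week, err in errors.items():
    print("held out week", week, "mean absolute error", round(err, 2))
```

Looking at the errors per week, rather than one overall number, is what tells you whether the model fails everywhere or only in particular weeks.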
There is no such thing as too many of these training and testing splits. You should even make some predictions on data you imagine yourself, to see what the model does in certain situations. It is only by doing this that you will understand how good your model is, and what sort of mistakes it makes when it makes them.
Domingos focuses on a particular way of analyzing the errors of your model, called bias-variance decomposition, but BigML provides many other ways of understanding the performance of your model on out-of-sample data. You can see them all when you create an evaluation.
Intuition Fails in High Dimensions
So you have a model, with some input fields, and some objective field, and it is not performing as well as you’d like. One of your first intuitions might be to add more input fields. After all, if the model can do as well as it does with the input fields you have, surely if you give it more information it will do better, right?
Not so fast. While this might work if you add a particularly useful input field (that is, one that helps you predict the objective very well), this often doesn’t work. In fact, if you add input fields that are not useful, or redundant ones that contain information already in other input fields, you may very well get a model that performs worse than the original. One reason is that the more input fields you have, the more likely that the model will see some relationship between one of them and the objective that isn’t real, but just the product of random noise. Of course, this will make the model think that field is important when it is not, and so the model will be the worse because of it.
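You can watch this happen with a small simulation. In the sketch below (an illustration with made-up sizes, not a BigML feature), the objective and every input field are pure random noise, so no field has any real relationship with the objective. Still, the strongest apparent correlation can only grow as more fields are considered, because each new field is another chance for noise to line up by accident.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

rng = random.Random(42)
n_rows = 30      # assumed dataset size
n_fields = 200   # assumed number of useless input fields

objective = [rng.gauss(0, 1) for _ in range(n_rows)]
noise_fields = [[rng.gauss(0, 1) for _ in range(n_rows)]
                for _ in range(n_fields)]

# Strength of the (entirely accidental) relationship for each field.
strengths = [abs(pearson(field, objective)) for field in noise_fields]

for k in (1, 10, 50, 200):
    print("best of first", k, "noise fields:", round(max(strengths[:k]), 2))
```

A learning algorithm that sees the winning field has no way to know the relationship is an accident, which is exactly how a useless field ends up looking important.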
What this means in practice is that as you add more and more input fields, you must also add more and more training data to “fill up” the space created by the additional inputs if you want to use them accurately.
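Here is one rough way to picture the space that needs filling, as a sketch with assumed numbers. Chop each input field into 10 ranges, so that every row lands in one cell of a grid, and count how much of the grid a fixed 1,000 rows can touch. Each extra field multiplies the number of cells by 10, while the data stays the same size.

```python
import random

def occupied_fraction(n_rows, n_fields, bins=10, seed=0):
    """Fraction of grid cells that contain at least one random row."""
    rng = random.Random(seed)
    cells = set()
    for _ in range(n_rows):
        # Each field is uniform in [0, 1); int() picks its range 0..bins-1.
        cell = tuple(int(rng.random() * bins) for _ in range(n_fields))
        cells.add(cell)
    return len(cells) / bins ** n_fields

for n_fields in (1, 2, 3, 6):
    fraction = occupied_fraction(1000, n_fields)
    print(n_fields, "fields:", bins_total := 10 ** n_fields, "cells,",
          "fraction with data:", fraction)
```

With six fields there are a million cells, so 1,000 rows can reach at most one cell in a thousand, and the model has to guess about all the rest.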
This is known as the curse of dimensionality in machine learning, and it is only one problem with having too many input fields. You can combat this problem in your own modeling efforts by selecting as inputs only the fields you know are relevant to predicting the objective. You can do this by opening up the configuration panel when you create your model and deselecting the less relevant fields.
Theoretical Guarantees Are Not What They Seem
Many machine learning papers offer fancy mathematics that show that if your training set is a certain size, you can guarantee the error of your model will be better than some number. Often, people see these guarantees and think, “Wow, with such fancy math behind it, this algorithm must surely be better than all others”. The problem is that these guarantees are more often than not irrelevant in practice because the guaranteed number is so pathetic that any reasonable algorithm will hit it. In fact, some recent work showed that some of these theoretical guarantees are useless not only in practice, but even in theory.
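As one concrete illustration, take a standard textbook guarantee of this kind (my choice of example; it is an assumption that this is representative of the guarantees meant here). If a learner picks any model that fits all m training examples out of a finite set of models H, then with probability at least 1 - delta its true error is at most epsilon, provided m is at least (ln|H| + ln(1/delta)) / epsilon. The sketch below plugs in the set of all yes/no functions over d yes/no input fields, of which there are 2^(2^d), and compares the required m with the number of distinct rows that can exist at all.

```python
import math

def examples_needed(ln_num_models, epsilon, delta):
    """Training set size demanded by the bound m >= (ln|H| + ln(1/delta)) / epsilon."""
    return math.ceil((ln_num_models + math.log(1.0 / delta)) / epsilon)

def ln_count_boolean_functions(n_fields):
    """ln of 2 ** (2 ** n_fields), computed without building the huge number."""
    return (2 ** n_fields) * math.log(2.0)

for n_fields in (5, 10, 20):
    needed = examples_needed(ln_count_boolean_functions(n_fields),
                             epsilon=0.05, delta=0.05)
    distinct_rows = 2 ** n_fields
    print(n_fields, "fields:", distinct_rows, "possible rows,",
          needed, "examples needed for the guarantee")
```

For this set of models the guarantee asks for more examples than there are possible rows, which is a fair picture of how loose such bounds can be.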
In my machine learning career, I’ve seen a similar phenomenon with people becoming attached to families of algorithms. That is, “Algorithm X cannot be good because it is not a neural network” or “it is not a support vector machine” or what have you. For example, our illustrious CEO recalls a failed machine learning project at a major company. The reason for the failure? A high-level manager thought that the type of classifier being used was inferior to one he had heard about in an undergraduate class years before. Unfortunately, his lack of solid evidence didn’t translate into a lack of authority and the project was scrapped at a late stage, costing the company millions of dollars in wasted effort.
All such biases for and against algorithms are at best misleading. The only certain way (that we know of now) to know if an algorithm will model your data well is to try it out. While you may know some data-specific things that may help you select a machine learning algorithm before trying it out, you may be surprised by the results anyway. Come to the process with no preconceptions and you are likely to find the best answer.
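Trying it out can be as simple as running every candidate through the same out-of-sample test and comparing the numbers. The sketch below does that for three deliberately tiny learners on made-up data; the learners, the data, and the function names are all assumptions for illustration, and a real comparison would use real algorithms and your own dataset.

```python
import random

def fit_mean(train):
    """Always predict the average objective value."""
    mean = sum(y for _, y in train) / len(train)
    return lambda x: mean

def fit_line(train):
    """Least-squares straight line through the training rows."""
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    sxx = sum((x - mx) ** 2 for x, _ in train)
    slope = sum((x - mx) * (y - my) for x, y in train) / sxx
    return lambda x: my + slope * (x - mx)

def fit_nearest_neighbor(train):
    """Predict the objective of the closest training row."""
    return lambda x: min(train, key=lambda row: abs(row[0] - x))[1]

def cross_validated_error(fit, rows, folds=5):
    """Mean absolute error over folds; every row is tested exactly once."""
    total = 0.0
    for k in range(folds):
        test = rows[k::folds]
        train = [r for i, r in enumerate(rows) if i % folds != k]
        model = fit(train)
        total += sum(abs(model(x) - y) for x, y in test)
    return total / len(rows)

rng = random.Random(1)
xs = [rng.uniform(-1, 1) for _ in range(100)]
rows = [(x, x * x + rng.gauss(0, 0.05)) for x in xs]  # made-up curved data

learners = {"mean": fit_mean, "line": fit_line,
            "nearest neighbor": fit_nearest_neighbor}
results = {name: cross_validated_error(fit, rows)
           for name, fit in learners.items()}

for name, err in sorted(results.items(), key=lambda item: item[1]):
    print(name, round(err, 3))
```

Because every learner faces exactly the same splits, the ranking comes from the data rather than from anyone's opinion of the algorithm family.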
Let’s take a breather. In the next post I’ll review the rest of the paper, outlining a few more of the things to do and the things to avoid when modeling your data. See you then!