Everything You Wanted to Know About Machine Learning, But Were Too Afraid To Ask (Part One)

Recently, Professor Pedro Domingos, one of the top machine learning researchers in the world, wrote a great article in the Communications of the ACM entitled “A Few Useful Things to Know about Machine Learning”. In it, he not only summarizes the general ideas in machine learning in fairly accessible terms, but he also manages to impart most of the things we’ve come to regard as common sense or folk wisdom in the field.

It’s a great article because it’s written by a brilliant man with deep experience, an excellent teacher writing for “the rest of us” about things we need to know. And he manages to cover a huge amount of ground in nine pages.

Now, while it’s very light reading by the standards of the academic literature, it’s fairly dense by most others. Since so much of it is relevant to anyone trying to use BigML, I’m going to try to give our readers the Cliff’s Notes version right here in our blog, with maybe a few more examples and a little less academic terminology. Often I’ll be rephrasing Domingos, and I’ll indicate where I’m quoting directly.

How Does Machine Learning Work?

We know that supervised machine learning models (the ones you build at BigML) can predict one field in your data (the objective field) from some or all of the others (the input fields). But how does the model building process actually work? All machine learning algorithms (the ones that build the models) basically consist of the following three things:

A set of possible models to look through
A way to test whether a model is good
A clever way to find a really good model with only a few tests

A good analogy here might be trying to find a really good restaurant in your neighborhood: Your set of possibilities is all of the restaurants in your neighborhood, and to test if the restaurant is good you can have a meal there.

Our third necessity is a bit more elusive; you want to find a good restaurant without having to eat at every single one. How best to be clever about it? Maybe you take a look at the menu, or the outside of the building, or the surrounding neighborhood. Maybe you ask some people you trust. In any case, you know a lot of tricks that will let you find a good restaurant without trying all of them. You may not find the best one; that would be a lot more work, and is probably not necessary. But you could probably find one that’s pretty good.

With machine learning, things are even more tricky: The set of possible models for any real machine learning algorithm is big. Really, really big. In many cases, it’s not even finite. Fortunately, we already know loads of clever tricks for making the search manageable (we are using many of them here at BigML!).
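The three ingredients can be sketched with a toy learner. This is only an illustration under simple assumptions, not how BigML builds models: the "models" are thresholds on a single numeric input, the test is accuracy on the data, and the search trick is to try only the midpoints between neighboring values instead of every real number. All function names here are made up for the example.

```python
def accuracy(threshold, xs, ys):
    """The test: fraction of rows where the rule (x > threshold) matches the label."""
    correct = sum((x > threshold) == y for x, y in zip(xs, ys))
    return correct / len(xs)


def candidate_thresholds(xs):
    """The search trick: the set of possible thresholds is infinite, but any two
    thresholds between the same pair of neighboring values make identical
    predictions, so one midpoint per gap is enough.
    (Thresholds below the minimum or above the maximum are ignored here.)"""
    values = sorted(set(xs))
    return [(a + b) / 2 for a, b in zip(values, values[1:])]


def fit_threshold(xs, ys):
    """Return the candidate threshold with the highest accuracy."""
    return max(candidate_thresholds(xs), key=lambda t: accuracy(t, xs, ys))


# Toy data: one input field and a true/false objective field.
xs = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
ys = [False, False, False, True, True, True]

best = fit_threshold(xs, ys)
print(best, accuracy(best, xs, ys))
```

Real algorithms search far richer sets of models, but the shape is the same: a space of candidates, a score, and a shortcut that avoids scoring everything.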

Once you’ve found a good model, “the proof is in the pudding”, as they say. We hope that the model we’ve found will also apply to data outside of your original dataset, so you can make predictions from the inputs when you don’t know the objective.

Unfortunately, even with this fairly simple idea, there are many ways that learning algorithms fail, and things you can do to make sure they don’t. Domingos visits several of the most common ones in his paper.

Overfitting Has Many Faces

The general moral of this section of the paper is to always measure the performance of your classifier on out-of-sample data. That is, to know if your model is good, you must do at least one test where you train on some of your data and test on the rest. Even if that test succeeds, to have real knowledge you must do several (where you split the data differently each time). If your data is placed in time somehow (weekly retail sales, for example), you might do a test where you train on all of the weeks except one, and test on that week, and you should do this for each of the weeks in your data.
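The weekly scheme described above might look roughly like this in code. It is a minimal sketch with invented data and field names ("week", "sales"), and the "model" is just the mean of the training rows, standing in for whatever model you would really train.

```python
def leave_one_group_out(rows, group_of):
    """Yield (held_out_group, train_rows, test_rows) once per distinct group."""
    for group in sorted({group_of(r) for r in rows}):
        train = [r for r in rows if group_of(r) != group]
        test = [r for r in rows if group_of(r) == group]
        yield group, train, test


def mean_baseline_error(train, test):
    """Stand-in model: predict the training mean, report mean absolute error."""
    prediction = sum(r["sales"] for r in train) / len(train)
    return sum(abs(r["sales"] - prediction) for r in test) / len(test)


# Hypothetical weekly retail data; week 3 behaves differently from the rest.
rows = [
    {"week": 1, "sales": 10.0}, {"week": 1, "sales": 12.0},
    {"week": 2, "sales": 11.0}, {"week": 2, "sales": 13.0},
    {"week": 3, "sales": 30.0}, {"week": 3, "sales": 32.0},
]

# Train on every week except one, test on that week, and repeat for each week.
errors = {
    week: mean_baseline_error(train, test)
    for week, train, test in leave_one_group_out(rows, lambda r: r["week"])
}
for week, error in errors.items():
    print(f"held-out week {week}: mean absolute error {error}")
```

Looking at the error per held-out week, rather than one overall number, is what shows you which situations the model handles badly.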

There is no such thing as too many of these training and testing splits. You should even make some predictions on data you make up yourself, to see what the model does in certain situations. It is only by doing this that you will understand how good your model is, and what sort of mistakes it makes when it makes them.

Domingos focuses on a particular way of analyzing the errors of your model, called bias-variance decomposition, but BigML provides many other ways of understanding the performance of your model on out-of-sample data. You can see them all when you create an evaluation.

Intuition Fails in High Dimensions

So you have a model, with some input fields, and some objective field, and it is not performing as well as you’d like. One of your first intuitions might be to add more input fields. After all, if the model can do as well as it does with the input fields you have, surely if you give it more information it will do better, right?

Not so fast. While this might work if you add a particularly useful input field (that is, one that helps you predict the objective very well), it often doesn’t. In fact, if you add input fields that are not useful, or redundant ones that contain information already in other input fields, you may very well get a model that performs worse than the original. One reason is that the more input fields you have, the more likely it is that the model will see some relationship between one of them and the objective that isn’t real, but just the product of random noise. This makes the model treat that field as important when it is not, and the model is worse for it.

What this means in practice is that as you add more and more input fields, you must also add more and more training data to “fill up” the space created by the additional inputs if you want to use them accurately.
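A small simulation can make the spurious-relationship effect concrete. In this sketch every input field and the objective are pure random noise, so any correlation between them is an accident; the row count, field counts, and seed are arbitrary choices for illustration.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)  # fixed seed so the run is repeatable
n_rows, n_fields = 30, 200

# The objective and every input field are independent noise.
objective = [rng.gauss(0, 1) for _ in range(n_rows)]
noise_fields = [[rng.gauss(0, 1) for _ in range(n_rows)] for _ in range(n_fields)]

# How strongly does each useless field appear to "predict" the objective?
strengths = [abs(pearson(field, objective)) for field in noise_fields]

best_of_5 = max(strengths[:5])   # strongest accidental correlation among 5 fields
best_of_200 = max(strengths)     # strongest among all 200 fields
print(f"best of 5 noise fields:   {best_of_5:.2f}")
print(f"best of 200 noise fields: {best_of_200:.2f}")
```

The strongest accidental correlation can only go up as more noise fields are added. Raising n_rows while keeping n_fields fixed tends to shrink these accidental correlations, which is the "more training data" half of the trade-off.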

This is known as the curse of dimensionality in machine learning, and it is only one problem with having too many input fields. You can combat this problem in your own modeling efforts by selecting as inputs only the fields you know are relevant to predicting the objective. You can do this by opening up the configuration panel when you create your model and deselecting the less relevant fields.

Theoretical Guarantees Are Not What They Seem

Many machine learning papers offer fancy mathematics that show that if your training set is a certain size, you can guarantee the error of your model will be better than some number. Often, people see these guarantees and think, “Wow, with such fancy math behind it, this algorithm must surely be better than all others”. The problem is that these guarantees are more often than not irrelevant in practice because the guaranteed number is so pathetic that any reasonable algorithm will hit it. In fact, some recent work showed that some of these theoretical guarantees are useless not only in practice, but even in theory.
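To get a feel for how loose such a guarantee can be, here is one textbook example, which is not necessarily one of the results the paragraph above has in mind. For a learner that picks a model consistent with the training data out of a finite set of models, the classic bound says that (ln(number of models) + ln(1/delta)) / epsilon training examples are enough to have error at most epsilon with probability at least 1 - delta. The sketch below plugs in the set of all true/false functions of 10 true/false inputs; the function name and the chosen numbers are illustrative.

```python
import math


def pac_sample_bound(log_num_models, epsilon, delta):
    """Examples sufficient for error <= epsilon with probability >= 1 - delta,
    for a learner consistent with its training data over a finite model set.
    Takes ln(number of models) because the count itself can be astronomically large."""
    return math.ceil((log_num_models + math.log(1 / delta)) / epsilon)


n_inputs = 10
distinct_rows = 2 ** n_inputs                 # every possible input combination
log_num_models = distinct_rows * math.log(2)  # ln(2 ** (2 ** n_inputs))

needed = pac_sample_bound(log_num_models, epsilon=0.05, delta=0.05)
print(f"examples the bound asks for: {needed}")
print(f"distinct input rows that exist: {distinct_rows}")
```

Under these assumptions the bound asks for many more examples than there are distinct rows to see, so it says little about how a sensible algorithm behaves on a realistic training set.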

In my machine learning career, I’ve seen a similar phenomenon with people becoming attached to families of algorithms. That is, “Algorithm X cannot be good because it is not a neural network” or “it is not a support vector machine” or what have you. For example, our illustrious CEO recalls a failed machine learning project at a major company. The reason for the failure? A high-level manager thought that the type of classifier being used was inferior to one he had heard about in an undergraduate class years before. Unfortunately, his lack of solid evidence didn’t translate into a lack of authority and the project was scrapped at a late stage, costing the company millions of dollars in wasted effort.

All such biases for and against algorithms are at best misleading. The only certain way (that we know of now) to know if an algorithm will model your data well is to try it out. While you may know some data-specific things that may help you select a machine learning algorithm before trying it out, you may be surprised by the results anyway. Come to the process with no preconceptions and you are likely to find the best answer.

Stay Tuned

Let’s take a breather. In the next post I’ll review the rest of the paper, outlining a few more of the things to do and the things to avoid when modeling your data. See you then!

20 comments

BB says:

February 16, 2013 at 3:06 am

What are some good resources for a beginner (to ML, not software dev) to learn ML to the point where he can start using it in non-trivial applications?

1. bigmlcom says:
  
  February 16, 2013 at 4:29 am
  
  This is a great class: https://www.coursera.org/course/ml and this is a great introductory book: http://www.amazon.com/Machine-Learning-Algorithmic-Perspective-Recognition/dp/1420067184
  
  1. Franklin says:
    
    February 21, 2013 at 4:58 am
    
    Andrew Ng’s course was fantastic, indeed. Very clear explanations.
  2. Nicoleta says:
    
    February 22, 2013 at 11:35 am
    
    I just enrolled in this class a few days ago, and I cannot wait for it to start!
Phi says:

February 22, 2013 at 9:40 am

The thing I’ve wondered is how do chatbots work? How do you provide textual input to AI, which only takes and outputs numerical data?

1. charleslparker says:
  
  February 22, 2013 at 1:08 pm
  
  Hey Phi, while the level of sophistication of chatbots varies wildly depending on the programmer, a lot of the good ones rely on techniques in a subfield of AI called “natural language processing”. Many of these techniques do exactly what you are thinking: Convert textual input to numerical data. There are dozens of ways of doing this, and each has a particular usefulness. We’re working right now on adding some into our backend at BigML.
  
  There are some Coursera courses available for the very, very interested. It looks like Michael Collins is offering something this month, and he is well-known as a top expert in the field.
  
Abdullahi T. R. says:

February 22, 2013 at 12:25 pm

Interesting!

Richard says:

March 1, 2013 at 8:44 am

Great post.
Does BigML provide tools to use part of my data set for training and another part for testing? Or to generate the model from one set of data and then run predictions on another set?
Thanks

1. charleslparker says:
  
  March 1, 2013 at 1:39 pm
  
  Hi Richard. Yes, we do, though it requires a few clicks right now. If you want to create a model on, say, 80% of the dataset and evaluate on the other 20%:
  
  1.) When you create the model, open up model configuration/sampling configuration, and select a rate of 80%.
  
  2.) When you do the evaluation, select the same dataset you used for the model, again set the sampling rate to 80%, and under “advanced sampling”, set:
  
  Sampling = deterministic
  Replacement = No
  Out-of-bag = Yes
  
  The “out-of-bag” parameter means that you’ll be evaluating on the instances you didn’t see in training.
  
  Because this is such a common case, we’re thinking of ways we can enable easier access to it in the interface, but this works fine in the short term. Thanks for your interest!
  