In the first in this series of posts, I discussed a bit about why deep learning isn’t fundamentally different from the rest of machine learning, and why that lack of a difference implies many contexts in which deep networks will fail to perform at a human level of cognition, for a variety of reasons.
Is there no difference, then, between deep learning and the rest of machine learning? Is Stuart Russell right that deep learning is nothing more than machine learning with faster computers, more data, and a few clever tricks, and is Rich Sutton right that such things are the main drivers behind machine learning’s recent advances?
Certainly, there’s a sense in which that’s true, but there’s also an important sense in which it’s not. More specifically, deep learning as it stands now represents an important advance in machine learning for reasons mostly unrelated to the access to increasing amounts of computation and data. In the last post, we covered how understanding deep learning is the same as the rest of machine learning is the key to knowing some of the problems that deep learning does not solve. Here, let’s try to understand how deep learning is different and maybe along the way we’ll find some problems that it solves better than anything else.
How To Create a Machine Learning Algorithm
Before talking about machine learning, it’s important to know how existing machine learning algorithms have been created by the academic community. Oversimplifying dramatically, there’s usually a two-step process:
- Come up with some objective function
- Find a way to mathematically optimize that function with respect to training data
In machine learning parlance, usually, the objective function is basically any way of measuring the performance of your classifier on some data. So one objective function would be “What percent of the time does my classifier predict the right answer?”. When we optimize that objective, we mean that we learn the classifier’s internal workings so that it performs well on the training data by that measurement. If the measurement were “percent correct” as above, it means that we learn a classifier that gets the right answer on the training data all of the time, or as close to it as possible.
An easy, concrete example is ordinary least squares regression: The algorithm seeks to learn weights to minimize the squared difference between the model’s predictions and the truth in the data. There’s the objective function (the sum of the squared differences) and the method used to optimize it (ordinary least squares).
There are a lot of different variations on this theme. For example, most versions of support vector machines for classification are based on the same basic objective, but there are a lot of algorithms to optimize that objective, from quadratic programming approaches like the simplex and interior point methods to cutting plane algorithms, to sequential minimal optimization. Decision trees admit a lot of different criteria to optimize (information gain, Gini index, etc.), but given that criteria, the method of optimizing it is basically the same.
Importantly, these two steps don’t usually happen in sequence because one depends on the other: When you’re thinking of an objective function, you immediately put aside ones that you know you can’t mathematically optimize, because there’s really no use for such a function that you can’t optimize against the data that you have. It turns out that the majority of objective functions you’d like can’t be efficiently optimized mathematically, and in fact, much of the machine learning academic literature is devoted to finding clever objective functions or new and faster ways of optimizing existing ones.
From Algorithms to Objectives
Now we can understand the first-way deep learning is rather different from the usual way of doing machine learning. In deep learning, we typically use only one basic way of optimizing the objective function, and that’s using a family of algorithms known collectively as gradient descent. Gradient descent means a bunch of different things, but they all rely on knowing the gradient of the objective function with respect to every parameter in the model. This, of course, means calculus and calculus is hard.
In deep networks, this is even harder because of the variety of things like activations functions, topologies, and types of connections. The programmer needs to know the derivative of the objective with respect to every parameter in every possible topology in order to optimize it, then they also need to know the ins and outs of all of the gradient descent algorithms they want to offer. Engineering-wise, it’s a nightmare!
The saving grace here is that the calculus is very mechanical: Given a network topology in all of its gory detail, the process of calculating the gradients is based on a set of rules and it’s not a difficult calculation, just massive and tedious and prone to small mistakes. So a whole bunch of somebodies finally buckled down, got the collection of rules together and turned that tedious process into computer code. The upshot is that now you can use programming to specify any type of network you want, including some objective, then just pass in the data and all of the calculus is figured out for you.
This is called automatic differentiation, and it’s the main thing that separates deep learning frameworks like Theano, Torch, Keras, and so on from the rest of computation. You just have to be able to specify your network and your objective function in one of these frameworks, and the software “knows” how to optimize it.
Put another way, instead of telling the computer what to do you’re now just telling it what you want.
More Than What You Need
It’s hard to overstate how much more flexible this is as an engineering approach than what we used to do. Before deep learning, you’d look at the rather small list of problems that machine learning was able to solve and try to cram your specific problem into one of those bins. Is this a classification or regression problem? Use boosting or SVMs. Is it a clustering problem? Use k-means or g-means. Is it metric learning? LMNN or RCA! Collaborative filtering? SVD or PLSI! Label sequence learning? HMMs or CRFs or even M3Ns! Depending on what your inputs and desired outputs look like, you choose an acronym and off you go.
This works well for a lot of things, but for others, there’s just not a great answer. What if your input is a picture and your output is a text description of that picture? Or what if your input is a sequence of words and your output is that sequence in a different language? Before, you’d say things like, “Well, even though it’s sequences, this is kind of a classification problem, so we can preprocess the input with a windowing function, then learn a classifier, then transform the outputs, then learn another classifier on those outputs, then apply some sort of smoothing” and so on. The alternative to this sort of shoehorning the problem into an existing bag was to design a learning algorithm from scratch, but both options come to much the same thing: A big job for someone with loads of domain knowledge and machine learning expertise.
With a deep learning framework, you just say “this is the input, here’s what the output should be, and here’s how to measure the difference between the model’s output and what it should be. Give me a model.” Does this always work? Certainly not; all of those acronyms are not going away anytime soon. But the fact that deep learning can often be made to work for problems of any of the types above with very little in the way of domain knowledge or mathematical insight, that’s a powerful new capability.
Perhaps you can also see how much more extensible this is vs. the previous way of doing things. Any operation that is differentiable can be “thrown onto the pile”, its rules added to the list of rules for differentiation and then can become a part of any network structure. One very important operation currently on the pile is convolution, which is the current star of the deep learning show. This isn’t a scientific innovation; that convolution kernels can be learned via gradient descent is almost 30 year-old news, but in the context of an engine that can automatically differentiate the parameters of any network structure, you end up using them in combination with things like residual connections and batch normalization which pushes their performance to new heights.
Maybe just as important as the flexibility of deep learning frameworks is the fact that the gradient descent typically happens in tiny steps, where a small slice of the training data updates the classifier at each one. This may not seem like a big deal, but it means that you can take advantage of effectively an infinite amount of training data. This allows you to use a simulator to augment your training data and make your classifier more robust, which is a natural fit in areas like computer vision and game playing. Other machine learning algorithms have the same sort of update behavior, but given the complexity of the networks needed to solve some problems and the amount of data needed to properly fit them, data augmentation becomes a game-changing necessity.
You’re probably starting to realize now that “Deep Learning” isn’t really a machine learning algorithm, per se. Closer to the truth is that it’s a language for expressing machine learning algorithms, and that language is still getting more expressive all the time. This is partially what our own Tom Dietterich means when he says that we don’t really know what the limits are for deep learning yet. It’s tough to see where the story ends if only because its authors are still being provided with new words they could say. To say that something will never be expressible in an evolving language such as this one seems premature.
Now, even considering the huge caveat that is the first post in this series, I’d like to put forth a couple of additional grains of salt. First, the above narrative gives the impression of complete novelty to the compositional nature of deep learning, but that’s not entirely so. There have been previous attempts in the machine learning literature to create algorithmic frameworks of this sort. One such attempt that comes to mind is Thorsten Joachims’ excellent work developing SVM-struct, which does for support vector machines much of what automatic differentiation does for deep learning. While SVM-struct does allow you to attack a diverse set of problems, it lacks the ability to incorporate new operators that constantly expand the space of possible internal structures for the learned model.
Second, admittedly, I may have oversimplified things a bit. The complexity of creating or selecting a problem-specific machine learning algorithm has not disappeared entirely. It’s just been pushed into the design of the network architecture: Instead of having a few dozen off the shelf algorithms that you must understand, each with five or ten parameters, you’ve now got the entire world of deep learning to explore with each new problem you encounter. For problems that already have well-defined solutions, deep learning’s flexibility can be more of a distraction than an asset.
The Proof Is In The Pudding
All of that said, it would be silly to talk about any of this if it hadn’t led to some absurd gains in performance in marquee problems. I’m old enough to have been a young researcher in 2010, when state-of-the-art results were claimed on CIFAR-10 at just over 20% error. Now, multiple papers claim error rates under 3% and even more claim sub-20% error on ImageNet, which has 100 times as many classes. We see similar sorts of improvement in object detection, speech recognition, and machine translation.
In addition, this formalism allows us to apply machine learning directly to a significantly broader set of possible problems like question answering, or image denoising, or <deep breath> generating HTML from a hand-drawn sketch of a web page.
Even though I saw that last one for the first time a year ago, it still makes me a bit dizzy to think about it. To someone in the field who spent countless hours learning how to engineer the domain-specific features needed to solve these problems, and countless more finessing classifier outputs into something that actually resembled a solution, seeing these algorithms work, even imperfectly, borders on a religious experience.
Deep learning is revolutionary from a number of angles, but most of those are viewable primarily from the inside of the field rather than the outside. Deep learning as it is now is not an “I’m afraid I can’t do that”-level advance in the field. But for those of us slaving away in the trenches, it can look very special. Even the closing barb of Gary Marcus’ synced interview has a nice ring to it:
They work most of the time, but you don’t know when they’re gonna do something totally bizarre.
Really? Machine learning? Working most of the time? It’s music to my ears! I think Marcus is talking about driverless cars here, but roughly the same thing could be said of state-of-the-art speech recognition or image classification. The quote is an unintentional compliment to the community; Marcus doesn’t seem to be aware of how recent this level of success is, how difficult it was to get here, and how big of a part deep learning played in the most recent ascent. Why yes, it is working most of the time! Thank you for noticing!
With regard to the second half of that quote, clearly there’s work left to be done, but what does the rest of that work look like and who gets to decide? In the final post in this series, I’ll speculate on what the current arguments about machine learning say about its use and its future. So stay tuned…