Machine Learning Automation: Beware of the Hype!
There’s a lot of buzz lately around “Automating Machine Learning”. The general idea here is that the work done by a Machine Learning engineer can be automated, thus freeing potential users from the tyranny of needing to have specific expertise.
Presumably, the ultimate goal of such automations is to make Machine Learning accessible to more people. After all, if a thing can be done automatically, that means anyone who can press a button can do it, right?
Maybe not. I’m going to make a three-part argument here that “Machine Learning Automation” is really just a poor proxy for the true goal of making Machine Learning usable by anyone with data. Furthermore, I think the more direct path to that goal is via the combination of automation and interactivity that we often refer to in the software world as “abstraction”. By understanding what constitutes a powerful Machine Learning abstraction, we’ll be in a better position to think about the innovations that will really make Machine Learning more accessible.
Automation and Interaction
I had the good fortune to attend NIPS in Barcelona this year. In particular, I enjoyed the (in)famous NIPS workshops, in which you see a lot of high quality work out on the margins of Machine Learning research. The workshops I attended while at NIPS were each excellent, but were, as a collection, somewhat jarringly at odds with one another.
In one corner, you had the workshops that were basically promising to take Machine Learning away from the human operator and automate as much of the process as possible. Two of my favorites:
- Towards an Artificial Intelligence for Data Science – What it says on the box, basically trying to turn Machine Learning back around on itself and learn ways to automate various phases of the process. This included an overview of an ambitious multi-year DARPA program to come up with techniques that automate the entire model building pipeline from data ingestion to model evaluation.
- BayesOpt – This is a subfield of Machine Learning in which we try to streamline the optimization of any parameterized process that you’d usually tune by trial and error. The central learning task: given all of the parameter sets you’ve tried so far, choose the next one to evaluate so that you have the best shot at finding the global optimum. Of course, Machine Learning algorithms are themselves parameterized processes tuned by trial and error, so these techniques can be applied to them. My own WhizzML script, SMACdown, is a toy version of one of these techniques that does exactly this for BigML ensembles.
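To make the loop concrete, here’s a minimal sketch of this style of optimizer in Python. This is not SMACdown itself: the expensive trial is a stand-in function, and a crude nearest-neighbor score with a distance-based exploration bonus stands in for a proper surrogate model such as a Gaussian process.

```python
def expensive_eval(x):
    """Stand-in for a costly trial, e.g. training a model with parameter x."""
    return -(x - 3.0) ** 2  # the true (hidden) optimum is at x = 3


def acquisition(x, history, kappa=1.0):
    """Crude upper-confidence score: the value at the nearest evaluated
    point, plus an exploration bonus that grows with distance from it."""
    nearest_x, nearest_y = min(history, key=lambda p: abs(p[0] - x))
    return nearest_y + kappa * abs(x - nearest_x)


def optimize(n_trials=20, lo=0.0, hi=10.0):
    # Seed with the endpoints, then repeatedly use everything tried
    # so far to choose the most promising next parameter to evaluate.
    history = [(lo, expensive_eval(lo)), (hi, expensive_eval(hi))]
    grid = [lo + i * (hi - lo) / 200 for i in range(201)]
    for _ in range(n_trials):
        x_next = max(grid, key=lambda x: acquisition(x, history))
        history.append((x_next, expensive_eval(x_next)))
    return max(history, key=lambda p: p[1])


best_x, best_y = optimize()
print(best_x, best_y)  # lands roughly near x = 3
```

The key structure is the loop: every evaluation so far informs the choice of the next point to try, which is what makes these methods more sample-efficient than grid or random search when each trial is expensive.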
In the other corner, you had several workshops on how to further integrate people into the Machine Learning pipeline, either by inserting humans into the learning process or finding more intuitive ways of showing them the results of their learning.
- The Future of Interactive Learning Machines – This workshop featured a panoply of human-in-the-loop learning settings, from humans giving suggestions to Machine Learning algorithms, to machine-learned models trying to teach humans. There was, in particular, an interesting talk on using reinforcement learning to help teachers plan lessons for children, which I’ll reference below.
- Interpretable Machine Learning for Complex Systems – This workshop featured a number of talks on ways to allow humans to better understand what a classifier is doing, why it makes the predictions it does, and how best to understand what data the classifier needs to do its job better.
So what is going on here? It seems like we want Machine Learning to be automatic . . . but we also want to find ways to keep people closely involved? It is a strange pair of ideas to have at the same time. Of course, people want things automated, but why do they want to stay involved, and how do those two goals co-exist?
A great little call-and-response on this topic happened between two workshops as I attended them. Alex Wiltschko from Twitter gave an interesting talk on using Bayesian parameter optimization to optimize the performance of their Hadoop jobs (among other things) and he made a great point about optimization in general: If there’s a way to “cheat” your objective, so that the objective increases without making things intuitively “better”, the computer will find it. This means you need to choose your objective very carefully so the mathematical objective always matches your intuition. In his case, this meant a lot of trial and error, and a lot of consultations with the people running the Hadoop cluster.
An echo and example came from the other side of the “interactivity divide”, in the workshop on interactive learning. Emma Brunskill had put together a system that optimized the presentation of tutorial modules (videos, exercises, and so on) being presented to young math students. The objective the system was trying to optimize was something like the performance on a test at the end of the term. Simple enough, right? Except that one of the subjects being taught was particularly difficult. So difficult that few of the tutorial modules managed to improve the students’ scores. The optimizer, sensing this futility, promptly decided not to bother teaching this subject at all. This answer is of course unsatisfying to the human operator; the curriculum should be a constraint on the optimization, not a result of it.
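Brunskill’s actual system is far more sophisticated, but the failure mode is easy to reproduce with a toy greedy planner. All subject names and score gains below are invented to mirror the anecdote: one easy subject whose modules help a lot, and one hard subject whose modules barely move scores.

```python
# Invented per-module expected test-score gains for two subjects.
expected_gain = {
    "fractions":     [5.0, 4.0, 3.0],  # easy: modules raise scores a lot
    "word_problems": [0.2, 0.1, 0.1],  # hard: modules barely move the needle
}


def plan(budget, gains):
    """Greedy planner: fill every slot with the module promising the
    largest score gain, with no notion of curriculum coverage."""
    pool = sorted(
        ((g, subj) for subj, mods in gains.items() for g in mods),
        reverse=True,
    )
    return [subj for _, subj in pool[:budget]]


def plan_with_coverage(budget, gains, min_per_subject=1):
    """Same objective, but the curriculum is a constraint: every subject
    is guaranteed min_per_subject slots before greedy filling resumes."""
    chosen, pool = [], []
    for subj, mods in gains.items():
        ranked = sorted(mods, reverse=True)
        chosen += [(g, subj) for g in ranked[:min_per_subject]]
        pool += [(g, subj) for g in ranked[min_per_subject:]]
    pool.sort(reverse=True)
    chosen += pool[:budget - len(chosen)]
    return [subj for _, subj in chosen]


print(plan(3, expected_gain))                # the hard subject is dropped entirely
print(plan_with_coverage(3, expected_gain))  # the hard subject keeps its slot
```

The first planner reproduces the unsatisfying answer: maximizing raw score gain means never scheduling the hard subject at all. The second treats coverage as a constraint rather than an outcome, which is the distinction the human operator had to supply.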
Crucially though, there’s no way the computer could know this is the case without the human operator telling it so. And there’s no way for the human to know that the computer needs to know this unless the human is in the loop.
Herein lies the tension between interactivity and automation.
On one hand, people want and need many of the tedious and unnecessary details around Machine Learning to be automated away; often such details require expertise and/or capital to resolve appropriately and end up as barriers to adoption.
On the other, people still want and need to interact with Machine Learning so they can understand what the “Machine” has learned and steer the learning process towards a better answer if the initial one is unsatisfactory. Importantly, we don’t need to invoke a luddite-like mistrust of technology to explain this point of view. The reality is that people should be suspicious of the first thing a Machine Learning algorithm spits out, because the numerical objective that the computer is trying to optimize often does not match the real-world objective. Once the human and machine agree precisely on the nature of the problem, Machine Learning works amazingly well, but it sometimes takes several rounds of interaction to generate an agreement of the necessary precision.
Said another way, we don’t need Machine Learning that is “automatic”. We need Machine Learning that is comfortable and natural for humans to operate. Automating away laborious details is only a small part of this process.
If this sounds familiar to those of you in the software world, it’s because we’ve been here many times before.
From Automation to Abstraction
In the software world, we often speak in terms of abstractions. A good software library or programming language will hide unnecessary details from the user, exposing only the modes of interaction necessary to operate the software in a natural way. We say that the library or language is a layer of abstraction over the underlying software.
For those of you unfamiliar with the concept, consider the C programming language. In C, we can write a statement like this:
x = y + 3;
The C compiler converts this statement to machine code, which requires knowing where in memory the x and y variables are, loading those variables into registers, loading the binary value for “3” into a register, summing the values into a register, storing the result back to x’s location in memory, and so on.
The language hides machine code and registers from us so we can think in terms of operators and variables, the primitives of higher level problems. Moreover, it exposes an interface (mathematical expressions, functions, structs, and so on) that allows us to operate the layer underneath in a way that’s more useful and natural than if we worked on the layer directly. In this sense, the C language is a very good abstraction: It hides many of the things we’re almost never concerned about, and exposes the relevant functionality in an easier-to-use way.
It’s helpful to think about abstractions in the same way we think about compression algorithms. They can be “strong”, so that they hide a lot of details, or “weak” so they hide few. They can also be “very lossy”, so that they expose a poor interface, up to “lossless”, where the interface exposed can do everything that the hidden details can do. The devil of creating a good abstraction is rather the same as creating a good compression algorithm: You want to hide as many unimportant details from your users as possible, while hiding as little as possible that those same users want to see or use. The C language as an abstraction over machine code is both quite strong (hides virtually all of the details of machine code from the user) and near-lossless (you can do the vast majority of things in C that are possible directly via machine code).
The astute reader can likely see the parallel to our view of Machine Learning: we have the same tension here between hiding drudgery and complexity and still providing useful modes of interaction between tool and user. Where, then, does “Machine Learning Automation” stand on our invented scale of abstractions?
Automations Are Often Lossy and Weak Abstractions
The problem (as I see it) with some of the automations on display at NIPS (and indeed in the industry at large) is that they are touted using the language of abstraction. There are often claims that such software will “automate data science” or “allow non-experts to use Machine Learning”, or the like. This is exactly what you might say about the C language: that it “automates machine code generation” or “allows people who don’t know assembly to program”, and you would be right.
As an example of why I find this a bit disingenuous, consider using Bayesian parameter optimization to tune the parameters of Machine Learning algorithms, one of my favorite newish techniques. It’s a good idea, people in the Machine Learning community generally love it, and it certainly has the power to produce better models from existing software. But how good of an abstraction is it, on the basis of drudgery avoided and the quality of the interface exposed?
Put another way, suppose we implemented some of these parameter optimizations on top of, say, scikit-learn (and some people have). Now suppose there’s a user who wants to use this on data she has in a CSV file to train and deploy a model. Here’s a sample of some of the other details she’s worried about:
- Installing Python
- How to write Python code
- Loading a CSV in Python
- Encoding categorical / text / missing values
- Converting the encoded data to appropriate data structures
- Understanding something about how the learned model makes its predictions
- Writing prediction code around the learned model
- Writing/maintaining some kind of service that will make predictions on-demand
- Getting a sense of the learned model’s performance
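To see how little of her workload the parameter tuner actually touches, here’s a deliberately minimal pure-Python sketch of just a few items from that list. The file contents, column names, and the trivial majority-class “model” standing in for real learning are all invented for illustration.

```python
import csv
import io
from collections import Counter

# Stand-in for the user's CSV file; columns and values are invented.
raw = io.StringIO(
    "color,size,label\n"
    "red,small,yes\n"
    "blue,,no\n"        # a missing value she has to handle somehow
    "red,large,yes\n"
)

# Loading a CSV.
rows = list(csv.DictReader(raw))

# Encoding categorical / missing values (one-hot; blanks become "missing").
def encode(row, fields=("color", "size")):
    return {f + "=" + (row[f] or "missing"): 1 for f in fields}

X = [encode(r) for r in rows]
y = [r["label"] for r in rows]

# "Training" -- a trivial majority-class model stands in for real learning.
majority = Counter(y).most_common(1)[0][0]

# Writing prediction code around the learned model (and, in real life,
# wrapping it in an on-demand service, monitoring and evaluating it...).
def predict(row):
    _ = encode(row)  # features are ignored by this stand-in model
    return majority

print(predict({"color": "blue", "size": "small"}))
```

Every commented step above is a chore from the list, and automated parameter tuning helps with none of them.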
Of course, things get even more complicated at scale, as is their wont:
- Get access to / maintain a cluster
- Make sure that all cluster nodes have the necessary software
- Load your data onto the cluster
- Write cluster-specific software
- Deal with cluster machine / job limitations (e.g., lack of memory)
This is what I mean when I say Machine Learning automations are often weak abstractions: They hide very few details and provide little in the way of a useful interface. They simply don’t usually make realistic Machine Learning much easier to use. Sure, they prevent you from having to hand-fit maybe a couple dozen parameters, but the learning algorithm is already fitting potentially thousands of parameters. In that context, automated parameter tuning, or algorithm selection, or preprocessing doesn’t seem like it’s the thing that suddenly makes the field accessible to non-experts.
In addition, the abstraction is also “lossy” under our definition above; it hides those parameters, but usually doesn’t provide any sort of natural way for people to interact with the optimization. How good is the solution? How well does that match the user’s internal notion of “good”? How can you modify it to do better? All of those questions are left unanswered. You are expected to take the results on faith. As I said earlier, that might not be a good idea.
A Better Path Forward
So why am I beating on Bayesian parameter optimization? I said that I think it’s awesome and I really do. But I don’t buy that it’s going to be the thing that brings the masses to Machine Learning. For that, we’re going to need proper abstractions; layers that hide details like those above from the user, while providing them with novel and useful ways to collaborate with the algorithm.
This is part of the reason we created WhizzML and Flatline, our DSLs for Machine Learning workflows and feature transformation. Yes, you do have to learn the languages to use them. But once you do, you realize that the languages are strong abstractions over the concerns above. Hardware, software, and scaling issues are no longer any concern as everything happens on BigML-maintained infrastructure. Moreover, you can interact graphically with any resources you create via script in the BigML interface.
The goal of making Machine Learning easier to use by anyone is a good one. Certainly, there are a lot of Machine Learning sub-tasks that could bear automating, and part of the road to more accessible Machine Learning is probably paved with “one-click” style automations. I would guess, however, that the larger part is paved with abstractions; ways of exposing the interactions people want and need to have with Machine Learning in an intuitive way, while hiding unnecessary details. The research community is right to want both automation and interactivity: If we’re clever and careful we can have it both ways!