Machine Learning Benchmarking: You’re Doing It Wrong

I’m not going to bury the lede: Most machine learning benchmarks are bad.  And not just kinda-sorta nit-picky bad, but catastrophically and fundamentally flawed. 

TL;DR: Please, for the love of statistics, do not trust any machine learning benchmark that:

  1. Does not split training and test data
  2. Does not use the identical train/test split(s) across each algorithm being tested
  3. Does not do multiple train/test splits
  4. Uses fewer than five different datasets without appropriately qualifying its conclusions
  5. Uses fewer than three well-established metrics to evaluate the results without appropriately qualifying its conclusions
  6. Relies on one of the services / software packages being tested to compute quality metrics
  7. Does not use the same software to compute all quality metrics for all algorithms being tested

Feel free to hurl accusations of bias my way. After all, I work at a company that’s probably been benchmarked a time or two. But these are rules I learned the hard way before I started working at BigML, and you should most certainly adopt them and avoid my mistakes.

Now let’s get to the long version.

Habits of Mind

The term “data scientist” has had its fair share of ups and downs over the past few years. It can simultaneously describe a person whose technical skills are in high demand and serve as a code word for an expensive charlatan. Just the same, I find it useful, not so much as a description of a skill set, but as a reminder of one quality you must have in order to be successful when trying to extract value from data. You must have the habits of mind of a scientist.

What do I mean by this? Primarily, I mean the intellectual humility necessary to be one’s own harshest critic. To treat any potential success or conclusion as spurious and do everything possible to explain it away as such. Why? Because often that humility is the only thing between junk science and a bad business decision. If you don’t expose the weaknesses of your process, putting it into production surely will.

Few places make this more obvious than the benchmarking of machine learning software, algorithms, and services, where weak processes seem to be the rule rather than the exception. Let’s start with a benchmarking fable.

A Tale Of Two Coders

Let’s say you are the CEO of a software company composed of you and two developers. You just got funding to grow to 15. Being the intrepid data scientist that you are, you gather some data on your two employees.

First, you ask each of them a question: “How many lines of code did you write today?”

“About 200,” says one.

“About 300,” says the other.

You lace your fingers and sit back in your chair with a knowing smile, confident you have divined which is the more productive of the two employees. To uncover the driving force behind this discrepancy, you examine the resumes of the two employees. “Aha!” you say to yourself, the thrill of discovery coursing through your veins. “The superior employee is from New Jersey and the other is from Rhode Island!” You promptly go out and hire 12 people from New Jersey, congratulating yourself the entire time on your principled, data-driven hiring strategy.

Of course, this is completely crazy. I hope that no one in their right mind would actually do this. Anyone who witnessed or read about such a course of action would understand how weak the drawn conclusions are.

And yet I’ve seen a dozen benchmarks of machine learning software that make at least one of the same mistakes.  These mistakes generally fall into one of three categories that I like to think of as the three-legged stool for good benchmarking: 

  • Replications: The number of times each test is replicated to account for random chance
  • Datasets: The number and nature of datasets that you use for testing
  • Metrics: The way you measure the result of the test

Let’s visit these in reverse order with our fable in mind.

#3: Metrics

Probably the biggest problem most developers would have with the above story is the use of “lines of code generated” as a metric to determine developer quality. These people aren’t wrong: Basically everyone concludes that it is a terrible metric.

I wish that people doing ML benchmarks could mount this level of care for their metric choices. For instance, how many of the people who regularly report results in terms of area under an ROC curve (AUC) are aware that there is research showing that the metric is mathematically incoherent? Or that when you compare models using the AUC, you’ll often get results that are opposite those given by other, equally established metrics? There isn’t a broad mathematical consensus on the validity of the AUC in general, but the arguments against it are sound, and so if you’re making decisions based on AUC, you should at least be aware of some of the counter-arguments and see if they make sense to you.

And the decision to use or not use an individual metric isn’t without serious repercussions. In my own academic work prior to joining BigML, I found that, in a somewhat broad test of datasets and classifiers, I could choose a metric that would make a given classifier in a paired comparison seem better than the other in more than 40% of possible comparisons (out of tens of thousands)!  The case where all metrics agree with one another is rarer than you might think, and when they don’t agree the result of your comparison hinges completely on your choice of metric.

The main way out of this is to be either more or less specific about your choice. You might be more specific in cases where you have a very good idea of what your actual, real-world loss or cost function is. You might, for example, know the exact values of a cost matrix for your predictions. In this case, you can just use that as your metric and it doesn’t matter if this metric is good or bad in general; it’s by definition perfect for this problem.
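As a minimal sketch of using a known cost matrix as the metric: the cost values, the two sets of predictions, and the names `COST` and `total_cost` are all invented for illustration. The point is only that a model with lower accuracy can still be the cheaper one.

```python
# Hypothetical cost matrix, indexed as COST[actual][predicted].
# The numbers are illustrative assumptions, not real costs.
COST = {
    True:  {True: 0.0, False: 50.0},  # missing a true case costs 50
    False: {True: 5.0, False: 0.0},   # a false alarm costs 5
}

def total_cost(actual, predicted, cost=COST):
    """Sum the cost of every (actual, predicted) pair."""
    return sum(cost[a][p] for a, p in zip(actual, predicted))

def accuracy(actual, predicted):
    """Fraction of predictions that match the actual label."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

actual  = [True, True, False, False, False]
model_a = [True, False, False, False, False]  # one missed positive
model_b = [True, True, True, True, False]     # two false alarms

cost_a, cost_b = total_cost(actual, model_a), total_cost(actual, model_b)
acc_a, acc_b = accuracy(actual, model_a), accuracy(actual, model_b)

print(f"model A: accuracy={acc_a:.2f} cost={cost_a}")
print(f"model B: accuracy={acc_b:.2f} cost={cost_b}")
```

Under these made-up costs, model A has the higher accuracy but model B is five times cheaper, so the "better" model depends entirely on whether the cost matrix reflects your real problem.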

If you don’t know the particulars of your loss function in advance, another manner of dealing with this problem is to test multiple metrics. Use three or four or five different common metrics and make sure they agree at least on which algorithm is better. If they don’t, you might be in a case where it’s too close to call unless you’re more specific about what you want (that is, which metric is most appropriate for your application).
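One way to check whether several metrics agree is sketched below. The three metrics (accuracy, F1, balanced accuracy) are just common choices, and the toy labels and predictions are invented so that the metrics disagree; none of this comes from a particular benchmark.

```python
def confusion(actual, predicted):
    """Return (tp, tn, fp, fn) for boolean labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(a and p for a, p in pairs)
    tn = sum((not a) and (not p) for a, p in pairs)
    fp = sum((not a) and p for a, p in pairs)
    fn = sum(a and (not p) for a, p in pairs)
    return tp, tn, fp, fn

def accuracy(actual, predicted):
    tp, tn, fp, fn = confusion(actual, predicted)
    return (tp + tn) / (tp + tn + fp + fn)

def f1(actual, predicted):
    tp, tn, fp, fn = confusion(actual, predicted)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def balanced_accuracy(actual, predicted):
    tp, tn, fp, fn = confusion(actual, predicted)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return (tpr + tnr) / 2

METRICS = {"accuracy": accuracy, "f1": f1, "balanced_accuracy": balanced_accuracy}

def winners(actual, pred_a, pred_b):
    """Map each metric name to 'A', 'B', or 'tie'."""
    result = {}
    for name, metric in METRICS.items():
        score_a, score_b = metric(actual, pred_a), metric(actual, pred_b)
        result[name] = "A" if score_a > score_b else "B" if score_b > score_a else "tie"
    return result

# Invented, imbalanced toy data: 2 positives, 8 negatives.
actual = [True] * 2 + [False] * 8
pred_a = [False] * 10                       # always predicts the majority class
pred_b = [True] * 2 + [True] * 3 + [False] * 5  # finds both positives, 3 false alarms

by_metric = winners(actual, pred_a, pred_b)
metrics_agree = len(set(by_metric.values())) == 1
print(by_metric, "agree" if metrics_agree else "too close to call without a specific metric")
```

Here accuracy favors the model that never predicts a positive, while the other two metrics favor its rival, which is exactly the situation where the choice of metric decides the comparison.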

But there’s an even worse and more subtle problem with the scenario above. Notice that the CEO doesn’t independently measure the lines of code that each developer is producing. Instead, he simply asks them to report it. Again, an awful idea.  How do you know they’re counting in just the same way? How do you know they worked on things that were similarly difficult? How do you know neither of them is lying outright?

Metrics we use to evaluate machine learning models are comparatively well defined, but there are still corner cases all over the place. To take a simple example, when you compute the accuracy of a model, you usually do so with respect to some threshold on the model’s probability prediction. If the threshold is 0.5, then the logic is something like “If the predicted probability is greater than 0.5, predict true; if not, predict false”. But depending on the software, you might get “greater than or equal to” instead. If you’re relying on different sources to report metrics, you might hit these differences, and they might well matter.
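To make the threshold corner case concrete, here is a small sketch. The function name `accuracy_at_threshold`, its `inclusive` flag, and the four-row dataset are assumptions made up for this example; real tools rarely expose the choice this explicitly, which is part of the problem.

```python
def accuracy_at_threshold(actual, probabilities, threshold=0.5, inclusive=False):
    """Accuracy when 'predict true' means p > threshold (or p >= threshold if inclusive)."""
    if inclusive:
        predicted = [p >= threshold for p in probabilities]
    else:
        predicted = [p > threshold for p in probabilities]
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

# Invented data: two predictions sit exactly on the threshold.
actual = [True, True, True, False]
probabilities = [0.9, 0.5, 0.5, 0.2]

strict = accuracy_at_threshold(actual, probabilities, inclusive=False)
loose = accuracy_at_threshold(actual, probabilities, inclusive=True)
print(f"p > 0.5:  accuracy = {strict}")
print(f"p >= 0.5: accuracy = {loose}")
```

Same model, same data, and the reported accuracy moves from 0.5 to 1.0 depending on how ties at the threshold are handled. Predictions that land exactly on a threshold are not exotic; models that output a small set of distinct probabilities produce them routinely.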

It almost goes without saying, but the fix here is just consistency, and ideally objectivity. When you compare models from two different sources, make sure the tools you use for evaluation are the same, and ideally not ones provided by either of the sources being tested. It’s a pain, yes, but if you’re comparing weights there’s just no way around buying your own scale. There are plenty of open-source reference implementations of almost any metric you can think of. Use one.

#2: Datasets

For the sake of argument, though, let’s assume that you have a good metric for measuring developer productivity. You’re still only measuring performance on the one thing each of your developers did that day! What if they’re writing Python, and you’re hiring a JavaScript developer? What if you’re hiring a UI designer? What if you’re hiring a sales rep? Do you really think that the rules for finding a successful Python developer will generalize so far?

Generalization can be a dangerous business. Those who have practiced machine learning for long enough know this from hard experience. Which is why it’s infuriating to see someone test a handful of algorithms on one or two or three datasets and then make a statement like, “As you can see from these results, algorithm X is to be preferred for classification problems.”

No, that’s not what I can see at all. What I see is that (assuming you’ve done absolutely everything else correctly), algorithm X performed better than the tested alternatives on one or two or three other problems. You might be tempted to think this is better than nothing, but depending on what algorithm you fancy you can *almost always* find a handful of datasets that show “definitively” that your chosen algorithm is the state of the art. In fact, go ahead and click on “generate abstract” on my ML benchmarking page to do exactly this!

This might seem unbelievable, but the reality is that supervised classification problems, though they seem similar enough on the surface, can be quite diverse mathematically. Dimensionality, decision boundary complexity, data types, label noise, class imbalance, and many other things make classification an incredibly rich problem space. Algorithms that succeed spectacularly with a dozen features fail just as spectacularly with a hundred. There’s a reason people still turn to logistic regression in spite of the superior performance of random forests and/or deep learning in the majority of cases: It’s because there are still a whole lot of datasets where logistic regression is just as good and tons faster. The “best thing” simply always has and always will depend on the dataset to which the thing is applied.

The solution here, as with metrics, is to be more or less specific. If you know basically the data shape and characteristics of every machine learning problem that you’ll face in your job, and you have a reasonably large collection of datasets lying around that is nicely representative of your problem space, then yes, you can use these to conduct a benchmark that will tell you what the best sort of algorithm is for this subset of problems.

If you want to know the best thing generally, you’ll have to do quite a bit more work. My benchmark uses over fifty datasets and I’m still not comfortable enough with its breadth to say that I’ve really uncovered anything that could be said about machine learning problems as a whole (besides that it’s breathtakingly easy to find exceptions to any proposed rule). And even if rules could be found, for how long would they hold? The list of machine learning use cases and their relative importance grows and changes every day. The truth about machine learning today isn’t likely to be the truth tomorrow.

#1: Replications

Finally, and maybe most obviously: The entire deductive process in the fable above is based on only a single day of data from two employees. Even the most basic mathematical due diligence would tell you that you can’t learn anything from so few examples.

Yet there are benchmarks out there that try to draw conclusions from a single training/test split on a single dataset. Making decisions based on a point estimate of performance derived from a not-that-big test set is a problem for statistical reasons that are not even all that deep. That is a shame, because single-holdout competitions like the sort that happen on Kaggle are implicitly training novice practitioners to do exactly this.

How do you remedy this?  The blog post above suggests some simple statistical tests you can do based on the number of examples in the test set, which is fine and good and way, way better than nothing.  When you’re evaluating algorithms or frameworks or collections of parameter settings rather than the individual models they produce, however, there are more sources of randomness than just the data itself.  There are, for example, things like random seeds, initializations, and the order in which the data is presented to the algorithm.  Tests based on the dataset don’t account for “luck” with those model-based aspects of training.
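The referenced post is not reproduced here, so the following is not necessarily its method. One common calculation based only on test-set size is a normal-approximation confidence interval for accuracy, sketched below with an assumed accuracy of 0.82 on an assumed test set of 100 examples.

```python
import math

def accuracy_interval(accuracy, n_test, z=1.96):
    """Normal-approximation interval for an accuracy measured on n_test examples.

    z=1.96 corresponds to roughly 95% coverage. The approximation is rough
    for small test sets or accuracies near 0 or 1.
    """
    standard_error = math.sqrt(accuracy * (1.0 - accuracy) / n_test)
    half_width = z * standard_error
    return accuracy - half_width, accuracy + half_width

# Assumed numbers for illustration only.
low, high = accuracy_interval(0.82, n_test=100)
print(f"0.82 on 100 test examples: roughly {low:.3f} to {high:.3f}")
```

With only 100 test examples the interval spans about 15 points of accuracy, which is wider than the gap between most pairs of algorithms people try to rank. As the text notes, this accounts only for sampling of the test data, not for seeds, initializations, or data ordering.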

There aren’t any perfect ways around this, but you can get a good part of the way there by doing a lot of train/test splits (several runs of cross-validation, for example), and varying the randomized parts of training (seed, data ordering, etc.) with each split.  After you’ve accumulated the results, you might be tempted to average together these results and then choose the algorithm with the higher average, but this obscures the main utility of doing multiple estimates, which is that you get to know something about the distribution of all of those estimates.
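A bare-bones harness for this might look like the sketch below. Everything in it is an assumption for illustration: the function names, the two toy learners, and the synthetic one-feature dataset. The toy learners happen to be deterministic and ignore their `seed` argument; a real learner would use it for initialization or data ordering.

```python
import random

def make_splits(n_rows, n_splits, test_fraction=0.2, seed=0):
    """Generate all (train_indices, test_indices) pairs once, up front."""
    rng = random.Random(seed)
    n_test = int(round(n_rows * test_fraction))
    splits = []
    for _ in range(n_splits):
        indices = list(range(n_rows))
        rng.shuffle(indices)
        splits.append((indices[n_test:], indices[:n_test]))
    return splits

def fit_majority(train, seed):
    """Toy learner: always predict the most common training label."""
    labels = [y for _, y in train]
    majority = labels.count(True) >= labels.count(False)
    return lambda x: majority

def fit_midpoint(train, seed):
    """Toy learner: cut halfway between the two class means."""
    pos = [x for x, y in train if y]
    neg = [x for x, y in train if not y]
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: x > cut

def benchmark(algorithms, rows, labels, splits):
    """Score every algorithm on every split; return {name: [accuracy per split]}."""
    results = {name: [] for name in algorithms}
    for split_number, (train_idx, test_idx) in enumerate(splits):
        train = [(rows[i], labels[i]) for i in train_idx]
        test = [(rows[i], labels[i]) for i in test_idx]
        for name, fit in algorithms.items():
            # Every algorithm sees the identical split; the training seed
            # changes from split to split to vary model-side randomness.
            model = fit(train, seed=split_number)
            correct = sum(model(x) == y for x, y in test)
            results[name].append(correct / len(test))
    return results

# Synthetic data: one numeric feature, label is True when the feature is >= 50.
rows = list(range(100))
labels = [x >= 50 for x in rows]
splits = make_splits(len(rows), n_splits=5, seed=42)
results = benchmark({"majority": fit_majority, "midpoint": fit_midpoint},
                    rows, labels, splits)
for name, scores in results.items():
    print(name, scores)
```

The design choice that matters is that `make_splits` runs once and its output is shared, so no algorithm gets an easier test set than another. Keeping the full list of per-split scores, rather than only an average, is what makes the distribution visible afterward.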

Suppose, for example, you have a dataset of 500 points. You do five 80%/20% training/test splits of the data, and measure the performance on each split with two different algorithms (of course, you’re using the exact same five splits for each algorithm, right?):

Algorithm 1: [0.75, 0.9, 0.7, 0.85, 0.9].  Average = 0.820

Algorithm 2: [0.73, 0.84, 0.91, 0.74, 0.89].  Average = 0.822

Sure, the second algorithm has better average performance, but given the swings from split to split, this performance difference is probably just an artifact of the overall variance in the data. Stated another way, it’s really unlikely that two algorithms are going to perform identically on every split, so one or the other of them will almost certainly end up being “the winner”. But is it just a random coin flip to decide who wins? If the split-to-split variance is high relative to the difference in performance, it gives us a clue that it might be.
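Because both algorithms were scored on the same five splits, one quick way to see the coin-flip problem is to look at the per-split differences directly. This sketch uses the numbers from the example above and only the standard library; the variable names are arbitrary.

```python
from statistics import mean, stdev

algorithm_1 = [0.75, 0.90, 0.70, 0.85, 0.90]
algorithm_2 = [0.73, 0.84, 0.91, 0.74, 0.89]

# Paired differences: same split, algorithm 2 minus algorithm 1.
diffs = [b - a for a, b in zip(algorithm_1, algorithm_2)]
mean_diff = mean(diffs)      # about 0.002
spread_diff = stdev(diffs)   # about 0.12

wins_for_2 = sum(d > 0 for d in diffs)

print(f"mean difference:        {mean_diff:.3f}")
print(f"stdev of differences:   {spread_diff:.3f}")
print(f"splits won by algorithm 2: {wins_for_2} of {len(diffs)}")
```

The average gap is about 0.002 while the split-to-split spread of that gap is about 0.12, roughly sixty times larger. Algorithm 2 also "wins" on average despite losing four of the five splits, because a single split contributes the entire margin.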

Unfortunately, even if a statistical test shows that the groups of results are significantly different, this is still not enough by itself to declare that one algorithm is better than another (this would be abuse of statistical tests for several reasons).  However, the converse should be true: If one algorithm is truly better than another in any reasonable sense, it should certainly survive this test.
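The text does not name a specific test, so the choice here is an assumption: a paired sign-flip permutation test, which is simple enough to run exactly on five splits with the standard library. It asks how often a difference at least as large as the observed one would appear if the label "algorithm 1" or "algorithm 2" were assigned at random within each split. Scores from overlapping splits are not independent, so treat the result as a rough screen rather than a rigorous p-value.

```python
from itertools import product

def sign_flip_p_value(scores_a, scores_b, tolerance=1e-12):
    """Exact two-sided paired sign-flip permutation test on per-split scores."""
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    # Enumerate every way of swapping the two algorithms within each split.
    for signs in product((1, -1), repeat=len(diffs)):
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        # The tolerance guards against floating-point noise in the comparison.
        if flipped >= observed - tolerance:
            extreme += 1
        total += 1
    return extreme / total

algorithm_1 = [0.75, 0.90, 0.70, 0.85, 0.90]
algorithm_2 = [0.73, 0.84, 0.91, 0.74, 0.89]

p_value = sign_flip_p_value(algorithm_1, algorithm_2)
print(f"p-value: {p_value}")
```

For the five-split example above, every one of the 32 possible sign assignments produces a difference at least as large as the observed one, so the test gives no evidence at all that either algorithm is better. With only five splits the smallest p-value this test can ever produce is 2/32, so more replications are needed before it can detect anything.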

What, then, if this test fails?  What can we say about the performance of the two models?  This is where we have to be very careful. It’s tempting to dismiss the results by saying, “Neither, they’re about the same”, but the more precise answer is that our test didn’t give evidence that the performance of either one was better than the other.  It might very well be the case that one is better than the other and we just don’t have the means (the amount of data, the resources, the time, etc.) to do a test that shows it properly. Or perhaps you do have those means and you should avail yourself of them.  Beware the trap, however, of endless fiddling with modeling parameters on the same data.  For lots of datasets, real performance differences between algorithms are both difficult to detect and often too small to be important.

For me, though, the more interesting bit of this analysis is again the variance of the results. Above we have a mean performance of 0.82, with a range of 0.7 to 0.9.  That result is quite different to a mean performance of 0.82 with a range of 0.815 to 0.823. In the former case, you’d go to production having a good bit of uncertainty around the actual expected performance. In the latter, you’d expect the performance to be much more stable.  I’d say it’s a fairly important distinction, and one you can’t possibly see with a single split.
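A small helper that reports the spread next to the mean makes this distinction hard to miss. In the sketch below, `noisy` is Algorithm 1 from the example above, while `stable` is a set of hypothetical scores invented to match the 0.815 to 0.823 range mentioned in the text; the function name `summarize` is likewise just for illustration.

```python
from statistics import mean, stdev

def summarize(scores):
    """Report the spread of per-split scores alongside their mean."""
    return {
        "mean": mean(scores),
        "min": min(scores),
        "max": max(scores),
        "stdev": stdev(scores),
    }

noisy = [0.75, 0.90, 0.70, 0.85, 0.90]          # Algorithm 1 from the example
stable = [0.815, 0.823, 0.821, 0.820, 0.821]    # hypothetical, invented scores

for name, scores in (("noisy", noisy), ("stable", stable)):
    s = summarize(scores)
    print(f"{name:>6}: mean={s['mean']:.3f} "
          f"range=[{s['min']:.3f}, {s['max']:.3f}] stdev={s['stdev']:.3f}")
```

Both sets of scores average 0.82, but the standard deviation of the first is about thirty times that of the second. Reporting only the mean would make them look identical.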

There are many cases in which you can’t know with any reasonable level of certainty if one algorithm is better than another with a single train/test split. Unless you have some idea of the variance that comes with a different test set, there’s no way to say for sure if the difference you see (and you will very likely see a difference!) might be a product of random chance.

Emerge with The Good Stuff

I get it. I’m right there with you. When I’m running tests, I want so badly to get to “the truth”. I want the results of my tests to mean something, and it can be so, so tempting to quit early. To run one or two splits on a dataset and say, “Yeah, I get the idea.” To finish as quickly as possible with testing so you can move on to the very satisfying phase of knowing something that others don’t.

But as with almost any endeavor in data mining, the landscape of benchmarking is littered with fool’s gold. There are so very many tests one can do that are meaningless, where the results are quite literally worth less than doing no test at all. Only if one is careful about the procedure, skeptical of results, and circumscribed in one’s conclusions is it possible to sort out the rare truth from the myriad of ill-supported fictions.

