In my previous two posts in this series, I’ve essentially argued both sides of the same issue. In the first, I explained why deep learning is not a panacea, when machine learning systems (now and likely always) will fail, and why deep learning in its current state is not immune to these failures.

In the second post, I explained why deep learning, from the perspective of machine learning scientists and engineers, is an important advance: Rather than a learning algorithm, deep learning gives us a flexible, extensible framework for specifying machine learning algorithms. Many of the algorithms so far expressed in that framework give orders of magnitude-level improvement on the performance of previous solutions. In addition, it’s a tool that allows us to tackle some problems heretofore unsolvable directly by machine learning methods.

For those of you wanting a clean sound-byte about deep learning, I’m afraid you won’t get it from me. The reason I’ve written so much here is that I think nature of the advance that deep learning has brought to machine learning is complex and defies broad judgments, especially at this fairly early stage in its development. But I think it is worth it to take a step backward and try to understand which judgments are important and how to make them properly.

### Flaky Machines or Lazy People?

This series of posts was motivated in part by my encounters with Gary Marcus’ perspectives on deep learning. At the root of his positions is the notion that deep learning (and here he means “statistical machine learning”) is, in various ways, “not enough”. In his medium post, it’s “not enough” for general intelligence, and in the synced interview it’s “not enough” to be “reliable”.

This notion of whether current machine learning systems are “good enough” gets to the heart of the back and forth on deep learning. Marcus cites driverless cars as an example of how AI isn’t mature enough yet to rely on 100%, and that AI needs a “foundational change” to ensure a safe level of reliability. There’s a bit of ambiguity in the interview about what he means by AI, but my own impression is that this is less of a critique of machine learning, and more of a critique of the software around it.

For example, we have vision systems able to track and identify pedestrians on the road. These systems, as Marcus says, are mostly reliable but certainly make occasional mistakes. The job of academic and corporate researchers is to create these systems and make them as error-free as possible, but in the long run, they will always have some degree of unreliability.

Something consumes the predictions of these vision systems and acts accordingly; it is and always will be the job of that thing to avoid treating these predictions as the unvarnished truth. If the predictions were guaranteed to be correct, the consumer’s job would be much easier. As it is, consuming the predictions of a vision system requires some level of cleverness and skepticism. Maybe that cleverness involves awareness of separate sensor systems or other information streams like location and time of day. It might require symbolic approaches of the type Marcus favors. It might require more and very different deep learning, as Yann LeCun suggests. It might require something that’s entirely new.

Designing software that works properly with machine-learned models is hard. You have to do the difficult work of characterizing the model’s weaknesses and engineering around them. But critical readers should reject the notion that machine learning needs to provide extreme reliability on its own in order to be useful in mission critical situations. If a vision system can accurately find and track 95% of pedestrians, and other sensors and logic pick up the remaining 5%, you’ve arrived at “enough” without having a perfect model.

### When is “Enough” Enough?

So then the question becomes, “are we there yet?” with current ML systems. That depends, of course, on how good we think we need them to be for the engineers and domain experts to pull their outputs across the finish line. There are a lot of areas in which deep learning puts us in shouting distance, but it general, whether or not we’re there yet depends in turn on what you want the system to do and the quality of your engineers. When thinking about that question, though, it’s important to consider that the finish line might not be exactly where you think it is.

Consider the problem of machine translation. Douglas Hofstadter wrote a great article where he systematically lays bare the flaws in state-of-the-art machine translation systems. He’s right: For linguistic ideas with even a little complexity, they’re not great and are at times totally unusable. But the whole article reminded me of a blog post Hal Daumé III wrote more than 10 years ago when he and I were both recent Ph.D.’s In it, he wonders how much of human translation is better than computer translation when you really consider everything (street signs, menus, simple interpersonal interactions, and so on). Again, he asks this more than ten years ago.

The point here is that if machine translation for these things is already noticeably better than the second-rate human translations we apply in practice (or was ten years ago), there’s already a sense in which the models we have are very much good enough. How it deals with more complex phrases and ideas is an interesting question, and might yield new research directions, but this is all academic as far as its applicability is concerned. The existing technology, imperfect as it is, has a use and a place in society.

Even less relevant is how “deep” the model’s knowledge is, or how “stupid” it is, or whether the algorithm is “actually learning” (whatever that means). These are all flavors of the “computers don’t really understand what they’re doing” argument that traces its way through Hofstadter, John Searle,  Alan Turing and dozens of other philosophers all the way back to Ada Lovelace. There are loads of counter-arguments (I have even spun out a few of my own versions), but maybe the most compelling reason to ignore these questions is that the answers are often less interesting than the answer to the question, “Can we use it?”

A number of years ago, my wife and I hosted two members of a Belgian boys choir that was on tour. Neither she nor I spoke any French, so we relied on Google Translate to communicate with them. To this day, I remember typing “We made a pie.  Would you like some?” into my phone and watching their faces light up as the translation appeared. Did the computer understand anything about pie, or generosity, or the happiness of children, or how its own flawed translations could help create indelible memories?  Probably not.  But we did!

### The Final Exam

The criticism that machine learning is not enough on its own to produce systems that exhibit reliably intelligent behavior is a broken criticism. Deep learning gets us part of the way towards such systems, perhaps quite a lot of the way, but does anyone think it’s necessary or even advisable to cede the entire behavior of, say, a car to a machine-learned model? Saying no doesn’t mean backing away from a fully-autonomous car; as Marcus himself points out, there are other techniques in AI and software at large that are better suited to certain aspects of these problems. There can be many layers of human-comprehensible logic sitting between deep learning and the gas pedal, and it’s likely the totality of the system, rather than the learned component alone, that will display behavior that we might recognize as intelligent.

Is it a flaw or a problem with deep learning when it can’t solve the aspects of these problems that no one really wants or needs solved?  I don’t think so. Again, paraphrasing Marcus (and myself), machine learning is a tool. If you buy a nail gun and it jams, then yeah, that’s a problem with the nail gun, but if you try to use a nail gun to cut a piece of wood in half, that’s more of a problem with you. Deep learning is a very important step forward in the evolution of the tool (and a large one compared to other recent steps), but that step doesn’t change its fundamental nature. No matter what improvements you make, a nail gun is never going to become a table saw. Certainly, it’s unethical and bad business for tool manufacturers to make inflated claims about their tool’s usefulness, but it’s finally the job of the operator to determine which tool to use and how to use it.

Pundits can argue all day long about how impactful deep learning is and how smart machine learning can possibly be, but none of those arguments will matter in the long run. As I’ve said before, the only real test of the usefulness of machine learning is if domain experts and data engineers can leverage it to create software that has value for other human beings. Therein lies the power, the only real power, of new technology and the only goal that counts.

In the first in this series of posts, I discussed a bit about why deep learning isn’t fundamentally different from the rest of machine learning, and why that lack of a difference implies many contexts in which deep networks will fail to perform at a human level of cognition, for a variety of reasons.

Is there no difference, then, between deep learning and the rest of machine learning? Is Stuart Russell right that deep learning is nothing more than machine learning with faster computers, more data, and a few clever tricks, and is Rich Sutton right that such things are the main drivers behind machine learning’s recent advances?

Certainly, there’s a sense in which that’s true, but there’s also an important sense in which it’s not. More specifically, deep learning as it stands now represents an important advance in machine learning for reasons mostly unrelated to the access to increasing amounts of computation and data. In the last post, we covered how understanding deep learning is the same as the rest of machine learning is the key to knowing some of the problems that deep learning does not solve. Here, let’s try to understand how deep learning is different and maybe along the way we’ll find some problems that it solves better than anything else.

### How To Create a Machine Learning Algorithm

Before talking about machine learning, it’s important to know how existing machine learning algorithms have been created by the academic community. Oversimplifying dramatically, there’s usually a two-step process:

1. Come up with some objective function
2. Find a way to mathematically optimize that function with respect to training data

In machine learning parlance, usually, the objective function is basically any way of measuring the performance of your classifier on some data. So one objective function would be “What percent of the time does my classifier predict the right answer?”. When we optimize that objective, we mean that we learn the classifier’s internal workings so that it performs well on the training data by that measurement. If the measurement were “percent correct” as above, it means that we learn a classifier that gets the right answer on the training data all of the time, or as close to it as possible.

An easy, concrete example is ordinary least squares regression: The algorithm seeks to learn weights to minimize the squared difference between the model’s predictions and the truth in the data. There’s the objective function (the sum of the squared differences) and the method used to optimize it (ordinary least squares).

There are a lot of different variations on this theme. For example, most versions of support vector machines for classification are based on the same basic objective, but there are a lot of algorithms to optimize that objective, from quadratic programming approaches like the simplex and interior point methods to cutting plane algorithms, to sequential minimal optimization. Decision trees admit a lot of different criteria to optimize (information gain, Gini index, etc.), but given that criteria, the method of optimizing it is basically the same.

Importantly, these two steps don’t usually happen in sequence because one depends on the other: When you’re thinking of an objective function, you immediately put aside ones that you know you can’t mathematically optimize, because there’s really no use for such a function that you can’t optimize against the data that you have. It turns out that the majority of objective functions you’d like can’t be efficiently optimized mathematically, and in fact, much of the machine learning academic literature is devoted to finding clever objective functions or new and faster ways of optimizing existing ones.

### From Algorithms to Objectives

Now we can understand the first-way deep learning is rather different from the usual way of doing machine learning. In deep learning, we typically use only one basic way of optimizing the objective function, and that’s using a family of algorithms known collectively as gradient descent. Gradient descent means a bunch of different things, but they all rely on knowing the gradient of the objective function with respect to every parameter in the model. This, of course, means calculus and calculus is hard.

In deep networks, this is even harder because of the variety of things like activations functions, topologies, and types of connections. The programmer needs to know the derivative of the objective with respect to every parameter in every possible topology in order to optimize it, then they also need to know the ins and outs of all of the gradient descent algorithms they want to offer. Engineering-wise, it’s a nightmare!

The saving grace here is that the calculus is very mechanical: Given a network topology in all of its gory detail, the process of calculating the gradients is based on a set of rules and it’s not a difficult calculation, just massive and tedious and prone to small mistakes. So a whole bunch of somebodies finally buckled down, got the collection of rules together and turned that tedious process into computer code. The upshot is that now you can use programming to specify any type of network you want, including some objective, then just pass in the data and all of the calculus is figured out for you.

This is called automatic differentiation, and it’s the main thing that separates deep learning frameworks like Theano, Torch, Keras, and so on from the rest of computation. You just have to be able to specify your network and your objective function in one of these frameworks, and the software “knows” how to optimize it.

Put another way, instead of telling the computer what to do you’re now just telling it what you want.

### More Than What You Need

It’s hard to overstate how much more flexible this is as an engineering approach than what we used to do. Before deep learning, you’d look at the rather small list of problems that machine learning was able to solve and try to cram your specific problem into one of those bins. Is this a classification or regression problem? Use boosting or SVMs. Is it a clustering problem? Use k-means or g-means. Is it metric learning? LMNN or RCA! Collaborative filtering? SVD or PLSI! Label sequence learning? HMMs or CRFs or even M3Ns! Depending on what your inputs and desired outputs look like, you choose an acronym and off you go.

This works well for a lot of things, but for others, there’s just not a great answer. What if your input is a picture and your output is a text description of that picture? Or what if your input is a sequence of words and your output is that sequence in a different language? Before, you’d say things like, “Well, even though it’s sequences, this is kind of a classification problem, so we can preprocess the input with a windowing function, then learn a classifier, then transform the outputs, then learn another classifier on those outputs, then apply some sort of smoothing” and so on. The alternative to this sort of shoehorning the problem into an existing bag was to design a learning algorithm from scratch, but both options come to much the same thing: A big job for someone with loads of domain knowledge and machine learning expertise.

With a deep learning framework, you just say “this is the input, here’s what the output should be, and here’s how to measure the difference between the model’s output and what it should be. Give me a model.” Does this always work? Certainly not; all of those acronyms are not going away anytime soon. But the fact that deep learning can often be made to work for problems of any of the types above with very little in the way of domain knowledge or mathematical insight, that’s a powerful new capability.

Perhaps you can also see how much more extensible this is vs. the previous way of doing things. Any operation that is differentiable can be “thrown onto the pile”, its rules added to the list of rules for differentiation and then can become a part of any network structure. One very important operation currently on the pile is convolution, which is the current star of the deep learning show. This isn’t a scientific innovation; that convolution kernels can be learned via gradient descent is almost 30 year-old news, but in the context of an engine that can automatically differentiate the parameters of any network structure, you end up using them in combination with things like residual connections and batch normalization which pushes their performance to new heights.

Maybe just as important as the flexibility of deep learning frameworks is the fact that the gradient descent typically happens in tiny steps, where a small slice of the training data updates the classifier at each one. This may not seem like a big deal, but it means that you can take advantage of effectively an infinite amount of training data. This allows you to use a simulator to augment your training data and make your classifier more robust, which is a natural fit in areas like computer vision and game playing. Other machine learning algorithms have the same sort of update behavior, but given the complexity of the networks needed to solve some problems and the amount of data needed to properly fit them, data augmentation becomes a game-changing necessity.

You’re probably starting to realize now that “Deep Learning” isn’t really a machine learning algorithm, per se. Closer to the truth is that it’s a language for expressing machine learning algorithms, and that language is still getting more expressive all the time. This is partially what our own Tom Dietterich means when he says that we don’t really know what the limits are for deep learning yet. It’s tough to see where the story ends if only because its authors are still being provided with new words they could say. To say that something will never be expressible in an evolving language such as this one seems premature.

### Some Seasoning

Now, even considering the huge caveat that is the first post in this series, I’d like to put forth a couple of additional grains of salt. First, the above narrative gives the impression of complete novelty to the compositional nature of deep learning, but that’s not entirely so. There have been previous attempts in the machine learning literature to create algorithmic frameworks of this sort. One such attempt that comes to mind is Thorsten Joachims’ excellent work developing SVM-struct, which does for support vector machines much of what automatic differentiation does for deep learning. While SVM-struct does allow you to attack a diverse set of problems, it lacks the ability to incorporate new operators that constantly expand the space of possible internal structures for the learned model.

Second, admittedly, I may have oversimplified things a bit. The complexity of creating or selecting a problem-specific machine learning algorithm has not disappeared entirely. It’s just been pushed into the design of the network architecture: Instead of having a few dozen off the shelf algorithms that you must understand, each with five or ten parameters, you’ve now got the entire world of deep learning to explore with each new problem you encounter. For problems that already have well-defined solutions, deep learning’s flexibility can be more of a distraction than an asset.

### The Proof Is In The Pudding

All of that said, it would be silly to talk about any of this if it hadn’t led to some absurd gains in performance in marquee problems. I’m old enough to have been a young researcher in 2010, when state-of-the-art results were claimed on CIFAR-10 at just over 20% error. Now, multiple papers claim error rates under 3% and even more claim sub-20% error on ImageNet, which has 100 times as many classes. We see similar sorts of improvement in object detection, speech recognition, and machine translation.

In addition, this formalism allows us to apply machine learning directly to a significantly broader set of possible problems like question answering, or image denoising, or <deep breath> generating HTML from a hand-drawn sketch of a web page.

Even though I saw that last one for the first time a year ago, it still makes me a bit dizzy to think about it. To someone in the field who spent countless hours learning how to engineer the domain-specific features needed to solve these problems, and countless more finessing classifier outputs into something that actually resembled a solution, seeing these algorithms work, even imperfectly, borders on a religious experience.

Deep learning is revolutionary from a number of angles, but most of those are viewable primarily from the inside of the field rather than the outside. Deep learning as it is now is not an “I’m afraid I can’t do that”-level advance in the field. But for those of us slaving away in the trenches, it can look very special.  Even the closing barb of Gary Marcus’ synced interview has a nice ring to it:

They work most of the time, but you don’t know when they’re gonna do something totally bizarre.

Really? Machine learning? Working most of the time? It’s music to my ears! I think Marcus is talking about driverless cars here, but roughly the same thing could be said of state-of-the-art speech recognition or image classification. The quote is an unintentional compliment to the community; Marcus doesn’t seem to be aware of how recent this level of success is, how difficult it was to get here, and how big of a part deep learning played in the most recent ascent. Why yes, it is working most of the time! Thank you for noticing!

With regard to the second half of that quote, clearly there’s work left to be done, but what does the rest of that work look like and who gets to decide? In the final post in this series, I’ll speculate on what the current arguments about machine learning say about its use and its future. So stay tuned…

Gary Marcus has emerged as one of deep learning’s chief skeptics. In a recent interview, and a slightly less recent medium post, he discusses his feud with deep learning pioneer Yann LeCun and some of his views on how deep learning is overhyped.

I find the whole thing entertaining, but at many times LeCun and Marcus are talking past each other more than with each other. Marcus seems to me to be either unaware of or ignoring certain truths about machine learning and LeCun seems to basically agree with Marcus’ ideas in a way that’s unsatisfying for Marcus.

The temptation for me to brush 10 years of dust off of my professor hat is too much to ignore. Outside observers could benefit greatly from some additional context in this discussion and in this series of posts I’ll be happy to provide some. Most important here, in my opinion, is to understand where the variety of perspectives come from, and where deep learning sits relative to the rest of machine learning. Deep learning is both an incremental advance and a revolutionary one. It’s the same old stuff and something entirely new. Which one you see depends on how you choose to look at it.

### The Usual Awfulness of Machine Learning

Marcus’ post, The Deepest Problem with Deep Learning is written partly in response to Yoshi Bengio’s recent-ish interview with Technology Review. In the post, Marcus comes off as a bit surprised that Bengio’s tone about deep learning is circumspect about its long term prospects, and goes on to reiterate some of his own long-held criticisms of the field.

Most of Marcus’ core arguments about deep learning’s weaknesses are valid and maybe more uncontroversial than he thinks: All of the problems with deep learning that he mentions are commonly encountered by practitioners in the wild. His post doesn’t belabor these arguments. Instead, he spends a good deal of it suggesting that the field is either in denial or deliberately misleading the public about the strengths and weaknesses of deep learning.

Not only is this incorrect, but it also unnecessarily weakens his top line arguments. In short, the problems with deep learning are worse than Marcus’ post suggests, and they are problems that infect all of machine learning. Alas, “confronting” academics with these realities is going to be met with a sigh and a shrug, because we’ve known about and documented all of these things for decades. However, it’s more than possible that, with the increased publicity around machine learning in the last few years, there are people out there who are informed about the field at a high-level while only tangentially aware of its well-known limitations. Let’s review those now.

### What Machine Learning Still Can’t Do

By now, examples of CNN-based image recognition being “defeated” by various unusual or manipulated input data should be old news. While the composition of these examples is an interesting curiosity to those in the field, it’s important to understand why they are surprising to almost no one with a background in machine learning.

Consider the following fake but realistic dataset of eight people, in which we know the height, weight, and number of pregnancies for eight people, and we want to predict their sex based on those variables:

 Height (in.) Weight (lbs.) Pregnancies Sex 72 170 0 M 71 140 0 M 74 250 0 M 76 300 0 M 69 160 0 F 65 140 2 F 60 100 1 F 63 150 0 F

Any reasonable decision tree induction algorithm will find a concise classifier (Height > 70 = Male else Female) that classifies the data perfectly. The model is certainly not perfect, but also not a terrible one by ML standards, considering the amount of data we have. It will almost certainly perform much better than chance at predicting peoples’ sex in the real world. And yet, any adult human will do better with the same input data. The model has an obvious (to us) blind spot: It doesn’t know that people over 5’10” who have been pregnant at least one time are overwhelmingly likely to be female.

This can easily be phrased in a more accusatory way: Even when given training data about men and women and the number of pregnancies each person has had, the model fails to encode any information at all about which sex is more likely to get pregnant!

It sounds pretty damning in those words; the model’s “knowledge” turns out to be incredibly shallow. But this is not a surprise to people in the field. Machine learning algorithms are by design parsimonious, myopic, and at the mercy of the amount and type of training data that you have. More problems are exposed when we allow the case of adversarially selected examples, where you are allowed to present examples constructed or chosen to “fool” the model. I’ll leave it as an exercise for the reader to calculate how well the classifier would do on a dataset of WNBA players and Kentucky Derby jockeys.

### Enter Deep Learning, To No Fanfare At All

Deep learning is not different (at least in this way) from the rest of statistical learning: All of the adversarial examples presented in the image recognition literature are more or less the same as the 5’11” person who’s been pregnant; there was nothing like that in the dataset, so there’s no good reason to expect the model would get it right, despite the “obviousness” of the right answer to us.

There are various machine learning techniques for addressing bits and pieces of this problem, but in general, it’s not something easily solvable within the confines of the algorithm. This isn’t a “flaw” in the algorithm per se; the algorithm is doing what it should with the data that it has. Marcus is right when he says that machine-learned models will fail to generalize to out-of-distribution inputs, but, I mean, come on. That’s the i.i.d. assumption! It’s been printed right there on the tin for decades!

Marcus’ assertion that “In a healthy field, everything would stop when a systematic class of errors that surprising and illuminating was discovered” presupposes that researchers in the field were surprised or problems illuminated by that particular class of errors. I certainly wasn’t, and my intuition is that few in the field would be. On the contrary, if you show me those images without telling me the classifier’s performance, I’m going to say something like “that’s going to be a tough one for it to get right”.

In the back-and-forth on Twitter, Marcus seems stung that the community is “dismissive” of this type of error, and scandalized that the possibility of this type of error isn’t mentioned in the landmark Nature paper on deep learning, and herein, I think, lies the disconnect. For the academic literature, this is too mundane and well-known of a limit to bother stating. Marcus wants a field-wide and very public mea culpa for a precondition of machine learning that was trotted out repeatedly during our classes in grad school. He will probably remain disappointed. Few in the community will see the need to restate that limitation every time there’s a new advance in machine learning; the existence of that limit is a part of the context of every advance, as much as the existence of computers themselves.

For communications with the public at large outside of the field, though, perhaps Marcus is right that such limits could take center stage a bit more often (as Bengio rightly puts them in his interview). Yes, it’s true! You can almost always find a way to break a machine learning model by fussing with the input data, and it’s often not even very hard! One more time for the people in the back:

People who think deep learning is immune to the usual problems associated with statistical machine learning are wrong, and those problems mean that many machine learning models can be broken by a weak adversary or even subtle, non-adversarial changes in the input data.

This makes machine learning sound pretty crummy, and again elicits quite a bit of hand-wringing from the uninitiated.  There are breathless diatribes about how machine learning systems can be, horror of horrors, fooled into making incorrect predictions!  They’re not wrong; if you’re in a situation where you think such trickery might be afoot, that absolutely has to be dealt with somewhere in your technology stack.  Then again, this is so even if you’re not using machine learning.

Fortunately, there are many, many cases where this sort of brittleness is just not that much of a problem. In speech recognition, for example, there’s no one trying to “fool” the model and languages don’t typically undergo massive distributional changes or have particularly strange and crucial corner cases. Hence, all speech recognition systems use machine learning and the models do well enough to be worth billions of dollars.

Yes, all machine-learned models will fail somehow. But don’t conflate this failure with a lack of usability.

### Not Even Close

I won’t go as deeply into Marcus’ other points (such as the limits on the type of reasoning deep learning can do or its ability to understand natural language) in detail, but I found it interesting how closely those points coincide with someone else’s arguments about why “strong AI” probably won’t happen soon. That was written before I’d even heard of Gary Marcus and the relevant section is comprised mostly of ideas that I heard many times over the course of my grad school education (which is now far –  disturbingly far – in the past). Yes, these points are again valid, but among people in the field, they again have little novelty.

By and large, Marcus is right about the limitations of statistical machine learning, and anyone suggesting that deep learning is spectacularly different on these particular axes is at least a little bit misinformed (okay, maybe it’s a little bit different). For the most part, though, I don’t see experts in the field suggesting this.  Certainly not to the pathological levels hinted at by Marcus’ Medium post. I do readily admit the possibility that, amid the glow of high-profile successes and the public spotlight, that all of the academic theory and empirical results showing exactly how and when machine learning fails may get lost in the noise, and hopefully I’ve done a little to clarify and contextualize some of those failures.

So is that it, then? Is deep learning really nothing more than another thing drawn from the same bag of (somewhat fragile) tricks as the rest of machine learning?  As I’ve already said, that depends on how you look at it. If you look at it as we have in this post, yes, it’s not so very different. In my next post, however, we’ll take a look from another angle and see if we can spot some differences between deep learning and the rest of machine learning.

BigML has added multiple linear regression to its suite of supervised learning methods. In this sixth and final blog post of our series, we will give a rundown of the technical details for this method.

## Model Definition

Given a numeric objective field $y$, we model its response as a linear combination of our inputs $x_1,\cdots,x_n$, and an intercept value $\beta_0$.

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n = \beta_0 + \sum_{i=1}^n \beta_i x_i$

### Simple Linear Regression

For illustrative purposes, let’s consider the case of a problem with a single input. We can see that the above expression then represents a line with slope $\beta_1$ and intercept $\beta_0$.

$y = \beta_0 + \beta_1 x$

The task now is to find the values of $\beta_0, \beta_1$ that parameterize a line which is the best fit for our data. In order to do so we must obtain a metric which quantifies how well a given line fits the data.

Given a candidate line, we can measure the vertical distance between the line and each of our data points. These distances are called residuals. Squaring the residual for each data point and computing the sum, we get our metric.

$S = \sum_{i=1}^n (y_i - (\beta_0 + \beta_1 x_i))^2$

As one might expect, the sum of squared residuals is minimized when $\beta_0, \beta_1$ define a line that passes more or less thorough the middle of the data points.

### Multiple Linear Regression

When we deal with multiple input variables, it becomes more convenient to express the problem using vector and matrix notation. For a dataset with $n$ rows and $p$ inputs, define $\mathbf{y}$ as a column vector of length $n$ containing the objective values, $\mathbf{X}$ as a $n \times p$ matrix where each row corresponds to a particular input instance, and $\mathbf{\beta}$ as a column vector of length $p$ containing values of the regression coefficients. The sum of squared residuals can thus be expressed as:

$S = ||\mathbf{y - X\beta}||_2^2$

The value of $\mathbf{\beta}$ which minimizes this is given by the closed-form expression:

$\mathbf{\beta = (X^T X)^{-1} X^T y}$

The matrix inverse is the most computationally intensive portion of solving a linear regression problem. Rather than directly constructing the matrix $\mathbf{X}$ and performing the inverse, BigML’s implementation uses an orthogonal decomposition which can be incrementally updated with observed data. This allows for solving linear regression problems with datasets which are too large to fit into memory.

## Predictions

Predicting new data points with a linear regression model is just about as easy as it can get. We simply take the coefficients $\beta_0,\ldots,\beta_n$ from the model and evaluate the regression equation above to obtain a predicted value for $y$. BigML also returns two metrics that describe the quality of the prediction: the confidence interval and the prediction interval. These are illustrated in the following figure:

These two intervals carry different meanings. Depending on how the predictions are to be used, one will be more suitable than the other.

The confidence interval is the narrower of the two. It gives the 95% confidence range for the mean response. If you were to sample a large number of points at the same x-coordinate, there is a 95% probability that the mean of their y values will be within this range.

The prediction interval is the wider interval. For a single point at the given x-coordinate, its y value will be within this range with 95% probability.

## BigML Field Types and Linear Regression

In the regression equation, all of the input variables $x_n$ are numeric values. Naturally, BigML’s linear regression model also supports categorical, text, and items fields as inputs. If you have seen how our logistic regression models handle these inputs, then this will be mostly familiar, but there are a couple important differences.

### Categorical Fields

Categorical fields are transformed to numeric values via field codings. By default, linear regression uses a dummy coding system. For a categorical field with class values, there will be n-1 numeric predictor variables x. We designate one class value as the reference value (by default the first one in lexicographic order). Each of the predictors corresponds to one of the remaining class values, taking a value of 1 when that value appears and 0 otherwise. For example, consider a categorical field with values “Red”, “Green”, and “Blue”. Since there are 3 class values, dummy coding will produce 2 numeric predictors x1 and x2. Assuming we set the reference value to “Red”, each class value produces the following predictor values:

Field value x1 x2
Red 0 0
Green 1 0
Blue 0 1

Other coding systems such as contrast coding are also supported. For more details check out the API documentation.

### Text and Items Fields

Text and items fields are treated in the same fashion. There will be one numeric predictor for each term in the tag cloud/items list. The value for each predictor is the number of times that term/item occurs in the input.

### Missing Values

If an input field contains missing values in the training data then an additional binary-valued predictor will be created which takes a value of 1 when the field is missing and 0 otherwise.  The value for all other predictors pertaining to the field will be 0 when the field is missing. For example, a numeric field with missing values will have two predictors: one for the field itself plus the missing value predictor. If the input has a missing value for this field, then its two predictors will be (0,1), in contrast, if the field is not missing, but equal to zero, then the predictors will be (0,0).

## Wrap Up

That’s pretty much it for the nitty-gritty of multiple linear regression. Being a rather venerable machine learning tool, its internals are relatively straightforward. Nevertheless, you should find that it applies well to many real-world learning problems. Head over to the dashboard and give it a try!

This blog post, the fifth of our series of six posts about Linear regressions, focuses on those users that want to automate their Machine Learning workflows using programming languages. If you follow the BigML blog, you may already be familiar with WhizzML, BigML’s domain-specific language for automating Machine Learning workflows, implementing high-level Machine Learning algorithms, and easily sharing them with others. WhizzML helps developers create Machine Learning workflows and execute them entirely in the cloud. This avoids network problems, memory issues and lack of computing capacity while taking full advantage of WhizzML’s built-in parallelization. If you aren’t familiar with WhizzML yet, we recommend that you read the series of posts we published this summer about how to create WhizzML scripts: Part 1, Part 2 and Part 3 to quickly discover the benefits.

To help automate the manipulation of BigML’s Machine Learning resources, we also maintain a set of bindingswhich allow users to work in their favorite language (Java, C#, PHP, Swift, and others) with the BigML platform.

Let’s see how to use Linear Regressions through both the popular BigML Python Bindings and WhizzML. Note that the operations described in this post are also available in this list of bindings.

The first step is creating Linear Regressions with the default settings. We start from an existing Dataset to train the model in BigML so our call to the API will need to include the Dataset ID we want to use for training as shown below:

;; Creates a linearregression with default parameters
(define my_linearregression
(create-linearregression {"dataset" training_dataset}))

The BigML API is mostly asynchronous, that is, the above creation function will return a response before the Linear Regression creation is completed, usually the response informs that creation has started and the resource is in progress. This implies that the Linear Regression is not ready to predict with it right after the code snippet is executed, so you must wait for its completion before you can start with the predictions. A way to get it once it’s finished is to use the directive “create-and-wait-linearregression” for that:

;; Creates a linearregression with default settings. Once it's
;; completed the ID is stored in my_linearregression variable
(define my_linearregression
(create-and-wait-linearregression {"dataset" training_dataset}))

If you prefer to use the Python Bindings, the equivalent code is this:

from bigml.api import BigML
api = BigML()

my_linearregression = \
api.create_linearregression("dataset/59b0f8c7b95b392f12000000")

Next up, we will configure some properties of a Linear Regression with WhizzML. All the configuration properties can be easily added using property pairs such as <property_name> and <property_value> as in the example below. For instance, to create an optimized Linear Regression from a dataset, BigML sets the number of model candidates to 128. If you prefer a lower number of steps, you should add the property “number_of_model_candidates and set it to 10. Additionally, you might want to set the value used by the Linear Regression when numeric fields are missing. Then, you need to set thedefault_numeric_valueproperty to the right value. In the example below, it’s replaced by the mean value.

;; Creates a linearregression with some settings. Once it's
;; completed the ID is stored in my_linearregression variable
(define my_linearregression
(create-and-wait-linearregression {"dataset" training_dataset
"number_of_model_candidates" 10
"default_numeric_value" "mean"}))

NOTE: Property names always need to be between quotes and the value should be expressed in the appropriate type, a string or a number in the previous example. The equivalent code for the BigML Python Bindings becomes:

from bigml.api import BigML
api = BigML()
args = {"max_iterations": 100000, "default_numeric_value": "mean"}
training_dataset ="dataset/59b0f8c7b95b392f12000000"
my_linearregression = api.create_prediction(training_dataset, args)

For the complete list of properties that BigML offers, please check the dedicated API documentation.

Once the Linear Regression has been created,  as usual for supervised resources, we can evaluate how good its performance is. Now, we will use a different dataset with non-overlapping data to check the Linear Regression performance.  The “test_dataset” parameter in the code shown below represents the second dataset. Following the motto of “less is more”, the WhizzML code that performs an evaluation has only two mandatory parameters: a Linear Regression to be evaluated and a Dataset to use as test data.

;; Creates an evaluation of a linear regression
(define my_linearregression_ev
(create-evaluation {"linearregression" my_linearregression "dataset" test_dataset}))

Handy, right? Similarly, using Python bindings, the evaluation is done with the following snippet:

from bigml.api import BigML
api = BigML()
my_linearregression = "linearregression/59b0f8c7b95b392f12000000"
test_dataset = "dataset/59b0f8c7b95b392f12000002"
evaluation = api.create_evaluation(my_linearregression, test_dataset)

Following the steps of a typical workflow, after a good evaluation of your Linear Regression, you can make predictions for new sets of observations. In the following code, we demonstrate the simplest setting, where the prediction is made only for some fields in the dataset.

;; Creates a prediction using a linearregression with specific input data
(define my_prediction
(create-prediction {"linearregression" my_linearregression
"input_data" {"sepal length" 2 "sepal width" 3}}))

The equivalent code for the BigML Python bindings is:

from bigml.api import BigML
api = BigML()
input_data = {"sepal length": 2, "sepal width": 3}
my_linearregression = "linearregression/59b0f8c7b95b392f12000000"
prediction = api.create_prediction(my_linearregression, input_data)

In both cases, WhizzML or Python bindings, in the input data you can use either the field names or the field IDs. In other words, “000002”: 3 or “sepal width”: 3 are equivalent expressions.

As opposed to this prediction, which is calculated and stored in BigML servers, the Python Bindings (and other available bindings) also allow you to instantly create single local predictions on your computer or device. The Linear Regression information will be downloaded to your computer the first time you use it (connectivity is needed only the first time you access the model), and the predictions will be computed locally on your machine, without any incremental costs or latency:

from bigml.linearregression import Linearregression
local_linearregression = Linearregression("linearregression/59b0f8c7b95b392f12000000")
input_data = {"sepal length": 2, "sepal width": 3}
local_linearregression.predict(input_data)

It is similarly pretty straightforward to create a Batch Prediction in the cloud from an existing Linear Regression, where the dataset named “my_dataset” contains a new set of instances to predict by the model:

;; Creates a batch prediction using a linearregression 'my_linearregression'
;; and the dataset 'my_dataset' as data to predict for
(define my_batchprediction
(create-batchprediction {"linearregression" my_linearregression
"dataset" my_dataset}))

The code in Python Bindings that performs the same task is:

from bigml.api import BigML
api = BigML()
my_linearregression = "linearregression/59d1f57ab95b39750c000000"
my_dataset = "dataset/59b0f8c7b95b392f12000000"
my_batchprediction = api.create_batch_prediction(my_linearregression, my_dataset)

# Want to know more about Linear Regressions?

Our next blog post, the last one of this series, will cover how Linear regressions work behind the scenes, diving into the technical implementation aspects of BigML’s latest resource. If you have any questions or you’d like to learn more about how Linear Regressions work, please visit the dedicated release page. It includes links to this series of six blog posts, in addition to the BigML Dashboard and API documentation.

BigML, the leading Machine Learning platform, and GoHub from Global Omnium join forces with a strategic partnership to boost Machine Learning adoption throughout the startup and industry sectors. This partnership helps the tech and business sectors apply Machine Learning in their companies, provides them with Machine Learning education and helps them remain competitive in the marketplace.

BigML can now enjoy the GoHub offices, the new open innovation hub created by Global Omnium (the leading company in the water sector) and first startup accelerator specialized in Machine Learning with the collaboration of BigML. BigML, with headquarters in Corvallis, Oregon. USA, and Valencia, Spain, offers since 2011 the leading Machine Learning platform that helps almost 90.000 analysts, scientists and developers worldwide implement their own predictive applications with BigML technology.

Plenty of startups and many business sectors already apply Machine Learning techniques to automate processes in HR departments to hire the right employees; predict demand to avoid inventory problems; perform predictive maintenance to avoid production loss; timely fraud detection; predict energy savings, among many other Machine Learning applications in the real world.

BigML and GoHub will launch a full program of Machine Learning activities and events that will focus on providing the best quality content for institutions, big corporations, middle and small companies as well as startups. To achieve this goal, there will be events in Valencia, across Spain, and in several countries that will be announced shortly and will explain the impact that Machine Learning is having and can have on businesses for industry, finance, marketing, Human Resources, security, and more. All of them to be explained by companies that are already working with this technology. Additionally, there will be Machine Learning workshops to help ease the learning curve of those companies that wish to learn and implement Machine Learning.

Moreover, this partnership will encourage the creation of new synergies between products and services from both parties, GoHub and BigML. An example of this is the IoT Industrial platform Nexus Integra, which is already creating its own predictive application with BigML.

Francisco Martin, BigML’s CEO highlights: “I’m very happy that in Valencia there are companies like Global Omnium that champion and work hard for these disruptive technologies to succeed, allowing local talent to produce high-quality exportable technology bringing wealth to the city. With that, young talent won’t have the urge to emigrate to other countries in order to find a good job (as I did some time ago), which is a very positive development.”

“Innovation ecosystems, especially when there is a collaboration between startups and big corporations, are a great engine of growth and competitiveness. These kinds of partnerships are key to accelerating the potential of such ecosystems. Having BigML as a partner will allow the Valencian ecosystem to reach a higher level when it comes to disruptive technologies.”

In this fourth post of our series, we want to provide a brief summary of all the necessary steps to create a Linear Regression using the BigML API. As mentioned in our earlier posts, Linear Regression is a supervised learning method to solve regression problems, i.e., the objective field must be numeric.

The API workflow to create a Linear Regression and use it to make predictions is very similar to the one we explained for the Dashboard in our previous post. It’s worth mentioning that any resource created with the API will automatically be created in your Dashboard too so you can take advantage of BigML’s intuitive visualizations at any time.

In case you never used the BigML API before, all requests to manage your resources must use HTTPS and be authenticated using your username and API key to verify your identity. Find below a base URL example to manage Linear Regressions.

https://bigml.io/linearregression?username=$BIGML_USERNAME;api_key=$BIGML_API_KEY

You can find your authentication details in your Dashboard account by clicking in the API Key icon in the top menu.

The first step in any BigML workflow using the API is setting up authentication. Once authentication is successfully set up, you can begin executing the rest of this workflow.

export BIGML_USERNAME=nickwilson
export BIGML_API_KEY=98ftd66e7f089af7201db795f46d8956b714268a
export BIGML_AUTH="username=$BIGML_USERNAME;api_key=$BIGML_API_KEY;"

You can upload your data in your preferred format, from a local file, a remote file (using a URL) or from your cloud repository e.g., AWS, Azure etc. This will automatically create a source in your BigML account.

First, you need to open up a terminal with curl or any other command-line tool that implements standard HTTPS methods. In the example below, we are creating a source from a local CSV file containing some house data listed in Airbnb, each row representing one house’s information.

curl "https://bigml.io/source?$BIGML_AUTH" -F file=@airbnb.csv ### 2. Create a Dataset After the source is created, you need to build a dataset, which serializes your data and transforms it into a suitable input for the Machine Learning algorithm. curl "https://bigml.io/dataset?$BIGML_AUTH"
-X POST
-H 'content-type: application/json'
-d '{"source":"source/5c7631694e17272d410007aa"}'

Then, split your recently created dataset into two subsets: one for training the model and another for testing it. It is essential to evaluate your model with data that the model hasn’t seen before. You need to do this in two separate API calls that create two different datasets.

• To create the training dataset, you need the original dataset ID and the sample_rate  (the proportion of instances to include in the sample) as arguments. In the example below, we are including 80% of the instances in our training dataset. We also set a particular seed argument to ensure that the sampling will be deterministic. This will ensure that the instances selected in the training dataset will never be part of the test dataset created with the same sampling hold out.

curl "https://bigml.io/dataset?$BIGML_AUTH" -X POST -H 'content-type: application/json' -d '{"origin_dataset":"dataset/5c762fcd4e17272d4100072d", "sample_rate":0.8, "seed":"myairbnb"}' • For the testing dataset, you also need the original dataset ID and the sample_rate, but this time we combine it with the out_of_bag argument. The out of bag takes the (1- sample_rate) instances, in this case, 1-0.8=0.2. Using those two arguments along with the same seed used to create the training dataset, we ensure that the training and testing datasets are mutually exclusive. curl "https://bigml.io/dataset?$BIGML_AUTH"
-X POST
-H 'content-type: application/json'
-d '{"origin_dataset":"dataset/5c762fcd4e17272d4100072d",
"sample_rate":0.8, "out_of_bag":true, "seed":"myairbnb"}'

### 3. Create a Linear Regression

Next, use your training dataset to create a Linear Regression. Remember that the field you want to predict must be numeric. BigML takes the last numerical field in your dataset as the objective field by default unless it is specified. In the example below, we are creating a Linear Regression including an argument to indicate the objective field. To specify the objective field you can either use the field name or the field ID:

curl "https://bigml.io/linearregression?$BIGML_AUTH" -X POST -H 'content-type: application/json' -d '{"dataset":"dataset/68b5627b3c1920186f000325", "objective_field":"price"}' You can also configure a wide range of the Linear Regression parameters at creation time. Read about all of them in the API documentation. Usually, Linear Regressions can only handle numeric fields as inputs, but BigML automatically performs a set of transformations such that it can also support categorical, text and items input fields. Keep in mind that BigML uses dummy encoding by default, but you can configure other types of transformations using the different encoding options provided. ### 4. Evaluate the Linear Regression Evaluating your Linear Regression is key to measure its predictive performance against unseen data. You need the linear regression ID and the testing dataset ID as arguments to create an evaluation using the API: curl "https://bigml.io/evaluation?$BIGML_AUTH"
-X POST
-H 'content-type: application/json'
-d '{"linearregression":"linearregression/5c762c6b4e17272d42000617",
"dataset":"dataset/5c762f3a4e17272d41000724"}'

### 5. Make Predictions

Finally, once you are satisfied with your model’s performance, use your Logistic Regression to make predictions by feeding it new data. Linear Regression in BigML can gracefully handle missing values for your categorical, text or items fields.

In BigML you can make predictions for a single instance or multiple instances (in batch). See below an example for each case.

To predict one new data point, just input the values for the fields used by the Linear Regression to make your prediction. In turn, you get a prediction result for your objective field along with confidence and probability intervals.

curl "https://bigml.io/prediction?$BIGML_AUTH" -X POST -H 'content-type: application/json' -d '{"linearregression":"linearregression/5c762c6b4e17272d42000617", "input_data":{"room":4, "bathroom":2, ...}}' To make predictions for multiple instances simultaneously, use the Linear Regression ID and the new dataset ID containing the observations you want to predict. curl "https://bigml.io/batchprediction?$BIGML_AUTH"
-X POST
-H 'content-type: application/json'
-d '{"linearregression":"linearregression/5c762c6b4e17272d42000617",
"dataset":"dataset/5c128e694e1727920d00000c",
"output_dataset": true}'


If you want to learn more about Linear Regression please visit our release page for documentation on how to use Linear Regression with the BigML Dashboard and the BigML API. In case you still have some questions be sure to reach us at support@bigml.com anytime!

This is the third post of our Linear Regression series. BigML is bringing Linear Regression to the Dashboard so that you can solve regression problems with the help of powerful visualizations to inspect and analyze your results. Linear Regression is not only one of the best-known but also one of the best-understood supervised learning algorithms. It has its roots in statistics but gets utilized in machine learning quite a bit as well.

In this post we would like to walk you through the common steps to get started with Linear Regression:

As usual, start by uploading your data to your BigML account. BigML offers several ways to do it, you can drag and drop a local file, connect BigML to your cloud repository (e.g., S3 buckets) or copy and paste a URL. BigML automatically identifies the field types. Field types and other source parameters can alternatively be configured by clicking in the source configuration option.

### 2. Create a Dataset

From your source view, use the 1-click dataset option to create a dataset, a structured version of your data ready to be used by a Machine Learning algorithm.

In the dataset view, you will be able to see a summary of your field values, univariate statistics, and the field histograms to analyze your data distributions. This view is really useful to see any errors or irregularities in your data. You can also filter the dataset by several criteria and create new fields using different pre-defined operations as needed.

Once your data is clean and free of errors you can split your dataset into two different subsets: one for training your model, and the other for testing. It is crucial to train and evaluate your model with different data to ensure it generalizes well against unseen data. You can easily split your dataset using the BigML 1-click option, which randomly sets aside 80% of the instances for training and 20% for testing.

### 3. Create a Linear Regression

Now you are ready to create the Linear Regression using your training dataset. You can use the 1-click Linear Regression option, which will create the model using the default parameter values. However, if you are a more advanced user and you feel comfortable tuning the Linear Regression parameters, you can do so by using the configure Linear Regression option.

The list below gives a brief summary of each of the configuration parameters. If you want to learn more about them please check the Linear Regression documentation.

• Objective field: select the field you want to predict. By default, BigML will take the last valid field in your dataset. Remember it must be numeric!
• Default numeric value: if your numeric fields contain missing values, you can easily replace them with the field mean, median, maximum, minimum or zero using this option. It is inactive by default.
• Weight field: set instance weights using the values of the given field. The value in the weight field specifies the number of times that row should be replicated when including it in the model’s training set.
• Bias: include or exclude the intercept in the Linear Regression formula. Including it yields better results in most cases. It is active by default.
• Field codings: select the encoding option that works best for your categorical fields. BigML will automatically transform your categorical values into 0 -1 variables to support non-numeric fields as inputs, which is a method known as dummy encoding. Alternatively, you can choose from two other types of codings: contrast coding or other coding. You can find a detailed explanation of each one in the documentation.
• Sampling options: if you have a very large dataset, you may not need all the instances to create the model. BigML allows you to easily sample your dataset at the model creation time.

In terms of performance, the focus for Linear Regression is whether to include or exclude the bias term. All other parameters also depend on the data, the domain and the use case you are trying to solve. It’s natural that you want to understand the strengths and weaknesses of your model and iterate trying different features and configurations. To do this, the model visualizations explained in the next point would be very helpful.

When your Linear Regression has been created you can use BigML’s insightful visualizations to dive into the model results and see the impact of your features on model predictions.

BigML provides a 1D chart, a partial dependence plot (PDP) and a coefficient table to analyze your results.

1D Chart and PDP

Both 1D chart and PDP provide visual ways to analyze the impact of one or more fields on predictions.

For the 1D chart, you can select one numeric input field in the x-axis. In the prediction legend to the right, you will see the objective field predictions as you mouse over the chart area. The chart can also show the 95% prediction interval band in blue. This means, for any given point on the x-axis, its y value will be within this blue range with 95% probability. You can choose to show or hide the interval band.

For the PDP, you can select two input fields, either numeric or categorical, one per axis and the objective field predictions will be plotted in the color heat map chart.

By setting the values for the rest of the input fields using the form below the prediction legend, you will be able to inspect the combined interaction of multiple fields on predictions.

Coefficients table

BigML also provides a table to display the coefficients learned by the Linear Regression. A positive coefficient indicates a positive correlation between the input field and the objective field, while a negative coefficient indicates a negative relationship.

### 5. Evaluate the Linear Regression

Like any supervised learning method, Linear Regression needs to be evaluated. Just click on the evaluate option in the 1-click menu and BigML will automatically select the remaining 20% of the dataset that you set aside for testing.

The resulting performance metrics to be analyzed are the same ones as for any other regression models predicting a continuous value.

You will get three regression measures in the green boxed histograms: Mean Absolute Error, Mean Squared Error and R Squared. By default, BigML also provides the measures of two other types of models to compare against your model performance. One of them uses the mean as its prediction and the other predicts a random value in the range of the objective field. At the very least, you would expect your model to outperform these weaker benchmarks. You can choose to hide either or both of the benchmarks.

For a full description of the regression measures see the corresponding documentation.

### 6. Make Predictions

In BigML, you can make predictions for a new single instance or multiple instances in batches.

Single predictions

Click in the Predict option and set the values for your input fields.

A form containing all your input fields will be displayed and you will be able to set the values for a new instance. At the top of the view, you will see the objective field prediction changing as you change your input field values.

Batch predictions

Use the Batch Prediction option in the 1-click menu and select the dataset containing the instances for which you want to know the objective field value.

You can configure several parameters of your batch prediction such as the option to include both confidence interval and prediction interval in the batch prediction output dataset and file. When your batch prediction finishes you will be able to download the CSV file and see the output dataset.

In this second post of our series, we’ll cover a use case example. Least squares linear regression is one of the canonical algorithms in the statistical literature. Part of the reason for this is that it’s very good as a pedagogical tool. It’s very easy to visualize, especially in two dimensions, a line going through a set of points, and the distances from the line to each point representing the error of the classifier.  Kind people have even created nice animations to help you:

And herein, we have machine learning itself in a nutshell. We have the training data (the points), the model (the line), the objective (the distances from point to line) and the process of optimizing the model against the data (changing the parameters of the line so that those distances are small).

It’s all very tidy and relatively easy to understand . . . and then comes the day of the Laodiceans, when you realize that not every function is a linear combination of the input variables, and you must learn about gradient trees and deep neural networks, and your innocence is shattered forever.

Part of the reason to prefer these more complex classifiers to simpler ones like linear regression (and its sister technique for classification, logistic regression) is that they are often generalizations of the simpler techniques. You can, in fact, view linear regression as a certain type of neural network; specifically one with no hidden layers, non-linear activation functions, or fancy things like convolution and recurrence.

So, then, what’s the use in turning back to linear regression, if we already have other techniques that do the same and more besides?  Answers to this question often come in two flavors:

1. You need speed.  Fitting a neural network can take a long time whereas fitting a linear regression is near-instantaneous even for medium-sized datasets. Similarly for prediction:  A simple linear model will, in general, be orders of magnitude faster to predict than a deep neural network of even moderate complexity. Faster fits let you iterate more and focus on feature engineering.
2. You have small training data.  When you don’t have much training data, overfitting becomes more of a concern; complex classifiers may fit the data a bit better, but if you have small training data, your test sets are very small and so your estimates of goodness of fit become unreliable. Using models like linear regression reduces your risk of overfitting simply by giving you less variables to fit.

We’re going to focus on the second case for the rest of this blog post. One way of looking at this is the classic view in machine learning theory that the more parameters your model has, the more data you need to fit those properly.  This is a good and useful view. However, I find it just as useful to think about this from the opposite direction: We can use restrictive modeling assumptions as a sort of “stand in” for the training data we don’t have.

Consider the problem of using a decision tree to fit a line. We’ll usually end up with a sort of “staircase approximation” to the line. The more data we have, the tighter the staircase will fit the line, but we can’t escape the fact that each “step” in the staircase requires us to have at least one data point sitting on it, and we’ll never get a perfect fit.

This is unfortunate, but the upside is loads of flexibility. Decision trees don’t care a lick whether the underlying objective is linear or not; you can do the same sort of staircase approximation to fit any function at all.

Using linear regression allows us to sacrifice flexibility to get a better fit from less data.  Consider again the same line. How many points does it take from that line for linear regression to get a perfect fit?  Two. The minimum error line is the one and only line that travels through both points, which is precisely the line you’re looking for. No, we can’t fit all or even most functions with linear regressions, but if we restrict ourselves to lines, we can find the best fit with very little data.

### Linear Regression: More Power

Some of you may find the reasoning implied above to be a bit circular: “You can learn a very good model using linear regression, provided that you know in advance that a line is a good fit to the data.” It’s a fair point, but it can be surprising how often you find that this logic applies. It’s not odd to have a set of features, where changes in those features induce directly proportional changes in the objective, simply because those are the sorts of features amenable to machine learning in general. And in fact, these sorts of relationships abound, especially in the sciences, where linear and quadratic equations go much of the way towards predicting what happens in the natural world.

As an example, here’s a dataset of buildings that has measurements of the roof surface area and wall surface area for all of them, and the heat loss in BTUs for the building. Physics tells us that heat loss is proportional to these areas, and you break them out into roof and wall surface areas because those things are insulated differently. The dataset also has only 12 buildings, so we’ll use nine for training and three for test. Is it possible to get a reasonable model using so little data?

If we try a vanilla ensemble of 10 trees, we get an r-squared on the holdout set of 0.85. This isn’t bad, all things considered! Again, we’ve only got nine training points, so something so well-correlated to the objective is pretty impressive.

Now, let’s see if we can do better by making linear assumptions. After all, we said at the top that heat loss is, in fact, proportional to the given surface areas. Lo and behold, linear models serve us well: We are able to recover the “true” model for heat loss through a surface near-exactly.

One caveat here is that you’re only evaluating on three points, so it’s hard to know if the performance difference we see is significant. We might want to try cross-validation to see if the results continue to hold. However, Occam’s razor principle implores to choose the simpler model even if their performances are equal, and prediction will be faster to boot.

### Old Wine in New Bottles

Yes, linear regression is somewhat old-fashioned, and in this day and age where datasets are getting larger all the time, the use cases aren’t as many as they used to be.  We make a mistake, though, to equate “fewer” with “none”.  When you’ve got small data and linear phenomena, linear regression is still queen of the castle.

BigML’s upcoming release on Thursday, March 21, 2019, will be presenting our latest resource to the platform: Linear Regressions. In this post, we’ll do a quick introduction to General Linear Models before we move on to the remainder of our series of 6 blog posts (including this one) to give you a detailed perspective of what’s behind the new capabilities. Today’s post explains the basic concepts that will be followed by an example use case. Then, there will be three more blog posts focused on how to use Linear Regression through the BigML DashboardAPI, and WhizzML for automation. Finally, we will complete this series of posts with a technical view of how Linear Regressions work behind the scenes.

## Understanding Linear Regressions

Linear Regression is a supervised Machine Learning technique that can be used to solve, you guessed it, regression problems. Learning a linear regression model involves estimating the coefficients values for independent input fields that together with the intercept (or bias), determine the value of the target or objective field. A positive coefficient (b_i > 0), indicates a positive correlation between the input field and the objective field, while negative coefficients (b_i < 0) indicate a negative correlation. Higher absolute coefficient values for a given field can be interpreted to have a greater influence on final predictions.

By definition, the input fields (x_1, x_2, …, x_n) in the linear regression formula need to be numeric values. However, BigML linear regressions can support any type of fields by applying a set of transformations to categorical, text, and items fields. Moreover, BigML can also handle missing values for any type of field.

It’s perhaps fair to say linear regression is the granddaddy of statistical techniques that is required reading for any Machine Learning 101 student as it’s considered a fundamental supervised learning technique. Its strength is in its simplicity, which also implies it is pretty easy to interpret vs. most other algorithms. As such, it makes for a nice quick and dirty baseline regression model similar to Logistic Regression for classification problems. However, it is also important to grasp those situations linear regression may not be the best fit, despite its simplicity and explainability:

• It works best when the features involved are independent or to put it another way less correlated with one another.
• The method is also known to be fairly sensitive to outliers.  A single data point far from the mean values can end up significantly affecting the slope of your regression line, in turn, hurting the models chances to better generalize come prediction time.

Of course, using the standardized Machine Learning resources on the BigML platform, you can mitigate these issues and get more mileage from the subsequent iterations of your Linear Regressions. For instance,

• If you have many columns in your dataset (aka a wide dataset) you can use Principal component analysis (PCA) to transform such a dataset in order to obtain uncorrelated features.
• Or, by using BigML Anomalies, you can easily identify and remove the few outliers skewing your linear regression to arrive at a more acceptable regression line.

Here’s where you can find the Linear Regression capability on the BigML Dashboard:

## Want to know more about Linear Regressions?

If you would like to learn more about Linear Regressions and find out how to apply it via the BigML Dashboard, API or WhizzML please stay tuned for the rest of this series of blogs posts to be published in the next week.