The first edition of our Machine Learning Summer School in The Netherlands is here! On July 8-11, at Nyenrode Business Universiteit in Breukelen, executives, managers, decision makers, and technology and business professionals will get a good overview of how Machine Learning has evolved and where the industry is headed, both from a technical and a business perspective. We will learn the must-know core Machine Learning concepts and techniques, along with innovative real-world use cases, to understand how companies are already applying them. Moreover, we will cover the benefits and ways of adopting Machine Learning in organizations of any size. Finally, attendees will have the chance to immediately apply their newly learned skills as part of a practical workshop.

# The Distinguished Lecturers

#### Machine Learning: Why Now?

by Atakan Cetinsoy, VP of Predictive Applications at BigML.

#### Machine Learning for Managers

by Jan W. Veldsink of Nyenrode, Rabobank, and Grio. Check out his interview about the event here!

#### Machine Learning Put to Use

by Mercè Martín Prats, VP of Insights and Applications at BigML.

#### Automating your own Machine Learning Projects

by jao, Co-Founder and Chief Technology Officer at BigML.

Check the full agenda for more details on each talk and get your ticket today if you don’t have it yet!

Thirdware Inc is a leader in enterprise applications that has supported a range of Fortune 500 organizations as an implementation partner for infrastructure technologies such as enterprise resource planning, enterprise performance management, cloud services, and robotic process automation. Today, we are happy to announce our partnership with Thirdware Inc to accelerate enterprise adoption of Machine Learning.

The partnership is a continuation of Thirdware’s long-term growth strategy, which has been a key focus for Thirdware CEO, Bhavesh Shah. “Our mission at Thirdware has always been to build the most comprehensive portfolio of technology solutions to support the rapidly evolving ecosystem of emerging technologies relevant for the automotive industry. Given BigML’s pioneering work as a Machine Learning platform, we see the addition as a natural fit to bring substantial value to the automotive industry,” said Shah.

# A New Group Within Thirdware

In tandem with this partnership, Thirdware has formalized a new group within the company called Thirdware Labs and has brought on former Ernst & Young, Fiat Chrysler Automobiles, and Ford Motor Company executive Kristin Slanina as its first Chief Transformation Officer.

“I’ve spent my career in the heart of the evolving automotive industry and have seen the challenges of adopting emerging technology become a huge barrier in the boardroom, which continues to be a key topic in the C-suite today. We believe Machine Learning can catapult the mobility ecosystem, as well as other industry verticals, into a new universe of monetization. The BigML partnership represents a major step towards achieving that vision,” said Kristin Slanina, Thirdware’s Chief Transformation Officer.

# How BigML & Thirdware Will Add Value Together

For BigML, this partnership serves as an equally important milestone. The company will soon announce its 100,000 registered user milestone on its software-as-a-service Machine Learning platform. “Since the inception of BigML, the team and I have always focused on building the most complete, methodologically robust, and easy-to-use Machine Learning platform in the marketplace. Now, with Thirdware’s leadership and expertise in the enterprise, we are confident that we can further reduce the barriers that most teams within large companies have when it comes to solving problems using Machine Learning,” said Dr. Francisco Martin, Co-founder & Chief Executive Officer of BigML.

As it continues its momentum in 2019, there are two other areas in which Thirdware will continue to build capabilities in unison with its new Machine Learning offering: blockchain and connectivity. By the end of 2019, Thirdware Labs will extend its offerings in these emerging technologies to other industry verticals such as healthcare, finance, and energy. In many ways, the application of these emerging technologies across the industry verticals undergoing the most disruptive change represents the long-term vision of Thirdware.

Earlier this year, Thirdware tapped Mohammad Hamid, former Chief Executive Officer of Unison, to join its team as a principal within Thirdware Labs. “While there has been a proliferation of advanced technologies over the past 5-6 years, many large enterprises still struggle to evaluate, implement, and scale these technologies across multiple regions, hundreds of thousands of employees, and multiple business units. With the pedigree of Thirdware over the past few decades and partnerships with cutting-edge technology companies like BigML, we believe that Thirdware can bridge this gap,” said Hamid.

Commenting on the new partnership, Dr. José Antonio Ortega, BigML’s Co-founder and Chief Technology Officer, said: “Machine Learning is reaching a level of maturity that makes it feasible for any enterprise to adopt and automate sophisticated tasks formerly managed manually by trusted human experts. With BigML, this wave of innovation can be further standardized and streamlined, resulting in many robust and easily reproducible custom workflows, each augmenting a specific decision-making process. The combination of Thirdware’s pedigree in Enterprise IT services and BigML’s Machine Learning expertise will help countless enterprises make the transition to the Fourth Industrial Revolution by optimizing their businesses in a cost-effective manner.”

# Next Steps & Contact Information

The path ahead for Thirdware customers now includes the option to utilize the BigML platform. Furthermore, Thirdware will provide professional services related to BigML, including: data preparation, advanced feature engineering, advanced modeling and prediction strategy, model operationalization, and measuring the business impact of processes automated by machine-learned models. For more information on Thirdware’s Machine Learning capabilities or the broader portfolio of capabilities in emerging technologies, please contact

Our very first Machine Learning Summer School in The Netherlands (#DutchMLSchool) is taking place on July 8-11, 2019 in Breukelen, The Netherlands, in collaboration with Nyenrode Business University.

Today, we have two important announcements regarding the program that is shaping up to be our strongest one yet. Firstly, we are very happy to share that BigML’s Chief Scientist, one of the founders of the field of Machine Learning, Professor Tom Dietterich, will be presenting. Professor Dietterich’s historical perspective on the evolution and the current state of Machine Learning is unmatched, so the audience will be treated to quite a journey through the most salient topics in both applied research and advances in various industry verticals.

Our second addition as a lecturer is none other than Enrique Dans, Professor of Information Systems at IE Business School, who has previously participated in our Valencian Summer School in Machine Learning 2017 to great fanfare. Dr. Dans will share his expertise on the overall business impact and the future strategic planning implications of innovative technologies such as Machine Learning across our global digitized economy with relevant use case examples.

# Machine Learning Put to Use

One such great real-world example comes from Rabobank’s recent experience deploying a custom Machine Learning solution for fraud detection built on top of the BigML platform. Below is a video about the implementation approach and early results, as well as how the solution was adopted across the Rabobank organization, to whet your appetite for what’s to come your way should you attend the #DutchMLSchool.

As always, the program will blend a healthy mix of the most versatile Machine Learning techniques as well as real-world predictive use cases, so the participants will have a very grounded view of the possibilities that Machine Learning can unlock in their business contexts. Spread over the course of four days, attendees of different profiles will be able to find impactful content that best fits their roles:

• Machine Learning for Executives – July 8 (day 1): A C-level course on Machine Learning, ideal for business leaders and senior executives in all industries. Attendees will be able to understand how Machine Learning can be adopted in any organization, focusing on the strategy to follow as well as the key points that managers should know when making decisions.
• MAIN CONFERENCE: Introduction to Machine Learning – July 9 and 10 (days 2-3): A two-day crash course designed for business innovators, industry practitioners, as well as students, seeking a quick, practical, and hands-on introduction to Machine Learning to solve real-world problems.
• Working with the Masters – July 11 (day 4): A full day of learning with the Machine Learning masters that helps put theoretical concepts into practice in a hands-on manner. This course is tailored for experienced business analysts, data scientists, and Machine Learning practitioners that wish to work on real-world data. Attendees will be able to bring their own data.

If you don’t want to miss out on this great opportunity to add to your hands-on analytical skills and be more knowledgeable about this foundational technology, be sure to register and take advantage of our early bird rates today!

Two weeks ago, I had the chance to conduct a workshop at the University of California, Berkeley’s Haas School of Business as part of Professor Gregory La Blanc’s Data Science and Strategy class for MBAs and business leaders. This meant showcasing a subset of the comprehensive Machine Learning capabilities of the BigML platform, such as Models (Decision Trees), Logistic Regressions, and Ensembles, while solving example predictive use cases centered around disease diagnostics and credit risk analysis. The best part was that those in the classroom got to replicate those use cases in their own BigML accounts instead of passively observing.

According to the syllabus, the objective of the Data Strategy course is to provide an understanding of the role of data and statistical analysis in managerial decision-making, with a specific focus on the role of managers as both consumers and producers of information. It illustrates how finding and/or developing the right data and applying appropriate statistical methods can help solve problems in business. As such, the main focus areas are developing literacy within the potentially intimidating field of quantitative analytics and the ability to assess existing business models through that analytical prism.

As an MBA who has followed a career trajectory spanning highly data-driven roles such as marketing analytics, software product management, and business intelligence, I have consistently been the beneficiary of an empirical approach informed by insights based on business data harvested from various systems of record.

After the workshop, I’m very encouraged to have seen the conviction and the resolve of tomorrow’s MBA candidates to own up to the “In God we trust; all others must bring data” mentality. In addition to that broader impression, I’d like to share some findings from an informal survey shared with the attendees.

• The class had a good mix of those with technical degrees (engineering, math, etc.) and non-technical degrees.
• Based on survey feedback, more than two-thirds of the class did not have any prior experience with Machine Learning whatsoever. The remaining ones had some limited exposure in the form of self-learning or a related class they took as part of their former technical education. With that said, none had practiced Machine Learning in their prior careers. All in all, they were newbies to Machine Learning.
• On a very positive note, after the workshop, most respondents thought Machine Learning can be described as a more advanced form of analytics, while some opined that it’s also increasingly a must-learn skill set for any white-collar professional. Interestingly, no attendees mentioned that Machine Learning is too complex and confusing or “overhyped”, even though those were also offered as attitudinal choices. We’ve been observing this new behavior for multiple years now. Some refer to it as the Citizen Data Scientist movement; I don’t much fancy that phrase, but I am fully in support of the core concept it represents.
• Perhaps the most interesting feedback was related to the main motives for learning Machine Learning. Almost all respondents agreed that they would like to be able to better communicate with Machine Learning specialists or Data Engineers in their future jobs by having a good grasp of the core concepts of Machine Learning (e.g., to cut through ‘hype’ or jargon), as well as being self-sufficient when it comes to discovering insights in business data they have direct access to. Following those top two reasons was the perception that Machine Learning has become a skill highly desired by employers, potentially giving them an edge when re-entering the job market. Close behind that third motivation was the fact that some find Machine Learning intellectually stimulating regardless of its implications for their future career. I suspect those were skewed to the left-brained ones with technical degrees.
• Last but not least, almost everyone in the classroom thought that they were likely to use BigML especially when they are considering a new predictive use case where they have access to relevant business data.

I predict future business leaders will follow in the footsteps of examples like NDA Lynn such that they won’t be afraid to autonomously initiate and execute their search for new business insights with or without help from scientists and/or researchers in their organizations. We’ll keep tirelessly promoting the promise and potential of Machine Learning and see how far we can take this prediction.

BigML and Nyenrode Business Universiteit are thrilled to announce the first edition of our Machine Learning Summer School in The Netherlands! The four-day event will take place at Nyenrode Business University, in Breukelen, and the program is designed to cater to different professional profiles and their needs:

• Machine Learning for Executives – July 8 (day 1): A C-level course on Machine Learning, ideal for business leaders and senior executives in all industries. Attendees will be able to understand how Machine Learning can be adopted in any organization, focusing on the strategy to follow as well as the key points that managers should know when making decisions. Additionally, we will see several real-world success stories presented by companies that are currently applying Machine Learning techniques.
• MAIN CONFERENCE: Introduction to Machine Learning – July 9 and 10 (days 2-3): A two-day crash course designed for business innovators, industry practitioners, as well as students, seeking a quick, practical, and hands-on introduction to Machine Learning to solve real-world problems. The content presented during these two days will serve as a good introduction to the kind of work that students can expect if they enroll in advanced Machine Learning and AI Masters.
• Working with the Masters – July 11 (day 4): A full day of learning with the Machine Learning masters that helps put theoretical concepts into practice in a hands-on manner. This course is tailored for experienced business analysts, data scientists, and Machine Learning practitioners that wish to work on real-world data and real use cases; a unique opportunity to work with leading Machine Learning experts. Attendees will be able to bring their own data.

# Where

Nyenrode Business Universiteit, Straatweg 25, 3621 BG Breukelen, The Netherlands. See map here.

# When

4-day event: July 8-11, 2019, from 8:30 AM to 5:00 PM CEST.

# Tickets

Please purchase your ticket(s) here. We recommend that you register soon as space is limited. You can join the complete four-day event for a full experience or just the courses you find most interesting!

# Schedule

You can check out the full agenda and other details of the event here.

# Networking

Get to know the lecturers, speakers, and other attendees during the networking breaks and dinners we offer after the sessions. We expect hundreds of locals as well as Machine Learning practitioners and experts attending from all around the world!

Do not hesitate to contact us at education@bigml.com if you would like to co-organize a Machine Learning School in your city, as we look forward to growing the Machine Learning Schools series!

I took this photo at the Valencian Summer School in Machine Learning 2018. That was my second Summer School, but my first one as a BigML intern. My internship had started just a few days earlier. Since I published this tweet last September, things have changed a lot, but let me provide some context for it.

What happened between both Summer Schools? I realized that almost all my viewpoints about Machine Learning were wrong.

I belong to the most adaptive and agile generation ever. People call us Millennials. We were born during the dot-com bubble, and we lived through the dot-com crash. We saw the first iPhone keynote and the transformation from taxis to Ubers and hotels to Airbnbs. We know hype well, and we’re starting to learn how to separate hype from real value. It was at my first Valencian Summer School, and more specifically during Enrique Dans’ talk, that I decided to unlearn everything I had been told previously about Machine Learning.

I forgot about killer robots, machines replacing doctors, or trying to build KITT. Instead, I started to think about finding patterns in data that can help doctors make decisions, reduce energy waste, or save lives by preventing disasters.

In the same way, I forgot about unaffordable GPUs, the countless hours of programming every single line of every single ML algorithm, and the frustration of not being able to find the best hyperparameters for my model. Instead, I started to focus on the problem, not the tool, and let BigML do the rest for me. After all, why shy away from standing on the shoulders of giants?

And that was my philosophy during this internship. I got certified as a BigML Engineer, worked on multiple real-world use cases and created workflows with WhizzML to perform Feature Selection. And then, I met one of those giants to stand on, Jao, BigML’s CTO. With him, I started working on BigML’s backend, called wintermute.

I discovered the benefits of functional programming with Jao, and he even introduced me to the emacs religion! The experience I gained with WhizzML helped me to move forward and abandon the Algol family of languages. Clojuredocs was my homepage during those days, and it still is.

There is an interesting internal project in which I’ve been involved that I would also like to mention. It’s called Neuromancer. With Neuromancer, we can see how well our resources scale, beyond the Big-O notation. It lets us test possible optimizations for all of BigML’s models.

Looking back, the journey has been long, but this is only the beginning. Now, as a full-time employee of BigML, I will keep contributing to our mission of democratizing Machine Learning as it penetrates all corners of our globe. Like a bamboo plant, we planted it a while back on stable ground, and we now see a few new bamboo shoots growing each and every day. Soon enough, when the roots are fully established underground, it will grow like crazy, positively impacting Millennial careers for decades to come.

In my previous two posts in this series, I’ve essentially argued both sides of the same issue. In the first, I explained why deep learning is not a panacea, when machine learning systems (now and likely always) will fail, and why deep learning in its current state is not immune to these failures.

In the second post, I explained why deep learning, from the perspective of machine learning scientists and engineers, is an important advance: Rather than a learning algorithm, deep learning gives us a flexible, extensible framework for specifying machine learning algorithms. Many of the algorithms so far expressed in that framework give order-of-magnitude improvements over the performance of previous solutions. In addition, it’s a tool that allows us to tackle some problems heretofore unsolvable directly by machine learning methods.

For those of you wanting a clean sound bite about deep learning, I’m afraid you won’t get it from me. The reason I’ve written so much here is that I think the nature of the advance that deep learning has brought to machine learning is complex and defies broad judgments, especially at this fairly early stage in its development. But I think it is worth taking a step back to try to understand which judgments are important and how to make them properly.

### Flaky Machines or Lazy People?

This series of posts was motivated in part by my encounters with Gary Marcus’ perspectives on deep learning. At the root of his positions is the notion that deep learning (and here he means “statistical machine learning”) is, in various ways, “not enough”. In his Medium post, it’s “not enough” for general intelligence, and in the Synced interview it’s “not enough” to be “reliable”.

This notion of whether current machine learning systems are “good enough” gets to the heart of the back and forth on deep learning. Marcus cites driverless cars as an example of how AI isn’t mature enough yet to rely on 100%, and that AI needs a “foundational change” to ensure a safe level of reliability. There’s a bit of ambiguity in the interview about what he means by AI, but my own impression is that this is less of a critique of machine learning, and more of a critique of the software around it.

For example, we have vision systems able to track and identify pedestrians on the road. These systems, as Marcus says, are mostly reliable but certainly make occasional mistakes. The job of academic and corporate researchers is to create these systems and make them as error-free as possible, but in the long run, they will always have some degree of unreliability.

Something consumes the predictions of these vision systems and acts accordingly; it is and always will be the job of that thing to avoid treating these predictions as the unvarnished truth. If the predictions were guaranteed to be correct, the consumer’s job would be much easier. As it is, consuming the predictions of a vision system requires some level of cleverness and skepticism. Maybe that cleverness involves awareness of separate sensor systems or other information streams like location and time of day. It might require symbolic approaches of the type Marcus favors. It might require more and very different deep learning, as Yann LeCun suggests. It might require something that’s entirely new.

Designing software that works properly with machine-learned models is hard. You have to do the difficult work of characterizing the model’s weaknesses and engineering around them. But critical readers should reject the notion that machine learning needs to provide extreme reliability on its own in order to be useful in mission critical situations. If a vision system can accurately find and track 95% of pedestrians, and other sensors and logic pick up the remaining 5%, you’ve arrived at “enough” without having a perfect model.

### When is “Enough” Enough?

So then the question becomes, “are we there yet?” with current ML systems. That depends, of course, on how good we think we need them to be for the engineers and domain experts to pull their outputs across the finish line. There are a lot of areas in which deep learning puts us within shouting distance, but in general, whether or not we’re there yet depends in turn on what you want the system to do and the quality of your engineers. When thinking about that question, though, it’s important to consider that the finish line might not be exactly where you think it is.

Consider the problem of machine translation. Douglas Hofstadter wrote a great article where he systematically lays bare the flaws in state-of-the-art machine translation systems. He’s right: For linguistic ideas with even a little complexity, they’re not great and are at times totally unusable. But the whole article reminded me of a blog post Hal Daumé III wrote more than 10 years ago, when he and I were both recent Ph.D.s. In it, he wonders how much of human translation is really better than computer translation when you consider everything (street signs, menus, simple interpersonal interactions, and so on). Again, he asked this more than ten years ago.

The point here is that if machine translation for these things is already noticeably better than the second-rate human translations we apply in practice (or was ten years ago), there’s already a sense in which the models we have are very much good enough. How it deals with more complex phrases and ideas is an interesting question, and might yield new research directions, but this is all academic as far as its applicability is concerned. The existing technology, imperfect as it is, has a use and a place in society.

Even less relevant is how “deep” the model’s knowledge is, or how “stupid” it is, or whether the algorithm is “actually learning” (whatever that means). These are all flavors of the “computers don’t really understand what they’re doing” argument that traces its way through Hofstadter, John Searle, Alan Turing, and dozens of other philosophers all the way back to Ada Lovelace. There are loads of counter-arguments (I have even spun out a few of my own versions), but maybe the most compelling reason to ignore these questions is that the answers are often less interesting than the answer to the question, “Can we use it?”

A number of years ago, my wife and I hosted two members of a Belgian boys choir that was on tour. Neither she nor I spoke any French, so we relied on Google Translate to communicate with them. To this day, I remember typing “We made a pie. Would you like some?” into my phone and watching their faces light up as the translation appeared. Did the computer understand anything about pie, or generosity, or the happiness of children, or how its own flawed translations could help create indelible memories? Probably not. But we did!

### The Final Exam

The criticism that machine learning is not enough on its own to produce systems that exhibit reliably intelligent behavior is a broken criticism. Deep learning gets us part of the way towards such systems, perhaps quite a lot of the way, but does anyone think it’s necessary or even advisable to cede the entire behavior of, say, a car to a machine-learned model? Saying no doesn’t mean backing away from a fully-autonomous car; as Marcus himself points out, there are other techniques in AI and software at large that are better suited to certain aspects of these problems. There can be many layers of human-comprehensible logic sitting between deep learning and the gas pedal, and it’s likely the totality of the system, rather than the learned component alone, that will display behavior that we might recognize as intelligent.

Is it a flaw or a problem with deep learning when it can’t solve the aspects of these problems that no one really wants or needs solved?  I don’t think so. Again, paraphrasing Marcus (and myself), machine learning is a tool. If you buy a nail gun and it jams, then yeah, that’s a problem with the nail gun, but if you try to use a nail gun to cut a piece of wood in half, that’s more of a problem with you. Deep learning is a very important step forward in the evolution of the tool (and a large one compared to other recent steps), but that step doesn’t change its fundamental nature. No matter what improvements you make, a nail gun is never going to become a table saw. Certainly, it’s unethical and bad business for tool manufacturers to make inflated claims about their tool’s usefulness, but it’s finally the job of the operator to determine which tool to use and how to use it.

Pundits can argue all day long about how impactful deep learning is and how smart machine learning can possibly be, but none of those arguments will matter in the long run. As I’ve said before, the only real test of the usefulness of machine learning is if domain experts and data engineers can leverage it to create software that has value for other human beings. Therein lies the power, the only real power, of new technology and the only goal that counts.

In the first in this series of posts, I discussed a bit about why deep learning isn’t fundamentally different from the rest of machine learning, and why that lack of a difference implies many contexts in which deep networks will fail to perform at a human level of cognition, for a variety of reasons.

Is there no difference, then, between deep learning and the rest of machine learning? Is Stuart Russell right that deep learning is nothing more than machine learning with faster computers, more data, and a few clever tricks, and is Rich Sutton right that such things are the main drivers behind machine learning’s recent advances?

Certainly, there’s a sense in which that’s true, but there’s also an important sense in which it’s not. More specifically, deep learning as it stands now represents an important advance in machine learning for reasons mostly unrelated to access to increasing amounts of computation and data. In the last post, we covered how understanding that deep learning is the same as the rest of machine learning is the key to knowing some of the problems that deep learning does not solve. Here, let’s try to understand how deep learning is different, and maybe along the way we’ll find some problems that it solves better than anything else.

### How To Create a Machine Learning Algorithm

Before talking about deep learning, it’s important to know how existing machine learning algorithms have been created by the academic community. Oversimplifying dramatically, there’s usually a two-step process:

1. Come up with some objective function
2. Find a way to mathematically optimize that function with respect to training data

In machine learning parlance, the objective function is usually any way of measuring the performance of your classifier on some data. So one objective function would be “What percent of the time does my classifier predict the right answer?” When we optimize that objective, we mean that we learn the classifier’s internal workings so that it performs well on the training data by that measurement. If the measurement were “percent correct”, as above, it means that we learn a classifier that gets the right answer on the training data all of the time, or as close to it as possible.
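To make the two-step recipe concrete, here is a toy sketch of my own (not from the post, and far simpler than anything used in practice): the objective is “percent correct” for a one-feature threshold classifier, and the “optimization” is a brute-force search over candidate thresholds.

```python
# Toy objective function: "percent correct" on the training data.
def accuracy(threshold, xs, ys):
    preds = [1 if x >= threshold else 0 for x in xs]
    return sum(p == y for p, y in zip(preds, ys)) / len(ys)

# Tiny made-up training set: the label flips to 1 once the feature reaches 4.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0, 0, 0, 1, 1, 1]

# Step 2, "optimization", done by brute force: try each data point
# as a candidate threshold and keep the one with the best objective.
best = max(xs, key=lambda t: accuracy(t, xs, ys))
print(best, accuracy(best, xs, ys))  # → 4.0 1.0
```

Real learners replace the brute-force search with something far cleverer, but the shape of the recipe, a measurable objective plus a procedure that improves it on the training data, is the same.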

An easy, concrete example is ordinary least squares regression: The algorithm seeks to learn weights to minimize the squared difference between the model’s predictions and the truth in the data. There’s the objective function (the sum of the squared differences) and the method used to optimize it (ordinary least squares).
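As a quick numeric sketch of that example (the data here is made up, and I am letting NumPy's least-squares solver do the minimization):

```python
import numpy as np

# Made-up data generated roughly by y = 2x + 1.
X = np.array([[1.0, 1.0], [2.0, 1.0], [3.0, 1.0], [4.0, 1.0]])  # 2nd column = intercept term
y = np.array([3.1, 4.9, 7.2, 9.0])

# The OLS objective is sum((X @ w - y) ** 2); lstsq returns the
# weight vector w that minimizes it.
w, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(w)  # slope ≈ 2.0, intercept ≈ 1.05
```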

There are a lot of different variations on this theme. For example, most versions of support vector machines for classification are based on the same basic objective, but there are a lot of algorithms to optimize that objective, from quadratic programming approaches like the simplex and interior point methods, to cutting plane algorithms, to sequential minimal optimization. Decision trees admit a lot of different criteria to optimize (information gain, Gini index, etc.), but given a criterion, the method of optimizing it is basically the same.
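As an illustration of one of those decision-tree criteria, here is a small sketch of my own computing the Gini impurity of a hypothetical split (the labels are toy data, not from the post):

```python
def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

# A hypothetical split sends four training points left and three right.
left = ["yes", "yes", "yes", "no"]
right = ["no", "no", "yes"]

# A tree learner scores the split by the weighted impurity of its children;
# lower is better, and the best-scoring split wins.
n = len(left) + len(right)
score = len(left) / n * gini(left) + len(right) / n * gini(right)
print(round(score, 4))  # → 0.4048
```

Whatever the criterion, the optimization step is the same greedy loop: score every candidate split and keep the best one.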

Importantly, these two steps don’t usually happen in sequence, because one depends on the other: When you’re thinking of an objective function, you immediately put aside ones that you know you can’t mathematically optimize, because there’s really no use for a function that you can’t optimize against the data that you have. It turns out that the majority of objective functions you’d like to use can’t be efficiently optimized mathematically, and in fact, much of the machine learning academic literature is devoted to finding clever objective functions or new and faster ways of optimizing existing ones.

### From Algorithms to Objectives

Now we can understand the first way in which deep learning is rather different from the usual way of doing machine learning. In deep learning, we typically use only one basic way of optimizing the objective function: a family of algorithms known collectively as gradient descent. Gradient descent comes in many variants, but they all rely on knowing the gradient of the objective function with respect to every parameter in the model. This, of course, means calculus, and calculus is hard.
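A minimal sketch of the idea, assuming a one-parameter model y = w·x, illustrative data, and an illustrative learning rate:

```python
# A minimal sketch of gradient descent on a one-parameter model y = w * x,
# minimizing squared error. Data and learning rate are illustrative.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]   # generated by w = 2

w = 0.0                 # initial guess
lr = 0.01               # learning rate
for _ in range(500):
    # Gradient of sum((y - w*x)^2) with respect to w is sum(-2*x*(y - w*x))
    grad = sum(-2 * x * (y - w * x) for x, y in zip(xs, ys))
    w -= lr * grad      # step against the gradient

assert abs(w - 2.0) < 1e-3   # converges to the true slope
```

With one parameter, the derivative is easy to write by hand; the trouble starts when there are millions of parameters arranged in an arbitrary topology.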

In deep networks, this is even harder because of the variety of activation functions, topologies, and types of connections. The programmer needs to know the derivative of the objective with respect to every parameter in every possible topology in order to optimize it, and then they also need to know the ins and outs of all of the gradient descent algorithms they want to offer. Engineering-wise, it’s a nightmare!

The saving grace here is that the calculus is very mechanical: Given a network topology in all of its gory detail, the process of calculating the gradients is based on a set of rules and it’s not a difficult calculation, just massive and tedious and prone to small mistakes. So a whole bunch of somebodies finally buckled down, got the collection of rules together and turned that tedious process into computer code. The upshot is that now you can use programming to specify any type of network you want, including some objective, then just pass in the data and all of the calculus is figured out for you.

This is called automatic differentiation, and it’s the main thing that separates deep learning frameworks like Theano, Torch, Keras, and so on from the rest of computation. You just have to be able to specify your network and your objective function in one of these frameworks, and the software “knows” how to optimize it.
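To get a feel for how mechanical the calculus is, here is a toy sketch of forward-mode automatic differentiation using dual numbers. Real frameworks use far more sophisticated (typically reverse-mode) machinery, but the principle is the same: each operation carries its own derivative rule.

```python
# A toy sketch of the mechanical nature of differentiation: forward-mode
# automatic differentiation with dual numbers. Each arithmetic operation
# propagates both a value and a derivative according to its own rule.
class Dual:
    def __init__(self, value, deriv):
        self.value = value   # f(x)
        self.deriv = deriv   # f'(x)

    def __add__(self, other):
        # Sum rule: (f + g)' = f' + g'
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        # Product rule: (fg)' = f'g + fg'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

def grad(f, x):
    """Evaluate f at x while propagating the derivative automatically."""
    return f(Dual(x, 1.0)).deriv

# d/dx of (x * x + x) at x = 3 is 2*3 + 1 = 7
assert grad(lambda x: x * x + x, 3.0) == 7.0
```

The caller only specifies the function; the derivative falls out of the composition of rules, with no calculus done by hand.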

Put another way, instead of telling the computer what to do you’re now just telling it what you want.

### More Than What You Need

It’s hard to overstate how much more flexible this is as an engineering approach than what we used to do. Before deep learning, you’d look at the rather small list of problems that machine learning was able to solve and try to cram your specific problem into one of those bins. Is this a classification or regression problem? Use boosting or SVMs. Is it a clustering problem? Use k-means or g-means. Is it metric learning? LMNN or RCA! Collaborative filtering? SVD or PLSI! Label sequence learning? HMMs or CRFs or even M3Ns! Depending on what your inputs and desired outputs look like, you choose an acronym and off you go.

This works well for a lot of things, but for others, there’s just not a great answer. What if your input is a picture and your output is a text description of that picture? Or what if your input is a sequence of words and your output is that sequence in a different language? Before, you’d say things like, “Well, even though it’s sequences, this is kind of a classification problem, so we can preprocess the input with a windowing function, then learn a classifier, then transform the outputs, then learn another classifier on those outputs, then apply some sort of smoothing” and so on. The alternative to this sort of shoehorning the problem into an existing bag was to design a learning algorithm from scratch, but both options come to much the same thing: A big job for someone with loads of domain knowledge and machine learning expertise.

With a deep learning framework, you just say “this is the input, here’s what the output should be, and here’s how to measure the difference between the model’s output and what it should be. Give me a model.” Does this always work? Certainly not; all of those acronyms are not going away anytime soon. But the fact that deep learning can often be made to work for problems of any of the types above with very little in the way of domain knowledge or mathematical insight, that’s a powerful new capability.

Perhaps you can also see how much more extensible this is vs. the previous way of doing things. Any operation that is differentiable can be “thrown onto the pile”, its rules added to the list of rules for differentiation, and then it can become a part of any network structure. One very important operation currently on the pile is convolution, which is the current star of the deep learning show. This isn’t a scientific innovation; that convolution kernels can be learned via gradient descent is almost 30-year-old news, but in the context of an engine that can automatically differentiate the parameters of any network structure, you end up using them in combination with things like residual connections and batch normalization, which pushes their performance to new heights.

Maybe just as important as the flexibility of deep learning frameworks is the fact that the gradient descent typically happens in tiny steps, where a small slice of the training data updates the classifier at each one. This may not seem like a big deal, but it means that you can take advantage of an effectively infinite amount of training data. This allows you to use a simulator to augment your training data and make your classifier more robust, which is a natural fit in areas like computer vision and game playing. Other machine learning algorithms have the same sort of update behavior, but given the complexity of the networks needed to solve some problems and the amount of data needed to properly fit them, data augmentation becomes a game-changing necessity.

You’re probably starting to realize now that “Deep Learning” isn’t really a machine learning algorithm, per se. Closer to the truth is that it’s a language for expressing machine learning algorithms, and that language is still getting more expressive all the time. This is partially what our own Tom Dietterich means when he says that we don’t really know what the limits are for deep learning yet. It’s tough to see where the story ends if only because its authors are still being provided with new words they could say. To say that something will never be expressible in an evolving language such as this one seems premature.

### Some Seasoning

Now, even considering the huge caveat that is the first post in this series, I’d like to put forth a couple of additional grains of salt. First, the above narrative gives the impression of complete novelty to the compositional nature of deep learning, but that’s not entirely so. There have been previous attempts in the machine learning literature to create algorithmic frameworks of this sort. One such attempt that comes to mind is Thorsten Joachims’ excellent work developing SVM-struct, which does for support vector machines much of what automatic differentiation does for deep learning. While SVM-struct does allow you to attack a diverse set of problems, it lacks the ability to incorporate new operators that constantly expand the space of possible internal structures for the learned model.

Second, admittedly, I may have oversimplified things a bit. The complexity of creating or selecting a problem-specific machine learning algorithm has not disappeared entirely. It’s just been pushed into the design of the network architecture: Instead of having a few dozen off the shelf algorithms that you must understand, each with five or ten parameters, you’ve now got the entire world of deep learning to explore with each new problem you encounter. For problems that already have well-defined solutions, deep learning’s flexibility can be more of a distraction than an asset.

### The Proof Is In The Pudding

All of that said, it would be silly to talk about any of this if it hadn’t led to some absurd gains in performance in marquee problems. I’m old enough to have been a young researcher in 2010, when state-of-the-art results were claimed on CIFAR-10 at just over 20% error. Now, multiple papers claim error rates under 3% and even more claim sub-20% error on ImageNet, which has 100 times as many classes. We see similar sorts of improvement in object detection, speech recognition, and machine translation.

In addition, this formalism allows us to apply machine learning directly to a significantly broader set of possible problems like question answering, or image denoising, or <deep breath> generating HTML from a hand-drawn sketch of a web page.

Even though I saw that last one for the first time a year ago, it still makes me a bit dizzy to think about it. To someone in the field who spent countless hours learning how to engineer the domain-specific features needed to solve these problems, and countless more finessing classifier outputs into something that actually resembled a solution, seeing these algorithms work, even imperfectly, borders on a religious experience.

Deep learning is revolutionary from a number of angles, but most of those are viewable primarily from the inside of the field rather than the outside. Deep learning as it is now is not an “I’m afraid I can’t do that”-level advance in the field. But for those of us slaving away in the trenches, it can look very special. Even the closing barb of Gary Marcus’ Synced interview has a nice ring to it:

They work most of the time, but you don’t know when they’re gonna do something totally bizarre.

Really? Machine learning? Working most of the time? It’s music to my ears! I think Marcus is talking about driverless cars here, but roughly the same thing could be said of state-of-the-art speech recognition or image classification. The quote is an unintentional compliment to the community; Marcus doesn’t seem to be aware of how recent this level of success is, how difficult it was to get here, and how big of a part deep learning played in the most recent ascent. Why yes, it is working most of the time! Thank you for noticing!

With regard to the second half of that quote, clearly there’s work left to be done, but what does the rest of that work look like and who gets to decide? In the final post in this series, I’ll speculate on what the current arguments about machine learning say about its use and its future. So stay tuned…

Gary Marcus has emerged as one of deep learning’s chief skeptics. In a recent interview, and a slightly less recent medium post, he discusses his feud with deep learning pioneer Yann LeCun and some of his views on how deep learning is overhyped.

I find the whole thing entertaining, but at many times LeCun and Marcus are talking past each other more than with each other. Marcus seems to me to be either unaware of or ignoring certain truths about machine learning and LeCun seems to basically agree with Marcus’ ideas in a way that’s unsatisfying for Marcus.

The temptation for me to brush 10 years of dust off of my professor hat is too much to ignore. Outside observers could benefit greatly from some additional context in this discussion and in this series of posts I’ll be happy to provide some. Most important here, in my opinion, is to understand where the variety of perspectives come from, and where deep learning sits relative to the rest of machine learning. Deep learning is both an incremental advance and a revolutionary one. It’s the same old stuff and something entirely new. Which one you see depends on how you choose to look at it.

### The Usual Awfulness of Machine Learning

Marcus’ post, The Deepest Problem with Deep Learning, is written partly in response to Yoshua Bengio’s recent-ish interview with Technology Review. In the post, Marcus comes off as a bit surprised that Bengio is circumspect about deep learning’s long-term prospects, and goes on to reiterate some of his own long-held criticisms of the field.

Most of Marcus’ core arguments about deep learning’s weaknesses are valid and maybe more uncontroversial than he thinks: All of the problems with deep learning that he mentions are commonly encountered by practitioners in the wild. His post doesn’t belabor these arguments. Instead, he spends a good deal of it suggesting that the field is either in denial or deliberately misleading the public about the strengths and weaknesses of deep learning.

Not only is this incorrect, but it also unnecessarily weakens his top line arguments. In short, the problems with deep learning are worse than Marcus’ post suggests, and they are problems that infect all of machine learning. Alas, “confronting” academics with these realities is going to be met with a sigh and a shrug, because we’ve known about and documented all of these things for decades. However, it’s more than possible that, with the increased publicity around machine learning in the last few years, there are people out there who are informed about the field at a high-level while only tangentially aware of its well-known limitations. Let’s review those now.

### What Machine Learning Still Can’t Do

By now, examples of CNN-based image recognition being “defeated” by various unusual or manipulated input data should be old news. While the composition of these examples is an interesting curiosity to those in the field, it’s important to understand why they are surprising to almost no one with a background in machine learning.

Consider the following fake but realistic dataset of eight people, in which we know the height, weight, and number of pregnancies for eight people, and we want to predict their sex based on those variables:

| Height (in.) | Weight (lbs.) | Pregnancies | Sex |
|---|---|---|---|
| 72 | 170 | 0 | M |
| 71 | 140 | 0 | M |
| 74 | 250 | 0 | M |
| 76 | 300 | 0 | M |
| 69 | 160 | 0 | F |
| 65 | 140 | 2 | F |
| 60 | 100 | 1 | F |
| 63 | 150 | 0 | F |

Any reasonable decision tree induction algorithm will find a concise classifier (Height > 70 = Male else Female) that classifies the data perfectly. The model is certainly not perfect, but also not a terrible one by ML standards, considering the amount of data we have. It will almost certainly perform much better than chance at predicting peoples’ sex in the real world. And yet, any adult human will do better with the same input data. The model has an obvious (to us) blind spot: It doesn’t know that people over 5’10” who have been pregnant at least one time are overwhelmingly likely to be female.
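As an illustrative sketch of how a tree induction algorithm might find that split, here is a brute-force decision stump learner in Python. On this data it settles on a threshold of 69, which is equivalent to Height > 70 on the training examples:

```python
# A sketch of single-split decision tree induction on the toy dataset above:
# try every threshold on the height field and keep the one with the fewest
# misclassifications on the training data.
heights = [72, 71, 74, 76, 69, 65, 60, 63]
sexes   = ["M", "M", "M", "M", "F", "F", "F", "F"]

def errors(threshold):
    """Misclassifications for the rule: height > threshold => Male."""
    return sum(("M" if h > threshold else "F") != s
               for h, s in zip(heights, sexes))

best = min(sorted(set(heights)), key=errors)
assert errors(best) == 0   # the split classifies the training data perfectly
```

Notice that the learner never even has to look at the pregnancies column: height alone separates the training data, so the cheapest perfect rule wins.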

This can easily be phrased in a more accusatory way: Even when given training data about men and women and the number of pregnancies each person has had, the model fails to encode any information at all about which sex is more likely to get pregnant!

It sounds pretty damning in those words; the model’s “knowledge” turns out to be incredibly shallow. But this is not a surprise to people in the field. Machine learning algorithms are by design parsimonious, myopic, and at the mercy of the amount and type of training data that you have. More problems are exposed when we allow the case of adversarially selected examples, where you are allowed to present examples constructed or chosen to “fool” the model. I’ll leave it as an exercise for the reader to calculate how well the classifier would do on a dataset of WNBA players and Kentucky Derby jockeys.

### Enter Deep Learning, To No Fanfare At All

Deep learning is not different (at least in this way) from the rest of statistical learning: All of the adversarial examples presented in the image recognition literature are more or less the same as the 5’11” person who’s been pregnant; there was nothing like that in the dataset, so there’s no good reason to expect the model would get it right, despite the “obviousness” of the right answer to us.

There are various machine learning techniques for addressing bits and pieces of this problem, but in general, it’s not something easily solvable within the confines of the algorithm. This isn’t a “flaw” in the algorithm per se; the algorithm is doing what it should with the data that it has. Marcus is right when he says that machine-learned models will fail to generalize to out-of-distribution inputs, but, I mean, come on. That’s the i.i.d. assumption! It’s been printed right there on the tin for decades!

Marcus’ assertion that “In a healthy field, everything would stop when a systematic class of errors that surprising and illuminating was discovered” presupposes that researchers in the field were surprised or problems illuminated by that particular class of errors. I certainly wasn’t, and my intuition is that few in the field would be. On the contrary, if you show me those images without telling me the classifier’s performance, I’m going to say something like “that’s going to be a tough one for it to get right”.

In the back-and-forth on Twitter, Marcus seems stung that the community is “dismissive” of this type of error, and scandalized that the possibility of this type of error isn’t mentioned in the landmark Nature paper on deep learning, and herein, I think, lies the disconnect. For the academic literature, this is too mundane and well-known of a limit to bother stating. Marcus wants a field-wide and very public mea culpa for a precondition of machine learning that was trotted out repeatedly during our classes in grad school. He will probably remain disappointed. Few in the community will see the need to restate that limitation every time there’s a new advance in machine learning; the existence of that limit is a part of the context of every advance, as much as the existence of computers themselves.

For communications with the public at large outside of the field, though, perhaps Marcus is right that such limits could take center stage a bit more often (as Bengio rightly puts them in his interview). Yes, it’s true! You can almost always find a way to break a machine learning model by fussing with the input data, and it’s often not even very hard! One more time for the people in the back:

People who think deep learning is immune to the usual problems associated with statistical machine learning are wrong, and those problems mean that many machine learning models can be broken by a weak adversary or even subtle, non-adversarial changes in the input data.

This makes machine learning sound pretty crummy, and again elicits quite a bit of hand-wringing from the uninitiated.  There are breathless diatribes about how machine learning systems can be, horror of horrors, fooled into making incorrect predictions!  They’re not wrong; if you’re in a situation where you think such trickery might be afoot, that absolutely has to be dealt with somewhere in your technology stack.  Then again, this is so even if you’re not using machine learning.

Fortunately, there are many, many cases where this sort of brittleness is just not that much of a problem. In speech recognition, for example, there’s no one trying to “fool” the model and languages don’t typically undergo massive distributional changes or have particularly strange and crucial corner cases. Hence, all speech recognition systems use machine learning and the models do well enough to be worth billions of dollars.

Yes, all machine-learned models will fail somehow. But don’t conflate this failure with a lack of usability.

### Not Even Close

I won’t go as deeply into Marcus’ other points (such as the limits on the type of reasoning deep learning can do or its ability to understand natural language) in detail, but I found it interesting how closely those points coincide with someone else’s arguments about why “strong AI” probably won’t happen soon. That was written before I’d even heard of Gary Marcus, and the relevant section is composed mostly of ideas that I heard many times over the course of my grad school education (which is now far –  disturbingly far – in the past). Yes, these points are again valid, but among people in the field, they again have little novelty.

By and large, Marcus is right about the limitations of statistical machine learning, and anyone suggesting that deep learning is spectacularly different on these particular axes is at least a little bit misinformed (okay, maybe it’s a little bit different). For the most part, though, I don’t see experts in the field suggesting this. Certainly not to the pathological levels hinted at by Marcus’ Medium post. I do readily admit the possibility that, amid the glow of high-profile successes and the public spotlight, all of the academic theory and empirical results showing exactly how and when machine learning fails may get lost in the noise, and hopefully I’ve done a little to clarify and contextualize some of those failures.

So is that it, then? Is deep learning really nothing more than another thing drawn from the same bag of (somewhat fragile) tricks as the rest of machine learning?  As I’ve already said, that depends on how you look at it. If you look at it as we have in this post, yes, it’s not so very different. In my next post, however, we’ll take a look from another angle and see if we can spot some differences between deep learning and the rest of machine learning.

BigML has added multiple linear regression to its suite of supervised learning methods. In this sixth and final blog post of our series, we will give a rundown of the technical details for this method.

## Model Definition

Given a numeric objective field $y$, we model its response as a linear combination of our inputs $x_1,\cdots,x_n$, and an intercept value $\beta_0$.

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n = \beta_0 + \sum_{i=1}^n \beta_i x_i$

### Simple Linear Regression

For illustrative purposes, let’s consider the case of a problem with a single input. We can see that the above expression then represents a line with slope $\beta_1$ and intercept $\beta_0$.

$y = \beta_0 + \beta_1 x$

The task now is to find the values of $\beta_0, \beta_1$ that parameterize a line which is the best fit for our data. In order to do so we must obtain a metric which quantifies how well a given line fits the data.

Given a candidate line, we can measure the vertical distance between the line and each of our data points. These distances are called residuals. Squaring the residual for each data point and computing the sum, we get our metric.

$S = \sum_{i=1}^n (y_i - (\beta_0 + \beta_1 x_i))^2$

As one might expect, the sum of squared residuals is minimized when $\beta_0, \beta_1$ define a line that passes more or less through the middle of the data points.
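For the simple case, setting the partial derivatives of $S$ to zero yields the familiar closed-form textbook solution. The sketch below (with made-up data) computes it and verifies that perturbing either coefficient only increases $S$:

```python
# A sketch of the closed-form minimizer for simple linear regression:
# b1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2), b0 = y_bar - b1*x_bar.
# The data points are illustrative.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
b0 = y_bar - b1 * x_bar

def S(b0, b1):
    """Sum of squared residuals for a candidate line."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Perturbing either coefficient can only increase the sum of squared residuals.
assert S(b0, b1) <= S(b0 + 0.1, b1) and S(b0, b1) <= S(b0, b1 + 0.1)
```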

### Multiple Linear Regression

When we deal with multiple input variables, it becomes more convenient to express the problem using vector and matrix notation. For a dataset with $n$ rows and $p$ inputs, define $\mathbf{y}$ as a column vector of length $n$ containing the objective values, $\mathbf{X}$ as an $n \times p$ matrix where each row corresponds to a particular input instance, and $\mathbf{\beta}$ as a column vector of length $p$ containing the values of the regression coefficients. The sum of squared residuals can thus be expressed as:

$S = ||\mathbf{y - X\beta}||_2^2$

The value of $\mathbf{\beta}$ which minimizes this is given by the closed-form expression:

$\mathbf{\beta = (X^T X)^{-1} X^T y}$

The matrix inverse is the most computationally intensive portion of solving a linear regression problem. Rather than directly constructing the matrix $\mathbf{X}$ and performing the inverse, BigML’s implementation uses an orthogonal decomposition which can be incrementally updated with observed data. This allows for solving linear regression problems with datasets which are too large to fit into memory.

## Predictions

Predicting new data points with a linear regression model is just about as easy as it can get. We simply take the coefficients $\beta_0,\ldots,\beta_n$ from the model and evaluate the regression equation above to obtain a predicted value for $y$. BigML also returns two metrics that describe the quality of the prediction: the confidence interval and the prediction interval. These are illustrated in the following figure:

These two intervals carry different meanings. Depending on how the predictions are to be used, one will be more suitable than the other.

The confidence interval is the narrower of the two. It gives the 95% confidence range for the mean response. If you were to sample a large number of points at the same x-coordinate, there is a 95% probability that the mean of their y values will be within this range.

The prediction interval is the wider interval. For a single point at the given x-coordinate, its y value will be within this range with 95% probability.
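As a sketch of the two intervals for the simple-regression case, here are the standard textbook formulas in Python. Note that this is an illustration, not BigML's implementation, and the t quantile is hardcoded for brevity; a real implementation would look it up from the t distribution:

```python
# A sketch of confidence vs. prediction intervals for simple linear regression,
# using the standard textbook formulas. Data points are illustrative.
import math

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
b0 = y_bar - b1 * x_bar

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
s = math.sqrt(sse / (n - 2))       # residual standard error
t = 3.182                          # ~97.5th percentile of t with n-2=3 df

def intervals(x0):
    """Predicted value and the two 95% half-widths at x0."""
    y_hat = b0 + b1 * x0
    leverage = 1 / n + (x0 - x_bar) ** 2 / sxx
    ci = t * s * math.sqrt(leverage)        # confidence interval half-width
    pi = t * s * math.sqrt(1 + leverage)    # prediction interval half-width
    return y_hat, ci, pi

_, ci, pi = intervals(3.0)
assert pi > ci   # the prediction interval is always the wider of the two
```

The extra `1 +` under the square root is the variance of an individual observation around the mean response, which is exactly why the prediction interval is the wider one.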

## BigML Field Types and Linear Regression

In the regression equation, all of the input variables $x_n$ are numeric values. Naturally, BigML’s linear regression model also supports categorical, text, and items fields as inputs. If you have seen how our logistic regression models handle these inputs, then this will be mostly familiar, but there are a couple of important differences.

### Categorical Fields

Categorical fields are transformed to numeric values via field codings. By default, linear regression uses a dummy coding system. For a categorical field with n class values, there will be n-1 numeric predictor variables x. We designate one class value as the reference value (by default the first one in lexicographic order). Each of the predictors corresponds to one of the remaining class values, taking a value of 1 when that value appears and 0 otherwise. For example, consider a categorical field with values “Red”, “Green”, and “Blue”. Since there are 3 class values, dummy coding will produce 2 numeric predictors x1 and x2. Assuming we set the reference value to “Red”, each class value produces the following predictor values:

| Field value | x1 | x2 |
|---|---|---|
| Red | 0 | 0 |
| Green | 1 | 0 |
| Blue | 0 | 1 |

Other coding systems such as contrast coding are also supported. For more details check out the API documentation.

### Text and Items Fields

Text and items fields are treated in the same fashion. There will be one numeric predictor for each term in the tag cloud/items list. The value for each predictor is the number of times that term/item occurs in the input.

### Missing Values

If an input field contains missing values in the training data, then an additional binary-valued predictor will be created which takes a value of 1 when the field is missing and 0 otherwise. The value for all other predictors pertaining to the field will be 0 when the field is missing. For example, a numeric field with missing values will have two predictors: one for the field itself plus the missing value predictor. If the input has a missing value for this field, then its two predictors will be (0, 1); in contrast, if the field is not missing but equal to zero, then the predictors will be (0, 0).
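A sketch of this encoding scheme for a single numeric field:

```python
# A sketch of the missing-value scheme for a numeric field: one predictor for
# the value itself, plus a 0/1 indicator that fires when the field is missing.
def encode_numeric(value):
    if value is None:
        return (0, 1)   # value zeroed out, missing indicator on
    return (value, 0)   # value passed through, indicator off

assert encode_numeric(None) == (0, 1)
assert encode_numeric(0) == (0, 0)   # zero is distinguishable from missing
```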

## Wrap Up

That’s pretty much it for the nitty-gritty of multiple linear regression. Being a rather venerable machine learning tool, its internals are relatively straightforward. Nevertheless, you should find that it applies well to many real-world learning problems. Head over to the dashboard and give it a try!