
GDPR Compliance and its Impact on Machine Learning Systems

Unless you’ve been hiding under a rock, you’ve probably heard of the Cambridge Analytica scandal and Mark Zuckerberg’s statements about the worldwide changes Facebook is making in response to the European Union’s General Data Protection Regulation (GDPR). If your business is not yet in Europe, you may be taken aback by the statement from U.S. Senator Brian Schatz that “all tech platforms ought to adopt the EU approach to (data protection)”. This, despite the fact that 45% of U.S. citizens think that there is already “too much” government regulation of business and industry.

GDPR

Image source: Convert GDPR (https://www.convert.com/GDPR/).

So yes, GDPR is a big deal indeed. When it becomes the law in the European Union later this week, on May 25, 2018, it will improve data protection for EU citizens dealing with companies not only in Europe but all around the world. In other words, whether your company is based in the EU or not, as long as you have EU citizens as customers or users and you process their data, GDPR is very much relevant for your business.

There are many aspects of GDPR that cover various data processing best practices. One of the critical concepts is “Personal Data”. Personal data in GDPR is defined as anything that can be used to directly or indirectly identify an individual. The second concept you should get familiar with is “Personal Data Processing”. It is “any operation or set of operations which is performed on personal data or on sets of personal data, whether or not by automated means, such as collection, recording, organisation, structuring, storage, adaptation or alteration, retrieval, consultation, use, disclosure by transmission, dissemination or otherwise making available, alignment or combination, restriction, erasure or destruction.” And then we have the concepts of the “Controller”, which determines the purposes of the processing of personal data, and the “Processor”, which processes personal data on behalf of the Controller. Enough of definitions. We’re good to go, right? Not so fast. The following example shows how quickly things get complicated.

A few days ago, I had a conversation with a representative of one of the biggest tech companies in the world, who had presented to the audience a predictive application that explained how photos of customers are stored as they queue up for a service. After the presentation, I asked him about the effect of GDPR on the described application and he started talking about PII (Personally Identifiable Information) instead. PII is a concept from U.S. privacy laws that does not exactly overlap with the personal data definition in GDPR, and this confusion can quickly turn out to be very costly for many more companies serving EU data subjects.

While companies large and small are wrestling with the waves of change in handling user data introduced by GDPR, we’d like to also turn our attention to how those changes impact Machine Learning efforts in the coming months and years.

How BigML helps manage GDPR impact on Machine Learning systems

In order to explain the effects of GDPR on Machine Learning, let’s have a look at three important rights that GDPR grants to the owner of personal data (or the “Data Subject” in GDPR parlance): the Non-discrimination Right, the Right to Explanation, and the Right to be Forgotten. We’ll cover them in the order they appear in a typical Machine Learning workflow: starting with data wrangling and feature engineering, continuing with modeling, and finishing with model deployment and management.

Data Wrangling and Feature Engineering

The first right of the data subject is the “Non-discrimination Right”.  GDPR is quite explicit when it comes to processing of personal data revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, or trade union membership, and the processing of genetic data, biometric data for the purpose of uniquely identifying a natural person, data concerning health or data concerning a natural person’s sex life or sexual orientation.

These data points are incredibly valuable for certain legitimate use cases such as genetic research, so they must not be taken literally as a “no-go” zone. However, the data subjects at the very least need to be made aware and be given the ability to opt in to such schemes, should they choose to give their explicit consent. Regardless of opt-ins and the temptation to enrich data to further improve model accuracy, there are clear lines that shouldn’t be crossed, as expressed in the bestseller by Cathy O’Neil, “Weapons of Math Destruction”. This book contains great examples of how human bias can be inherently hidden in your data and can get reinforced in the predictive models built on top of it if you are not careful; e.g., even zip codes can sometimes result in racial discrimination.

To clarify, on the BigML platform, we make a clear distinction between personal data such as emails, credit card details, etc. and data meant for Machine Learning use. The former is required to keep providing our services without interruption and does not factor into your Machine Learning workflows. As for the latter, you can easily see, filter, and add new fields to your datasets, or plot the correlations between various fields by using the dynamic scatterplot dataset visualization capability if you suspect certain fields may be proxies for more troublesome variables you’d rather stay away from during your modeling. On the other hand, building an Association model can yield statistically significant (association) rules that point to built-in biases in your dataset. Stratified sampling techniques can also be good allies to ensure that your dataset contains a well-balanced representation of the real-life phenomenon you’re looking to model, in a way conducive to bias-free Machine Learning outcomes.
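To make that last idea concrete, here is a minimal, generic Python sketch of one stratified sampling variant (downsampling every class to the size of the smallest one); it is not a BigML API call, and the "class" field name and row structure are hypothetical.

import random
from collections import defaultdict

def balance_by_downsampling(rows, label_field="class", seed=42):
    """Downsample every class to the size of the smallest class."""
    random.seed(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_field]].append(row)
    smallest = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(random.sample(group, smallest))
    random.shuffle(balanced)
    return balanced

# Example: 700 "good" rows and 300 "bad" rows become 300 of each.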

Modeling and Predictions

The second right is the “Right to Explanation”, referring to the need for the Controller and/or Processors to provide meaningful information about the logic involved, as well as the significance and the envisaged consequences of such processing for the data subject. Authorities are still discussing how far this right should go and whether it’s necessary to explain all the minutiae related to the data transformations that drive the predictive modeling process. This becomes an even bigger problem for workflows involving multiple algorithms, some of which can be inherently difficult to interpret — think Deepnets.

By design, the BigML platform supports multiple capabilities that come to the rescue here. From a global model perspective, each supervised learning model has a model summary report explaining which data fields had more or less impact on the model as a whole. In addition, the visualizations for each ML resource allow the practitioner to better introspect and share the insights from her models.

From an individual prediction perspective, BigML supports Prediction Explanations both on the dashboard and the API.  In addition, batch predictions can be configured in a way that includes field importances per class or confidence values to augment predictions.

Prediction Explanation

An Example of Prediction Explanation on BigML.

Deployment, Retraining and Model Updates

The third right of data subjects is the “Right to be Forgotten”. This permits data subjects to have the controller erase all personal data concerning them. On the surface, this seems pretty straightforward. Just delete the corresponding account and its data records and voila! But if we think in Machine Learning workflow terms, a question arises: does this mean that data subjects have the right to demand that your predictive model gets retrained without their data? The interpretations as to where the line should be drawn can get quite tricky, which leaves enough room for experts and consultants to operate in.

BigML has been designed from the ground up with an eye towards the key principles of traceability, repeatability, immutability, and programmability.  These design traits inherently help with GDPR compliance. Take, for example, the BigML platform’s reification capability, which helps trace back any workflow and its corresponding original resources that gave rise to a particular ML resource of interest.  This yields both process transparency and ultimately traceability.

One can also fully automate one’s Machine Learning workflows and ensure easy repeatability, either with a single API call or through the execution of a corresponding WhizzML script. Why is this important? Well, because in the event that retraining is needed on a new dataset that excludes certain records, the effort required is reduced to a single action.
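As a rough illustration only, and not an official compliance recipe, the sketch below uses the BigML Python bindings to rebuild a dataset that filters out a given data subject's records and retrain a model on the result; the user_id field, the filter expression, and the dataset ID are hypothetical placeholders.

from bigml.api import BigML

api = BigML()  # reads BIGML_USERNAME and BIGML_API_KEY from the environment

original_dataset = "dataset/your-dataset-id"  # hypothetical dataset ID
erased_user = "12345"                         # hypothetical identifier of the data subject

# Create a new dataset that drops the erased subject's rows via a row filter,
# then retrain the model on it: a single, repeatable pair of calls.
filtered = api.create_dataset(
    original_dataset,
    {"lisp_filter": '(!= (f "user_id") "%s")' % erased_user})
api.ok(filtered)

model = api.create_model(filtered)
api.ok(model)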

Conclusion

We have only scratched the surface of the importance and the potential impact of GDPR on the world of Machine Learning. We hope this gives you some new ideas on how your organization can best navigate this new regulatory climate while still being able to reach your goals. As Machine Learning practitioners collectively experience the ups and downs of GDPR and learn the ropes, let BigML and its built-in qualities such as traceability, repeatability, and interpretable models be an integral part of your action plan.

OptiML Webinar Video is Here: Automatically Find the Optimal Machine Learning Model!

The latest BigML release has brought OptiML to our platform, and it is now available from the BigML Dashboard, API, and WhizzML. This new resource automates Machine Learning model optimization for all knowledge workers, further lowering the barriers to adopting Machine Learning.

OptiML is an optimization process for model selection and parametrization that automatically finds the best supervised model to help you solve classification and regression problems. OptiML creates and evaluates hundreds of supervised models (decision trees, ensembles, logistic regressions, and deepnets) with multiple configurations and finally returns a list of the best models for your data. This saves practitioners significant time in exploring hypothesis spaces by avoiding exhaustive trial-and-error experimentation with different algorithms and their parameter configurations. All these details and more are explained in the video webinar, released yesterday during the official launch and available on the BigML YouTube channel.

For further learning on OptiML, please visit our release page, where you will find:

  • The slides used during the webinar.
  • The detailed documentation to learn how to use OptiML from the BigML Dashboard and the BigML API.
  • The series of six blog posts that gradually explain OptiML.

Thanks for your support and great feedback! Feel free to reach out to the BigML Team at support@bigml.com anytime. Your suggestions and questions are always welcome!

OptiML: The Nitty Gritty

One click and you’re done, right? That’s the promise of OptiML and automated Machine Learning in general, and to some extent, the promise is kept. No longer do you have to worry about fiddly, opaque parameters of Machine Learning algorithms, or which algorithm works best. We’re going to do all of that for you, trying various things in a reasonably clever way until we’re fairly sure we’ve got something that works well for your data.

Sounds really exciting, but hold your horses. Yes, we’ve eliminated the need to optimize your ML model, but that doesn’t mean you’re out of the loop. You still have an important part to play because only you know how the model is going to be deployed in the real world. Said another way, only you know precisely what you want.

In this post, the last one of our series of posts introducing OptiML, I’m going to talk about a couple of the things you still have to worry about even if you use OptiML to find the best model. And I’m not talking about data wrangling or feature engineering, though you certainly still have to do that. I’m talking about ways that you can really make or break the process of model selection.

It’s important here to realize that these worries aren’t at all unique to OptiML. These are things you always have to worry about whenever you’re trying to choose from among the infinity of possible Machine Learning methods. What OptiML does is bring these worries front and center where they belong, rather than hiding them among lists of possible parameters.

We Have The Technology

The core technology in OptiML is Bayesian Parameter Optimization, which I’ve written about a few other times. The basic idea is simple: Since the performance of a model type with given parameters is dependent on the training data, we’ll begin by training and testing a few different models with varying parameters. Then, based on the performance of those models, we’ll learn a regression model to predict how well other sets of parameters will perform on the same data. We use that model to generate a new set of “promising” parameters to evaluate, then feed those back into our regression model. Rinse and repeat.

There’s a little bit of cleverness here in choosing the next set of promising parameters. While you want to choose parameters that are likely to work well, you also don’t want to choose parameters that are too close to what you’ve already evaluated. This trade-off between optimization and exploration motivates various “acquisition functions” that attempt to choose candidates for evaluation that are both novel and likely to perform well.
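The snippet below is a stripped-down Python illustration of that loop, not BigML's actual implementation: a random-forest regressor plays the role of the surrogate model, and a simple "predicted score plus uncertainty" rule stands in for the acquisition function; the evaluate function and the two-dimensional parameter space are toy placeholders.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def evaluate(params):
    """Placeholder: train and cross-validate a model with these parameters, return its score."""
    x, y = params
    return -((x - 0.3) ** 2 + (y - 0.7) ** 2)  # toy objective with a known optimum

rng = np.random.default_rng(0)
tried, scores = [], []

# Seed the search with a few random parameter settings.
for _ in range(5):
    params = rng.uniform(0, 1, size=2)
    tried.append(params)
    scores.append(evaluate(params))

# Iterate: fit a surrogate, score fresh candidates, evaluate the most promising one.
for _ in range(20):
    surrogate = RandomForestRegressor(n_estimators=50, random_state=0)
    surrogate.fit(np.array(tried), np.array(scores))
    candidates = rng.uniform(0, 1, size=(200, 2))
    per_tree = np.stack([tree.predict(candidates) for tree in surrogate.estimators_])
    # Acquisition: favor candidates that look good *and* uncertain (exploration).
    acquisition = per_tree.mean(axis=0) + per_tree.std(axis=0)
    best_candidate = candidates[int(np.argmax(acquisition))]
    tried.append(best_candidate)
    scores.append(evaluate(best_candidate))

print("best parameters found:", tried[int(np.argmax(scores))])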

But all of that is handled for you behind the scenes. It seems like this thing is absolutely primed to give you exactly what you want. So what could possibly go wrong?

Nothing, as long as you and the algorithm are on the same page about exactly what you want.

What Do You Really Want?

If you open up the configuration panel for OptiML, you’ll notice that one of the first choices we offer you is that of the metric to optimize. This is the metric that will drive the search above. That is, the search’s goal will be to find the best possible value for the metric you specify.

BigML_OptiML_Interface.png

For the non-experts among us, the list of metrics probably looks like word salad. Which one should you choose? To give you a rough idea, I’ve made a flowchart that should get you to a metric that suits your use case.

BigML_OptiML_Flowchart.png

The first question in that flowchart is whether or not you want to bother with setting a prediction threshold. This refers to the use of BigML’s “operating point” feature. Generally, if you can, you’ll want to do this, as it allows you finer control over the model’s behavior. For users with less technical savvy, though, it’s significantly easier to just use the default threshold. In this case, if your dataset is balanced, you can simply tell OptiML to optimize the accuracy. If the dataset is imbalanced, you might want to try optimizing the phi coefficient or the F-measure, which account for imbalanced data in different ways.
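For reference, here is a small Python sketch of how the phi coefficient and the F-measure are computed from a binary confusion matrix; the counts in the example are made up.

from math import sqrt

def phi_coefficient(tp, fp, fn, tn):
    """Matthews / phi correlation between predicted and true labels."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def f_measure(tp, fp, fn, beta=1.0):
    """F-beta score; beta > 1 weighs recall more heavily than precision."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)

# Imbalanced example: 90 negatives, 10 positives, a classifier that rarely says "positive".
print(phi_coefficient(tp=4, fp=2, fn=6, tn=88))  # ~0.48
print(f_measure(tp=4, fp=2, fn=6))               # 0.5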

If you are willing to set a threshold, then you can go a bit deeper. The first thing you can ask yourself is whether or not you have a ranking problem. This is a problem for which the individual correctness of predictions isn’t the primary goal; instead, we seek to rank a group of points so that the positive ones are, in general, ranked higher than the negative ones.

A good example of a ranking-style prediction is ranking stocks for stock picking. Typically, you’re going to have some number of stocks you’re willing to buy, which is much smaller than the total number of stocks you could buy. What matters to you in this case isn’t whether you get each and every instance right. What matters is whether or not those top few examples are likely to be positive.

A second concern, if you have a ranking problem, is whether or not the top of the ranking is more important than the rest. The stock picking example is clearly a case where it is: you care about the profitability (or lack thereof) of the stocks you pick, and are less concerned with the ones you didn’t, so the correctness of the top of the ranking, of those stocks the algorithm told you to pick, is of higher importance than the correctness of those ranked near the bottom. In this case, a metric based on the optimal threshold for the data, like Max. Phi, will typically correlate well with a model’s performance.

The opposite case is a draft-style selection, where you don’t necessarily get to pick from the top of the order. You may pick in the middle or at the end, but you always want your pick to have the highest possible chance of being correct. In this case, metrics like the ROC AUC, or one of the rank correlation metrics like the Spearman correlation, would be an appropriate choice.

Optimizing for the right metric is one way you can squeeze a little bit more performance out of your model. If a one or two percent difference isn’t that important to you, you can do perfectly fine without this step. If you’re very concerned about performance, however, or have a very particular way of measuring performance, it’s important to understand these metrics: There’s pretty good evidence that these metrics aren’t in complete agreement a significant amount of the time, so take your time and choose the right one.

Cross-Validation: You’re Doing It Wrong!

So, once we have our metric selected, how does OptiML decide which of your possible models is best? By default, we use cross-validation, as does most everyone. The idea here, if you’re unfamiliar, is that you hold out a random chunk of the data as the evaluation set and train your model on the rest. Then, just to be sure you haven’t gotten lucky (say, by putting most of the difficult-to-classify points into your training data rather than your test data), you do it a bunch of times and aggregate the results.
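A bare-bones Python sketch of that procedure (often called Monte Carlo cross-validation) might look like the following, with a hypothetical train_and_score function standing in for model building and evaluation:

import random
from statistics import mean, stdev

def train_and_score(train_rows, test_rows):
    """Placeholder: build a model on train_rows and return its score on test_rows."""
    return random.random()  # stand-in for a real evaluation metric

def monte_carlo_cv(rows, n_repeats=10, test_fraction=0.2, seed=42):
    """Repeated random train/test splits, aggregated into a mean score and its spread."""
    random.seed(seed)
    scores = []
    for _ in range(n_repeats):
        shuffled = rows[:]
        random.shuffle(shuffled)
        cut = int(len(shuffled) * test_fraction)
        test, train = shuffled[:cut], shuffled[cut:]
        scores.append(train_and_score(train, test))
    return mean(scores), stdev(scores)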

It’s a simple idea that tends to work well in practice… until it goes horribly wrong in the worst possible way, which is why we offer the option for a custom holdout set.

BigML_OptiML_Cross_Validate.png

Within cross-validation lies a brutal trap for the unwary: If your training instances contain information about each other that will be unavailable in a production setting, cross-validation will give results that are anywhere from “optimistic” to “wildly optimistic”.

What do I mean by “information about each other”? One place this happens a lot is with data that’s distributed in time, where adjacent points share a class. Consider stock market prediction: suppose you want to predict, for each minute of the day, whether a stock is going to be higher or lower at the end of the day, based on, say, the trailing ten minutes of data (volume, price, and so on). First note that it’s very likely that adjacent points share a class (if the market’s close is higher than its level at 10:30, that is probably also true at 10:31 and 10:32). Note also that these adjacent minutes share a lot of history; most of their trailing 10 minutes overlap.

What does all of this add up to? You’ve got points that are near-duplicates of one another in your dataset. If you take a random chunk of data as test data, it’s likely that you have near-duplicates for all of those points in your training data. Your model, having seen essentially the right answers, will perform very well, and cross-validation will tell you so. And it will be disastrously wrong, because on days in the future, where you don’t have answers from nearby instances, the classifier will fail completely. Said another way, cross-validation gives you results for predictions on days that you see in training. In the real world, your model will not have the benefit of seeing points from the test day in the training data.

Lest you think this is a rare case, consider trying to predict a user’s next action in some sort of UI context. You might have 10,000 actions for training, but only a couple dozen users. Users tend to do the same thing in the same way over and over again, so for every action in your training data, there are probably several near-duplicates in your test data. The model will get very good at predicting the behavior of your training users, and cross-validation will again tell you so. But if you think that performance is going to generalize to a novel user, you’re very much mistaken.

It’s a problem that’s probably more common than you think, and the root cause is again Machine Learning giving you just what you asked for. In this case, you’re asking, “if I knew the correct classes for a random sample of most of the points in the dataset, how well could I predict the rest”. And cross-validation answers “Very well!” and gives you a fist bump. Many times, the answer to that question is a good proxy for the answer to the real question, which is, “how well will my model predict on data that is not part of the dataset?” But for certain cases like the ones above, it most certainly is not.

The solution isn’t difficult – you just construct a holdout set that measures what you want, rather than what cross-validation is telling you. If it’s how a model trained on past stock prices will predict future stock prices, you hold out the most recent data for testing and train on the past. If you want to know how well your UI model will generalize to novel users, you hold out data from some users (all of their data) and train on the remaining users. Once you’ve constructed the appropriate holdout dataset, you pass it to OptiML and get your answer.
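A minimal Python sketch of that kind of split, assuming each row carries a hypothetical user_id field: instead of sampling rows at random, you hold out whole users so that none of their rows leak into training. The resulting holdout can then be uploaded as a dataset and passed to OptiML as the custom holdout set mentioned above.

import random

def split_by_group(rows, group_field="user_id", holdout_fraction=0.2, seed=42):
    """Hold out entire groups (users, days, stores...) rather than random rows."""
    random.seed(seed)
    groups = sorted({row[group_field] for row in rows})
    random.shuffle(groups)
    n_holdout = max(1, int(len(groups) * holdout_fraction))
    holdout_groups = set(groups[:n_holdout])
    train = [row for row in rows if row[group_field] not in holdout_groups]
    test = [row for row in rows if row[group_field] in holdout_groups]
    return train, test

# Every row belonging to a held-out user lands in the test set, mimicking a truly novel user.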

Sometimes it can be difficult to know for sure if your problem falls into this category. But if your problem has a character where the data comes in “bins” of possibly correlated instances, like days, users, retail locations, servers, cities, etc., it never hurts to just try a test where you hold out some bins and test on others. If you see results that are worse than naive cross-validation, you should be very suspicious.

The Big Picture

Automated Machine Learning doesn’t know how you’re going to deploy your model in the real world. Unless you tell it differently, the best it can do is make assumptions that are true in many cases and hope for the best. Ensuring the model is optimized for use in the real world and that you have a reasonable estimate of its performance therein is always part of the due diligence you have to perform when engineering a Machine Learning solution. OptiML allows you to focus on these choices – the parts of your ML problem outside of the actual data – and leaves model optimization to us.

And remember, whether or not you use OptiML, BigML, or any other ML software tool, the choice of metric and manner of evaluation are important issues that you ignore at your own peril! The more we can push these “common sense rules” about Machine Learning into the general discourse about the subject, the closer we get to a world where anyone can use Machine Learning.

Want to know more about OptiML?

If you have any questions or you would like to learn more about how OptiML works, please visit the release page. It includes a series of blog posts, the BigML Dashboard and API documentation, the webinar slideshow as well as the full webinar recording.

Finding your Optimal Models Automatically with WhizzML and OptiML


This blog post, the fifth of our series of posts about OptiML, focuses on how to programmatically use this resource with WhizzML, BigML’s Domain Specific Language for Machine Learning workflow automation. To refresh your memory, WhizzML allows you to execute complex tasks that are computed completely on the server side with built-in parallelization.

BigML Resource Family Grows

Our posts so far have been exclusively focusing on OptiML since it’s the newest resource we are presenting on the BigML Dashboard. The theme of E Pluribus Unum (“out of many, one”) best explains the rationale behind these new resources that make use of multiple algorithms to converge on a best fitting model or ensemble for better results. All three resources will be accessible programmatically come May 16.

OptiML identifies the best performing model for each classification or regression problem. Now that we’ve covered this brief introduction, let’s see how to work with the new resource using WhizzML.

Creating an OptiML

To start creating an OptiML via WhizzML, we begin with an existing dataset that will be split to train and evaluate more than a hundred models, including decision trees (aka models), ensembles, logistic regressions and deepnets. So our WhizzML code will need to include the dataset ID we want to use for training as shown below:

;; creates an OptiML with default settings
(define my-optiml 
  (create-optiml {"dataset" my-dataset}))

As we commented in the previous post, the BigML API is mostly asynchronous, meaning the execution will return the OptiML ID before its creation is completed. This implies that the model exploration process will continue after the code snippet is executed. You can use the directive “create-and-wait-optiml” to make sure that the exploration process has finished:

;; creates an OptiML with default parameters. Once it's
;; completed the ID is stored in my-optiml variable
(define my-optiml 
  (create-and-wait-optiml {
    "dataset" my-dataset
    }))

Given that different use cases will require different properties, BigML provides several parameters to fine-tune the model exploration process (more on this will be available in the OptiML documentation to be released on May 16). Here, we will configure an OptiML via WhizzML to set the metric and the class to optimize by using <property_name> and <property_value> pairs. Let’s see how to create an OptiML that optimizes the classifier search according to the area under the ROC curve (the AUC metric), with “Mexico” as the class to optimize. Here is the straightforward code for that example case:

;; creates an OptiML setting parameters. Once it's
;; completed the ID is stored in my-optiml variable
(define my-optiml 
  (create-and-wait-optiml {
    "dataset" my-dataset
    "metric" "area_under_roc_curve"
    "objective_field" "00000d"
    "metric_class" "Mexico"
    }))

Once the model exploration process is complete and we have created an OptiML, we can easily retrieve the first model (the top-performing one) to get predictions from it as follows:

;; retrieves the first model from an OptiML
(define best-model (head (get (fetch my-optiml) "models")))

Want to know more about OptiML?

If you have any questions or you would like to learn more about how OptiML works, please visit the release page. It includes a series of blog posts, the BigML Dashboard and API documentation, the webinar slideshow as well as the full webinar recording.

2ML Underscores the Need to Adopt Machine Learning in all Businesses and Organizations

More than 300 attendees and 20 international speakers took part in the second edition of 2ML: Madrid Machine Learning. #2ML18 gathered mostly decision makers and Machine Learning practitioners coming from the US, Canada, China, Uruguay, Holland, Austria, and of course Spain, as the event was held in Madrid.

The opening remarks were given by Luis Martín, CEO at Barrabés.biz, who welcomed the audience to their new and impressive co-working space. After him, the two keynotes of the event provided the base of 2ML, analyzing the basic concepts of Machine Learning, both from a technical and business perspective.

The technical concepts came from the BigML CEO, Francisco Martín. With him, we learned what Machine Learning is, the basic ML workflow, the kinds of algorithms, how ML is used to make predictions, and how to make better decisions in several domains. In other words, the complete ML process end-to-end. Explaining the differences between AI, ML, and Deep Learning, among many other concepts, helped the audience understand the meaning of Machine Learning from a technical perspective. The second keynote was delivered by Ed Fernández, Co-Founder and partner at Naiss.io, who focused on the state of the art in Machine Learning from a business perspective. Ed Fernández presented the impact of ML on Venture Capital, the M&A trends in ML and AI, the enterprise AI scene, and finally the ML market adoption trends, evolution, and platformization of Machine Learning in the Enterprise.

 

After we established the basis of this 2-day event with the basic concepts of Machine Learning, we could delve into a multitude of real-world applications of ML. This block of examples took up most of the agenda, as we believe it’s key to see how ML is already being applied in many organizations. The first example, on how ML is used for entertainment, was presented by BigML’s CIO, Poul Petersen, who showed how BigML Deepnets predicted the Oscar winners, when we got 6 out of 6 right. Secondly, we heard Jose Ángel Alonso Cuerdo, Director at KPMG Data Analytics & AI, explain how they design sporting calendars using Machine Learning for major sports leagues such as the NBA, the Australian Football League, the South Eastern Athletic Conference, and the Atlantic Coast Conference, among others.

We also saw an interesting example of how Machine Learning is being used to help lawyers get the NDA out of their way, presented by Arnoud Engelfriet, Founder at JuriBlox B.V. This topic was in fact one of the most popular talks, getting the attention of many attendees, especially during the first Q&A session. The last talk of the morning was a joint session provided by Jordi Palau, Supply Chain Director at Celsa Group, and Joel Montoy, Director at Aquiles Solutions. Both Jordi and Joel presented how Celsa Group together with Aquiles Solutions have been optimizing all the steps of the End-to-End Supply Chain process, where they plan, source, manufacture, and deliver steel.


After the lunch break, we learned how Machine Learning is used in emerging markets. David del Ser, Practice Director at Bankable Frontier Associates, presented the key role that ML plays in financial inclusion for people who have no bank accounts, get no bank support for lack of information, and cannot prove what they earn; they do, however, have access to smartphones, which completely changes their situation when it comes to improving their businesses and lifestyles.

Another example of ML for social good was presented by Thor Muller, CIO at Off Grid Electric, the African startup that offers electric solutions using solar energy in Rwanda and Tanzania and uses Machine Learning to predict whether their clients will churn at the end of the cycle.

To conclude this block, we found out how Frogtek applies ML to help “base-of-the-pyramid” Mexican micro-retailers better control and grow their businesses, gaining in operational efficiency. Guillermo Caudevilla, Chief Technology Officer at Frogtek, explained how the retailers register every transaction that takes place in their shops, getting easy access to metrics and value-added services fueled by their own and other shopkeepers’ data. All this transactional data is also aggregated and fed into a business intelligence and marketing analytics system that Consumer Packaged Goods companies rely on for better visibility into a traditionally opaque sector.

 

After the second coffee break of the day, we continued with real-world use cases applied in Marketing and Human Resources. For the former, Seamus Abshere, CTO at Faraday.io, explained how they take customer data and combine it with a proprietary national database and Machine Learning templates to help other companies acquire, upsell, and retain more customers. This journey on how to make Machine Learning work for B2C revenue optimization was appealing to most attendees, as it is an interest shared by many companies.

The last two talks of the first day at 2ML were devoted to Human Resources. Firstly, David J. Marcus, Sr. VP of Special Projects at PandoLogic, shared how Machine Learning is revolutionizing the traditional recruitment procedures, which mainly used professional recruitment firms and advertisements in newspapers. Now, at PandoLogic, ML optimizes recruitment campaign spending in real-time by utilizing over 10 years’ worth of historical job performance data containing nearly 200 billion data attributes. The models work by establishing real-time predictive-performance benchmarks that drive when, where, and how each employer’s job is dynamically campaigned online.

The last speaker of the day was Patrick Coolen, Manager HR analytics at ABN AMRO, who shared the journey of the ABN AMRO analytics team in the past four years and how a big organization like this bank uses Machine Learning to discover interesting insights about their employees. This talk covered why all companies should do analytics in HR in the first place, how to convince senior management to apply ML techniques, how to set up an HR analytics function, and finally, how ABN AMRO uses ML in HR, providing actual examples as well as the practical takeaways of the 10 golden rules of HR analytics.

 

The second day of the 2ML event, May 9, started with a review of the main concepts shared during the first day. Santiago Márquez, CTO at Barrabés.biz, presented the talk to refresh the basic concepts. Then, we continued with more real-world applications of Machine Learning, this time in the finance, investments, and telecom sectors. Jorge Pascual, CEO at Anfix, gave a detailed talk about how the accounting industry must be reinvented, as by 2020 more than 80% of traditional financial services will be delivered by cross-functional teams that include Machine Learning. Instead of fearing automation, Jorge Pascual focused on the positive side, where ML will make accountants more efficient and productive.

Following the same path, Arturo Moreno, CEO at PreSeries, presented examples of how Machine Learning upgrades technology financing by enabling data-powered processes, emphasizing how early-stage investment decisions can be better made with data. For instance, a growing number of investors are experimenting with data-driven strategies for early-stage investing, and the names of Social Capital, EQT, GV, or InReach Ventures, to name a few, are already showing results. Here is where PreSeries marks an important inflection point, since it allows all investors to leverage the benefits that data and ML represent for the generation of insights. PreSeries believes that a data-centric culture at investing organizations will not only bring faster and better investment decisions, but will also allow investors to be helpful to startups in a much more productive manner, thanks to the insights that the analysis of their data will bring.

The last two real-world use cases shown at #2ML18 were about ML used in Blockchain technology and in Telecom. For the former, Santiago Márquez, CTO at Barrabés.biz, once again introduced the synergies between Blockchain and Machine Learning, how to apply ML to Blockchain, and the current status of this approach. The latter, provided by Francisco Martín Pignatelli, Group Head of Radio Product at Vodafone, presented the big challenge that traditional telecom systems are experiencing and cannot manage: the growth of the data that telecom customers demand and generate. Pignatelli showcased why Machine Learning is the best option to address this challenge, since it allows networks to be predictive as opposed to reactive, which changes how technology has worked in Radio for the last 25 years.

 

The last, but by no means least, part of the event focused on the importance of adopting Machine Learning in all organizations, across the entire corporate structure.

Francis Cepero, Head of Vertical Market Solutions at A1 Digital, showed in a very interactive manner why MLaaS platforms are crucial to accelerating corporate Machine Learning training programs for data analysts and other professionals relevant to decision making around data-centric business models. The whole audience had the chance to get to know each other by actively discussing the specific challenges that Francis Cepero posed throughout his presentation. This topic was concluded by Luis Martín, CEO at Barrabés.biz, who completed Francis’ point of view by providing the right steps that any company should follow to adopt ML, going from ideas to clear results.

The second edition of 2ML concluded with a practical Machine Learning workshop given by BigML’s CIO Poul Petersen, to put into practice the basic concepts learned during this two-day event. Poul Petersen showcased basic Machine Learning workflows and techniques that make ML easier than ever with MLaaS platforms like BigML. To see more details about the event, please check out a few photos of this complete and fun event here! There will be many more photos shared shortly, as well as the presentations shared by the speakers. Stay tuned!

Finding your Optimal Model Automatically with the BigML API and OptiML

In this post, the fourth of our 6 blog posts focused on optimizing Machine Learning automatically, we will explore how to use OptiML with the BigML API. So far we have covered an introduction to OptiML, seen a use case, and walked through the BigML Dashboard step-by-step. Here we will shift the focus to our REST API.

optiml-workflow

OptiML can be applied to both classification and regression problems. Because it’s an entirely automated process, it requires very few parameters overall. With regards to programmatic control, options are mostly constrained to limiting the extent of the search: the total time, the total number of models evaluated, the types of algorithms considered, and the performance metric. Longtime readers will notice that this post is similar in structure to other release tutorials, due to the overall standardization of resource creation and execution using the BigML API.

Authentication

Before using the API, you must set up your environment variables. You should set BIGML_USERNAME, BIGML_API_KEY, and BIGML_AUTH in your .bash_profile. BIGML_USERNAME is just your username. Your BIGML_API_KEY can be found on the Dashboard by clicking on your username to pull up the account page, and then clicking on ‘API Key’. Finally, BIGML_AUTH is simply the combination of these elements.

export BIGML_USERNAME=my_name
export BIGML_API_KEY=123456789
export BIGML_AUTH="username=$BIGML_USERNAME;api_key=$BIGML_API_KEY;"

Analogous to the Dashboard process, the first step is uploading a data source to be processed. You can point to remote sources or upload files locally, using a range of different file formats. From the terminal, the curl command below uploads the file “loans.csv”, which was used in our previous OptiML blog post.

curl "https://bigml.io/source?$BIGML_AUTH" -F file=@loans.csv

Creating a Dataset

A BigML dataset is a separate resource and is a serialized form of your data. In the Dashboard, it is displayed with some simple summary statistics and is the resource consumed by Machine Learning algorithms. To create a dataset from your uploaded data via the API, you can use the following command, which specifies the source used to generate the dataset.

curl "https://bigml.io/dataset?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"source": "source/5af59c8692527328b40007ed"}'

Creating an OptiML

OptiML automates the entire process of model selection and parameterization end-to-end for classification and regression problems. This automation accelerates the process to improve model performance, and thus makes sophisticated workflows accessible to non-experts. In order to create an OptiML, all you need is the dataset ID.

curl "https://horizon.bigml.io/optiml?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"dataset": "dataset/5af59f9cc7736e6b33005697"}'

Once the process of creating an OptiML is complete, you can retrieve all of the models that have been created, along with their corresponding performance metrics, whether they are logistic regressions, models (decision trees), deepnets, or ensembles. Because an OptiML might be composed of hundreds or even thousands of fields, it is possible to specify that only a subset of fields needs to be retrieved.

curl "https://tropo.dev.bigml.io/andromeda/optiml/5af5a712b95b397877000372?$BIGML_AUTH"

From the list of optimal models returned by OptiML, you can continue your Machine Learning workflow in whichever direction is most applicable. In the example below, we select the best performing model overall, in this case, a logistic regression, and perform an evaluation with our original dataset. Just as easily, we could choose to run batch predictions on new data, or consider other more complicated workflows with the optimized models.

curl "https://tropo.dev.bigml.io/evaluation?$BIGML_AUTH" \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"dataset": "dataset/5af5a69cb95b39787700036f",
       "logisticregression": "logisticregression/5af5af5db95b3978820001e0"}'

Want to know more about OptiML?

If you have any questions or you would like to learn more about how OptiML works, please visit the release page. It includes a series of blog posts, the BigML Dashboard and API documentation, the webinar slideshow as well as the full webinar recording.

Finding your Optimal Model Automatically with Zero Lines of Code

The BigML Team has been working hard to bring OptiML to the platform, which will be available on May 16, 2018. As explained in our previous post, OptiML is an automatic optimization process for model selection and parametrization (or hyper-parametrization) to solve classification and regression problems. Selecting the right algorithm and its optimum parameter values is a manual and time-consuming task for any Machine Learning practitioner. This iterative process is currently based on trial and error (creating and evaluating different models to find the best one) and it requires a high level of expertise and intuition. OptiML accelerates the process of model search and parameter tuning, allowing non-experts to build top-performing models.

In this post, we will take you through the four necessary steps to find the top-performing model for your data using OptiML with the BigML Dashboard. We will use the Loan risk dataset, which contains data from loan applicants, to predict whether applicants will be good or bad loan customers.

optiml-workflow

1. Upload your Data

As usual, you need to start by uploading your data to your BigML account. BigML offers several ways to do so: you can drag and drop a local file, connect BigML to your cloud repository (e.g., S3 buckets), or copy and paste a URL. This will create a source in BigML.

BigML automatically identifies the field types. In this case, we have 21 different fields. In the image below you can see an excerpt of the fields for this loan risk dataset, such as the checking status, the duration of the requested loan, the credit history of the applicant, and more.

source.png

2. Create a Dataset

From your source view, use the 1-click dataset menu option to create a dataset, a structured version of your data ready to be consumed by a Machine Learning algorithm.

1-click-dataset.png

When your dataset is created, you will be able to see a summary of your field values, some basic statistics, and the field histograms to analyze your data distributions. You can see that our dataset has a total of 1,000 instances. Our objective field is the “class”, a categorical field containing two different classes that label loan customers as “good” (700 instances) or “bad” (300 instances).

dataset.png

3. Create an OptiML

In BigML, you can use the 1-click OptiML menu option (shown on the left in the image below), which will use the default parameter values, or you can manually tune the parameters using the Configure OptiML option (shown on the right in the image below).

optiml-options.png

BigML allows you to configure the following parameters for your OptiML:

  • Maximum training time: an upper bound to limit the OptiML runtime. If all the model candidates are trained faster than the maximum time set, the OptiML will finish earlier. By default, it is set to 30 minutes. However, for big datasets, this may be too short and you will need to set a longer time for the OptiML to build and evaluate more models.
  • Model candidates: the maximum number of different models (i.e., models using a unique configuration) to be trained and evaluated during the OptiML process. The default is 128, which is usually enough to find the best model, but you can set it up to 200. The top-performing half of the model candidates will be returned in the final result.
  • Models to optimize: the algorithms that you want to be optimized: decision trees, ensembles (including Boosted trees and Random Decision Forests), logistic regressions (only for classification problems), and deepnets. By default, all types of models are optimized.
  • Evaluation: the strategy to evaluate the models and select the top performing ones. By default, BigML performs Monte Carlo cross-validation. Cross-validation evaluations usually yield more accurate results than single evaluations since they avoid the potential error derived from randomly selecting a too optimistic test dataset. Alternatively, you can select a specific test dataset if you need to optimize your models that way. To avoid unrealistically high performing evaluations due to the lack of cross-validation, BigML takes several subsets of the training data to build the same models and evaluates them using the test dataset.
  • Optimization metric and the Positive class: the optimization metric is used for model selection during the optimization process. By default, BigML uses R squared for regression problems and the maximum phi coefficient for classification problems. However, you can also select other metrics such as the accuracy, the ROC AUC, or the F measure. (All these metrics are explained in detail in the evaluations chapter of the BigML Dashboard documentation.) For classification problems, you can also select a positive class to be optimized; otherwise, the average metric for all classes will be optimized.
  • Sampling: you can specify a subset of the instances of your dataset to create the OptiML.

configuration-panel.png

Analyze the OptiML Results

While your OptiML is being created, you will be able to observe a set of metrics to track the progress. Apart from the typical progress bar that you can find for all BigML resources, you can also see the elapsed time (which should not exceed the configured maximum training time), the number of models evaluated, the total resources created (taking into account models, datasets, and evaluations), the total data size processed, and the scores of the last models evaluated.

in-progress.png

Once your OptiML is created, you can visualize the results in the OptiML view which is mainly composed of a chart and a table. This view allows you to compare and select the models that better suit your needs.

By looking at the chart, you can see the models ranked (from left to right) by the optimization metric score. If you mouse over the bars as shown in the image below, you will be able to see the model score +/- the standard deviation (calculated by using the different evaluations from the cross-validation) and the relevant model characteristics. Clicking on each bar redirects you to the individual model view.

bar-chart.png

We can see below that this OptiML execution selected 8 decision trees, 11 ensembles, 20 logistic regressions, and 2 deepnets as the best models. In this case, the top-performing model is a deepnet with an f-measure of 0.67832, but the difference in performance compared to the next ensemble (with an f-measure of 0.67558) is not significant. If you look at the standard deviation, which indicates the potential variation of the f-measure depending on the random split of the dataset used to train and evaluate the model, it is 0.02569 for our top model. This means that the f-measure for this model can take values from (0.67832-0.02569) = 0.65263 to (0.67832+0.02569) = 0.70401. Therefore, in this case you may prefer to select the second or third models in the list, which are ensembles rather than deepnets, because they are easier to interpret and faster to train.

table.png

You can select multiple models from the table (up to 20 for classification) and click the button “Compare evaluations” (see above) to compare them in the BigML evaluation comparison chart (see below). The ROC curve along with other evaluation measures (precision-recall, gain and lift curves) are also plotted in a chart so you can easily make comparisons and settle on the model of your choice.

compare-optiml.png

Each of the OptiML models can be found in your Dashboard listings under the OptiML tab as seen below. This is important to keep your Dashboard organized and to prevent mixing dozens of automatically created models with your manually configured models outside of OptiML. Keep in mind that the evaluations and the datasets created during the cross-validation phase are not listed in the Dashboard, but you can easily access them from the OptiML view.

optiml-list.png

4. Making Predictions from your Models

Comparing and analyzing several models helps you decide which is the best model for your particular use case. Once you have selected the model, you can start making predictions with it.

Single predictions

To make predictions for a single instance, simply click on the Predict option from the OptiML view (see below) or from the model view.

predict-optiml.png

A form containing all your input fields will be displayed and you will be able to set the values for a new instance. At the top of the view, you will also get all the objective class probabilities for each prediction. Remember that you can always ask for the prediction explanation, a feature recently added to BigML that provides more context and transparency to the underlying logic of the selected algorithm as applicable to a given prediction.

Batch predictions

If you want to make predictions for multiple instances at the same time, click on the Batch Prediction option and select the dataset containing the instances for which you want to know the objective field value.

predict-batch-optiml.png

You can configure several parameters for your batch prediction such as the option to include all class probabilities in the output dataset and file. When your batch prediction finishes, you will be able to download the CSV file and view the output dataset.

Want to know more about OptiML?

If you have any questions or you would like to learn more about how OptiML works, please visit the release page. It includes a series of blog posts, the BigML Dashboard and API documentation, the webinar slideshow as well as the full webinar recording.

Case Study: Automatically Training a Classifier with OptiML

This blog post, the second in a series of 6 posts exploring OptiML, the new feature for automatic model optimization on BigML, focuses on a real-world use case within the healthcare industry: medical appointment “no shows”. We will demonstrate how OptiML uses Bayesian parameter optimization to search for the best performing model for your data. The status of the search is continually updated in the BigML Dashboard, and the process yields a list of models ranked by performance, which enables further exploration, evaluation, and prediction tasks.

The Dataset

In terms of healthcare expenses, “no show” appointments represent a major cost, estimated at over $150 billion per year for hospitals. A “no show” is when all the necessary information for a medical appointment has been delivered, yet the patient fails to arrive at the scheduled appointment. This is distinct from events like cancellations, which require intervention and rescheduling but impose nowhere near the same financial burden.

dataset

Here we are using a verified dataset from Kaggle, which has documented over 100,000 examples of scheduled medical appointments, each consisting of 15 variables (fields). In this dataset, 22,319 instances, or ~20% of the total, are labeled as “no shows”. The fields describing each instance consist of simple descriptive information about the patient (e.g., age, gender), health characteristics (e.g., hypertension, diabetes), geographical information (e.g., neighborhood), communication information (e.g. SMS text), scheduling, and appointment date and time information.

Automatic Model Optimization

By design, the OptiML configuration consists of relatively few options. Selecting the objective field from the dataset is essential; in our use case, this is whether or not a patient is a “no show” for their visit. The other main parameters control how extensively the OptiML search will run by setting the maximum training time and the number of evaluations. The advanced configurations allow you to choose which types of Machine Learning algorithms will be evaluated among models (decision trees), ensembles, logistic regressions, and deepnets, the evaluation approach (the default being cross-validation), and the desired optimization metrics.

optiml-params.png

After choosing OptiML from the supervised drop-down menu, training the model is as simple as one click! To monitor the progress of the models that are being trained and evaluated, the Dashboard displays the elapsed time, a running series of F-measures indicating evaluations, and a counter for the number of resources created.

in-progress

Once the OptiML process is complete, a summary view displays the total number of resources and models created and selected, along with the total elapsed time and amount of data processed. The selected models are the best-performing ones, according to the evaluation metric originally chosen for the analysis. Because many of the evaluations are part of a cross-validation process, it is expected for the number of selected models to be substantially lower than the total number of models evaluated.

Assessing Classification Model Performance

In our example, we can sort our best performing models by any performance metric we would like. We can also view two metrics simultaneously, such as precision and recall, if we would like to consider different applications or error tolerances. Here we are sorting by ROC AUC, and also viewing the corresponding accuracy for each of the selected models. We can see that our top performing model is a deepnet using an auto network search, which has a ROC AUC of 0.73899 and accuracy of 58.96%.

top-models

The Dashboard also allows us to select individual models generated by OptiML and compare their evaluations. Here we display the ROC curves for the top performing deepnet, ensemble, model (decision tree), and logistic regression. With regards to AUC, only the logistic regression model (in green, AUC = 0.661) noticeably underperforms relative to the alternatives.

roc

The root cause behind the estimated 54.3 million patients that skip scheduled medical care is widely thought to be a combination of financial hardship, anxiety about long wait times, transportation difficulty, and poor medical literacy. With our current dataset, however, it is difficult to engineer features that can accurately represent all of these issues. This is a good reminder that model optimization alone will not guarantee a result with top-notch performance. Like all Machine Learning methods, OptiML’s performance will only be as good as the data it’s fed. Regardless, OptiML allows us to quickly prototype and assess maximal performance with minimal effort, allowing for more resources to be dedicated to high-yield tasks such as additional data acquisition and feature engineering.

Want to know more about OptiML?

If you have any questions or you would like to learn more about how OptiML works, please visit the release page. It includes a series of blog posts, the BigML Dashboard and API documentation, the webinar slideshow as well as the full webinar recording.

Introduction to OptiML: Automatic Model Optimization

BigML’s upcoming release on Wednesday, May 16, 2018, will bring a new resource to the platform: OptiML. In this post, we’ll do a quick introduction to OptiML before we move on to the remainder of our series of 6 blog posts (including this one) that give you a detailed perspective of what’s behind the model optimization part of the release. Today’s post explains the basic concepts, which will be followed by an example use case. Then, there will be three more blog posts focused on how to use OptiML through the BigML Dashboard, the API, and WhizzML and the Python Bindings for automation. Finally, we will complete this series of posts with a technical view of how OptiML works behind the scenes.

Understanding OptiML

At BigML, we are believers in human-in-the-loop Machine Learning and in the importance of feature engineering driven by subject matter expertise in real-life situations. As such, we have been treading carefully when it comes to ML automation, as it is very easy these days to overpromise and then deliver a solution that overfits or introduces unacceptable tradeoffs between bias and variance.

BigML already offers a variety of highly effective supervised learning algorithms, including deepnets, logistic regressions, models (decision trees), and ensembles. Thanks to our 1-click modeling capability, these can be executed with intelligent defaults to quickly form baseline models before you iterate on your project with different configuration options that may better solve your ML problem. Over time, based on popular demand, we have also made available a number of complementary WhizzML scripts that you can easily clone and execute to perform automated hyperparameter tuning or feature selection for specific algorithms such as ensembles.

We have been witnessing clear interest from our users in further automating model selection for the classification or regression problems they tackle via BigML’s built-in automation options. The drive for more productivity is not surprising. However, the issue boils down to this: is it possible to create a generalized automation approach whereby all applicable algorithms offered on the platform can be compared and contrasted with just a few clicks? The obvious benefit is the time saved in deciding which direction of the hypothesis space to explore further to find an optimal model, since you avoid exhaustive trial-and-error experimentation with different algorithms and their parameter configurations.

Well, we have some good news to share on this very front! BigML’s OptiML capability is taking the automation of model selection to the next level.

In essence, OptiML is an automatic optimization option that will allow you to find the best supervised learning model for your data.

  • It can be used for both classification and regression problems.
  • It works by automatically creating and evaluating multiple models with multiple configurations (decision trees, ensembles, logistic regressions, and deepnets), using Bayesian parameter optimization.
  • When the process finishes, you get a list of the best models so you can compare them and select the one that best suits your use case.

 

OptiML Automates Model Optimizations

The OptiML menu option on the BigML Dashboard attempts to find the best model for a given dataset by sequentially trying groups of parameters, training models with them, evaluating those models, and then trying a new group of parameters based on the results of the previous tries. In many cases, this process converges to a good solution faster than exhaustive search, since it can reason about the expected outcome of a new set of parameters before they are tried. Furthermore, the search can be parameterized by a user-specified performance metric that guides the optimization process, e.g., ROC AUC or F-measure.
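The same optimization can also be launched programmatically. Below is a minimal sketch using the BigML Python bindings; the option names (max_training_time, metric) and the field holding the selected models are assumptions based on the description above, so please check them against the OptiML API documentation on the release page. The file and resource names are hypothetical.

```python
# Minimal sketch: launching an OptiML run with the BigML Python bindings.
# "max_training_time" and "metric" are assumed option names; consult the
# OptiML API documentation for the authoritative list.
from bigml.api import BigML

api = BigML()  # reads BIGML_USERNAME and BIGML_API_KEY from the environment

source = api.create_source("patients.csv")         # hypothetical local CSV file
api.ok(source)
dataset = api.create_dataset(source)
api.ok(dataset)

optiml = api.create_optiml(dataset, {
    "name": "no-show OptiML",
    "max_training_time": 1800,                      # assumed: search budget in seconds
    "metric": "area_under_roc_curve",               # assumed: metric guiding the search
})
api.ok(optiml)                                      # waits until the optimization finishes

optiml = api.get_optiml(optiml)                     # refresh the finished resource
# The finished resource is expected to list the selected (best) model IDs.
print(optiml["object"].get("models"))
```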

OptiML can be configured to let the search try all applicable model types (deepnets, logistic regressions, models, and ensembles) or a subset of them. However, if deepnets are selected, OptiML won’t iterate over their parameters because deepnets already come with two automatic optimization options: automatic structure suggestion and automatic network search. In that case, two deepnets, one using network search and one using structure suggestion, will automatically be created as part of the model optimization.

On a related note, even though we consider them part of the supervised learning toolbox, Time Series are not included in the scope of OptiML, as time series datasets present a different type of data structure that is best treated separately from the other supervised tasks mentioned.

Finally, for completeness’ sake, in addition to finding the best supervised model among several algorithms with OptiML, we have also enabled the Automatic Optimization option for models, ensembles, logistic regressions, and deepnets separately. This means that you no longer need to manually tune any of your supervised models to achieve the best results. Instead, you can simply select the Automatic Optimization option and BigML will execute this task for your chosen algorithm only. Once complete, it will similarly return the top-performing model along with its related parameter values.

The Algorithm

The OptiML algorithm is split into two phases. The first, the “parameter search” phase, uses a single holdout set to iteratively find promising sets of parameters. The second, the “validation” phase, iteratively performs Monte Carlo cross-validation on those parameter sets that come close to the best.

For this second phase, the algorithm iteratively performs new train/test splits for the top half of the remaining candidates. Thus, the best models will typically have more than one evaluation associated with them.

Both phases are governed by an argument specifying the maximum training time allowed, so BigML halts a given phase of the algorithm when it goes over time in that phase. It does, however, guarantee that at least one iteration of each phase will complete before returning. Thus, in extreme cases, such as massive datasets coupled with very low maximum training times, the process may overrun that maximum training time significantly.
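As a concrete illustration of the validation phase, here is a generic sketch of Monte Carlo cross-validation, i.e., repeated random train/test splits whose scores are aggregated per candidate. The data and model are placeholders; this is not BigML’s internal implementation.

```python
# Generic sketch of Monte Carlo cross-validation: repeated random train/test
# splits, one score per split, aggregated at the end. This illustrates the
# idea behind the validation phase; it is not BigML's internal code.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
candidate = RandomForestClassifier(n_estimators=50, random_state=0)

scores = []
for split in range(10):                               # ten random 80/20 splits
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                              random_state=split)
    candidate.fit(X_tr, y_tr)
    probs = candidate.predict_proba(X_te)[:, 1]
    scores.append(roc_auc_score(y_te, probs))

print(f"ROC AUC: {np.mean(scores):.4f} +/- {np.std(scores):.4f} "
      f"over {len(scores)} splits")
```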

Want to know more about OptiML?

If you have any questions or you would like to learn more about how OptiML works, please visit the release page. It includes a series of blog posts, the BigML Dashboard and API documentation, the webinar slideshow as well as the full webinar recording.

BigML Release: Automatically Find the Optimal Machine Learning Model with OptiML!

BigML’s new release is here! Join us on Wednesday, May 16, 2018, at 10:00 AM PDT (Portland, Oregon. GMT -07:00) / 07:00 PM CEST (Valencia, Spain. GMT +02:00) for a FREE live webinar to discover the latest version of the BigML platform. We will be presenting OptiML, a new BigML resource that automatically finds the best-performing supervised model for your data to help you solve classification and regression problems with a single click.

BigML’s mission remains unchanged since its inception: to make Machine Learning easy and beautiful for everyone. As an important milestone in this journey, we are bringing our newest feature, OptiML, to the BigML Dashboard, API, and WhizzML.

Even after you define your Machine Learning problem, collect and pre-process data, and generate relevant features, it can still be very time-consuming and difficult to choose the best algorithm for your problem. For non-experts, this challenge becomes even more pronounced, as they have to understand and configure many parameters before they land on an optimal model (often via trial and error), with no indication of whether they should continue surveying additional options. On the other hand, advanced users also appreciate the time savings that come from applying best-practice optimization techniques to searching their hypothesis spaces when benchmarking against their own hand-fit models. With BigML, regardless of your previous Machine Learning experience, you can now automatically tune your supervised models to quickly find an optimal model that effectively solves your classification or regression problems. The new OptiML capability, listed under the supervised menu on the BigML Dashboard, enables just that with a single click!

OptiML automatically creates and evaluates multiple supervised models (decision trees, ensembles, logistic regressions, and deepnets) with different configurations by using Bayesian parameter optimization. Put simply, OptiML tries new values for groups of parameters, trains models, evaluates them, and tries a new group of parameters based on the results of the previous trials. When this dynamic process finishes, you will get a list of the models best fitting your data, so you can compare them and select the one that best suits your predictive use case. OptiML is an integration of the SMACdown algorithm into the BigML Dashboard that puts more Machine Learning in your Machine Learning.
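To give a feel for what this kind of surrogate-guided parameter search means in practice, here is a toy sketch of the general idea: score a few configurations, fit a model of score versus parameters on those trials, and repeatedly evaluate the configuration the surrogate predicts to be most promising. It is only an illustration under those assumptions, not the SMACdown implementation; the dataset, parameter grid, and surrogate choice are arbitrary placeholders.

```python
# Toy sketch of surrogate-guided parameter search: evaluate a few random
# configurations, fit a surrogate on (parameters -> score), then repeatedly
# evaluate the configuration the surrogate predicts to be best.
# This is an illustration of the general idea, not BigML's SMACdown.
import random
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

def evaluate(n_estimators, max_depth):
    """Score one configuration with 3-fold cross-validated ROC AUC."""
    clf = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth,
                                 random_state=1)
    return cross_val_score(clf, X, y, cv=3, scoring="roc_auc").mean()

space = [(n, d) for n in (10, 25, 50, 100, 200) for d in (2, 4, 8, 16, 32)]
random.seed(1)
tried = random.sample(space, 3)                       # a few random initial trials
results = [(cfg, evaluate(*cfg)) for cfg in tried]

for _ in range(5):                                    # five surrogate-guided trials
    surrogate = RandomForestRegressor(n_estimators=50, random_state=1)
    surrogate.fit(np.array([cfg for cfg, _ in results]),
                  np.array([score for _, score in results]))
    remaining = [cfg for cfg in space if cfg not in tried]
    predicted = surrogate.predict(np.array(remaining))
    best_cfg = remaining[int(np.argmax(predicted))]   # most promising untried config
    tried.append(best_cfg)
    results.append((best_cfg, evaluate(*best_cfg)))

best_cfg, best_score = max(results, key=lambda item: item[1])
print("best configuration:", best_cfg, "ROC AUC:", round(best_score, 4))
```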

For ease of use and completeness’ sake, in addition to finding the best supervised model among several algorithms with OptiML, we have also enabled the Automatic Optimization option for models, ensembles, logistic regressions, and deepnets individually. This means that you no longer need to manually tune any of your supervised models to achieve the best results. Instead, you can simply select the Automatic Optimization option and BigML will execute an automatic optimization task for your chosen algorithm. Once complete, it will return the top-performing model ready to be used.

Want to know more about OptiML?

If you have any questions or you would like to learn more about how OptiML works, please visit the release page. It includes a series of blog posts, the BigML Dashboard and API documentation, the webinar slideshow as well as the full webinar recording.
