GDPR Compliance and its Impact on Machine Learning Systems

Unless you’ve been hiding under a rock, you’ve probably heard of the Cambridge Analytica scandal and Mark Zuckerberg’s statements about the worldwide changes Facebook is making in response to the European Union’s General Data Protection Regulation (GDPR). If your business is not yet in Europe, you may be taken aback by the statement from U.S. Senator Brian Schatz that “all tech platforms ought to adopt the EU approach to (data protection)”. This comes despite the fact that 45% of U.S. citizens think that there is already “too much” government regulation of business and industry.

Image source: Convert GDPR

So yes, GDPR is a big deal indeed. When it becomes law in the European Union later this week, on May 25, 2018, it will improve data protection for EU citizens dealing with companies not only in Europe but all around the world. In other words, whether or not your company is based in the EU, as long as you have EU citizens as customers or users and you process their data, GDPR is very much relevant for your business.

GDPR covers many data processing best practices, and a few concepts are critical to understanding it. The first is “Personal Data”, which GDPR defines as any information that can be used to directly or indirectly identify an individual. The second is “Personal Data Processing”, defined as “any operation or set of operations which is performed on personal data or on sets of personal data, whether or not by automated means, such as collection, recording, organisation, structuring, storage, adaptation or alteration, retrieval, consultation, use, disclosure by transmission, dissemination or otherwise making available, alignment or combination, restriction, erasure or destruction.” Then there are the “Controller”, which determines the purposes of processing personal data, and the “Processor”, which processes personal data on behalf of the Controller. Enough definitions. We’re good to go, right? Not so fast. The following example shows how quickly things get complicated.

A few days ago, I had a conversation with a representative of one of the biggest tech companies in the world. He had just presented a predictive application and explained how photos of customers are stored as they queue up for a service. After the presentation, I asked him about the effect of GDPR on the application, and he started talking about PII (Personally Identifiable Information) instead. PII is a concept from U.S. privacy laws and does not exactly overlap with the definition of personal data in GDPR. This confusion can quickly turn out to be very costly for many more companies serving EU data subjects.

While companies large and small are wrestling with the waves of change in handling user data introduced by GDPR, we’d like to also turn our attention to how those changes impact Machine Learning efforts in the coming months and years.

How BigML helps manage GDPR impact on Machine Learning systems

To explain the effects of GDPR on Machine Learning, let’s have a look at three important rights that GDPR grants to the owner of personal data (the “Data Subject” in GDPR parlance): the Non-discrimination Right, the Right to Explanation, and the Right to be Forgotten. We’ll cover them in the order they appear in a typical Machine Learning workflow: starting with data wrangling and feature engineering, continuing with modeling, and finishing with model deployment and management.

Data Wrangling and Feature Engineering

The first right of the data subject is the “Non-discrimination Right”. GDPR is quite explicit in restricting the processing of personal data revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, or trade union membership, as well as the processing of genetic data, biometric data for the purpose of uniquely identifying a natural person, data concerning health, or data concerning a natural person’s sex life or sexual orientation.

These data points are incredibly valuable for certain legitimate use cases such as genetic research, so they should not be taken literally as a “no-go” zone. However, data subjects at the very least need to be made aware of such schemes and be able to opt in by giving their explicit consent. Regardless of opt-ins and the temptation to enrich data to further improve model accuracy, there are clear lines that shouldn’t be crossed, as expressed in Cathy O’Neil’s bestseller “Weapons of Math Destruction”. The book contains great examples of how human bias can be hidden in your data and, if you are not careful, get reinforced in the predictive models built on top of it. For example, even zip codes can sometimes result in racial discrimination.

To clarify, on the BigML platform we make a clear distinction between personal data, such as emails and credit card details, and data meant for Machine Learning use. The former is required to keep our services running without interruption and does not factor into your Machine Learning workflows. As for the latter, you can easily see, filter, and add new fields to your datasets. If you suspect certain fields may be proxies for more troublesome variables you’d rather stay away from during modeling, you can plot the correlations between fields with the dynamic scatterplot dataset visualization. In addition, building an Association model can yield statistically significant association rules that point to built-in biases in your dataset. Stratified sampling techniques can also be good allies in ensuring that your dataset contains a well-balanced representation of the real-life phenomenon you’re looking to model, in a way conducive to bias-free Machine Learning outcomes.
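The two ideas above, spotting proxy fields and balancing a dataset, can be illustrated with a minimal sketch in plain Python. It does not use BigML's scatterplot, Association models, or sampling options; the function names, the field names `zip` and `group`, and the toy rows are all hypothetical and only show the reasoning.

```python
import random
from collections import Counter, defaultdict


def proxy_strength(rows, candidate, sensitive):
    """Return (strength, baseline) as fractions of all rows.

    strength: how often the sensitive value can be guessed from the candidate
    field alone, by picking the most common sensitive value per candidate value.
    baseline: how often it can be guessed by always picking the overall majority.
    A strength far above the baseline suggests the candidate acts as a proxy.
    """
    by_candidate = defaultdict(Counter)
    for row in rows:
        by_candidate[row[candidate]][row[sensitive]] += 1
    matched = sum(max(counts.values()) for counts in by_candidate.values())
    majority = max(Counter(row[sensitive] for row in rows).values())
    return matched / len(rows), majority / len(rows)


def stratified_sample(rows, field, per_group, seed=0):
    """Draw up to `per_group` rows for each distinct value of `field`."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[field]].append(row)
    sample = []
    for value in sorted(groups):
        members = groups[value]
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample


# Hypothetical toy data in which the zip code almost reveals the group.
rows = (
    [{"zip": "11111", "group": "A"}] * 8
    + [{"zip": "11111", "group": "B"}] * 1
    + [{"zip": "22222", "group": "B"}] * 3
)

strength, baseline = proxy_strength(rows, "zip", "group")
balanced = stratified_sample(rows, "group", per_group=3)
```

In this toy dataset the zip code predicts the group for 11 of 12 rows, against 8 of 12 for the majority guess, so it would deserve a closer look before being used as a model input. The stratified sample then holds three rows from each group.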

Modeling and Predictions

The second right is the “Right to Explanation”, which refers to the need for the Controller and/or Processors to provide meaningful information about the logic involved, as well as the significance and the envisaged consequences of such processing for the data subject. Authorities are still discussing how far this right should go and whether it’s necessary to explain all the minutiae of the data transformations that drive the predictive modeling process. This becomes an even bigger problem for workflows involving multiple algorithms, some of which can be inherently difficult to interpret (think Deepnets).

By design, the BigML platform supports multiple capabilities that come to the rescue here. From a global model perspective, each supervised learning model has a model summary report explaining which data fields had more or less impact on the model as a whole. In addition, the visualizations for each ML resource allow practitioners to better introspect their models and share the insights.

From an individual prediction perspective, BigML supports Prediction Explanations both on the dashboard and through the API. In addition, batch predictions can be configured to include field importances per class or confidence values to augment the predictions.

An example of a Prediction Explanation on BigML.
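To make the idea of a per-prediction explanation concrete, here is a minimal sketch of one generic approach: reset each input field to a "typical" value and record how much the prediction moves. This is not how BigML computes its Prediction Explanations, which the article does not describe; the `explain_prediction` helper, the toy linear model, and the applicant data are all hypothetical.

```python
def explain_prediction(predict, instance, baseline):
    """Return the prediction and each field's contribution to it.

    A field's contribution is the change in the prediction when that single
    field is reset to its baseline value while all other fields stay fixed.
    """
    full = predict(instance)
    contributions = {}
    for field in instance:
        perturbed = dict(instance)
        perturbed[field] = baseline[field]
        contributions[field] = full - predict(perturbed)
    return full, contributions


# Hypothetical stand-in for a trained model: a hand-written linear score.
WEIGHTS = {"income": 0.5, "debt": -2.0, "years_employed": 1.0}


def toy_score(instance):
    return sum(WEIGHTS[name] * value for name, value in instance.items())


applicant = {"income": 40.0, "debt": 5.0, "years_employed": 3.0}
typical = {"income": 30.0, "debt": 2.0, "years_employed": 3.0}

score, why = explain_prediction(toy_score, applicant, typical)

# Fields ordered by how strongly they moved this particular prediction.
ranked = sorted(why, key=lambda name: abs(why[name]), reverse=True)
```

For this applicant, debt is the field that moved the score the most, followed by income. A ranking of this kind is the sort of information a data subject could be given about an individual decision.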

Deployment, Retraining and Model Updates

The third right of data subjects is the “Right to be Forgotten”. This permits data subjects to have the Controller erase all personal data concerning them. On the surface, this seems pretty straightforward. Just delete the corresponding account and its data records and voilà! But if we think in terms of the Machine Learning workflow, a question arises: does this mean that data subjects have the right to demand that your predictive model be retrained without their data? Interpretations of where the line should be drawn can get quite tricky, which leaves plenty of room for experts and consultants to operate in.

BigML has been designed from the ground up with an eye towards the key principles of traceability, repeatability, immutability, and programmability.  These design traits inherently help with GDPR compliance. Take, for example, the BigML platform’s reification capability, which helps trace back any workflow and its corresponding original resources that gave rise to a particular ML resource of interest.  This yields both process transparency and ultimately traceability.

You can also fully automate your Machine Learning workflows and ensure easy repeatability, either with a single API call or by executing a corresponding WhizzML script. Why is this important? Because if a model needs to be retrained on a new dataset that excludes certain records, the effort required is reduced to a single action.
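As a rough illustration of why repeatability matters for erasure requests, the sketch below wraps "drop the erased records, retrain, and note what the model was trained on" into one function. It is plain Python, not a WhizzML script or a BigML API call; `train` is a hypothetical stand-in for a real training step, and the field names `subject_id` and `target` are assumptions.

```python
import hashlib
import json


def fingerprint(rows):
    """Stable hash of a dataset so a model can be traced to its exact input."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def train(rows):
    """Hypothetical training step: a model that predicts the mean target."""
    mean = sum(row["target"] for row in rows) / len(rows)
    return {"kind": "mean_predictor", "value": mean}


def retrain_without(rows, erased_ids):
    """Drop erased subjects, retrain, and record which data produced the model."""
    kept = [row for row in rows if row["subject_id"] not in erased_ids]
    if not kept:
        raise ValueError("no records left to train on")
    model = train(kept)
    lineage = {
        "source_fingerprint": fingerprint(rows),
        "training_fingerprint": fingerprint(kept),
        "rows_removed": len(rows) - len(kept),
    }
    return model, lineage


records = [
    {"subject_id": "u1", "target": 10.0},
    {"subject_id": "u2", "target": 20.0},
    {"subject_id": "u3", "target": 30.0},
]

# One call handles the erasure request for subject "u2".
model, lineage = retrain_without(records, erased_ids={"u2"})
```

The lineage record stores only hashes and a row count, so it documents what changed between the two training sets without retaining the erased subject's data.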


We have only scratched the surface of the importance and potential impact of GDPR on the world of Machine Learning. We hope this gives you some new thoughts on how your organization can best navigate this new regulatory climate while still being able to reach its goals. As Machine Learning practitioners collectively experience the ups and downs of GDPR and learn the ropes, let BigML and its built-in qualities, such as traceability, repeatability, and interpretable models, be an integral part of your action plan.

